In a nutshell, each File Handler performs tasks on a specific type of file. For example, I created an OpenOfficeHandler that works on files from Office 2007 and later (Word, PowerPoint, Excel). I also created an ArchiveFileHandler that is responsible for traversing archives such as ZIP, RAR, TAR and other archive types. Each handler is responsible for a number of tasks, such as:
- Extracting file-level metadata
- Extracting and importing children or embedded files (recursively)
- Extracting text
- Other tasks specific to the type of file
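To make the idea concrete, here is a minimal Python sketch of the pattern. It is an illustration only, not the actual implementation: the `FileHandler` base class, its method names, `PlainTextHandler`, and the `process` function are all hypothetical, handlers are picked by file extension for simplicity, and the archive handler here covers only ZIP.

```python
import io
import zipfile
from abc import ABC, abstractmethod


class FileHandler(ABC):
    """Hypothetical base class: one handler per family of file types."""

    @abstractmethod
    def can_handle(self, name: str) -> bool:
        """Return True if this handler knows how to process the file."""

    @abstractmethod
    def extract_metadata(self, name: str, data: bytes) -> dict:
        """Return file-level metadata."""

    @abstractmethod
    def extract_children(self, name: str, data: bytes) -> list:
        """Return (name, bytes) pairs for embedded or contained files."""

    @abstractmethod
    def extract_text(self, name: str, data: bytes) -> str:
        """Return the text content of the file itself."""


class ArchiveFileHandler(FileHandler):
    """Sketch of an archive handler; only ZIP is covered here."""

    def can_handle(self, name):
        return name.lower().endswith(".zip")

    def extract_metadata(self, name, data):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return {"name": name, "size": len(data), "entries": len(zf.namelist())}

    def extract_children(self, name, data):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # Skip directory entries; each remaining member is a child file.
            return [(i.filename, zf.read(i)) for i in zf.infolist() if not i.is_dir()]

    def extract_text(self, name, data):
        return ""  # an archive has no text of its own


class PlainTextHandler(FileHandler):
    """Hypothetical handler for .txt files, used as a leaf type."""

    def can_handle(self, name):
        return name.lower().endswith(".txt")

    def extract_metadata(self, name, data):
        return {"name": name, "size": len(data)}

    def extract_children(self, name, data):
        return []  # plain text has no embedded files

    def extract_text(self, name, data):
        return data.decode("utf-8", errors="replace")


def process(name, data, handlers):
    """Run the first matching handler, then recurse into any children."""
    handler = next((h for h in handlers if h.can_handle(name)), None)
    if handler is None:
        return []  # unknown file type: nothing to do in this sketch
    records = [{
        "metadata": handler.extract_metadata(name, data),
        "text": handler.extract_text(name, data),
    }]
    for child_name, child_data in handler.extract_children(name, data):
        records.extend(process(child_name, child_data, handlers))
    return records
```

Calling `process("outer.zip", data, [ArchiveFileHandler(), PlainTextHandler()])` returns one record for the archive and one for each file found inside it, including files in nested archives. A real version would probably detect types by content rather than by extension, since Office 2007 documents are themselves ZIP containers.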
By the way, one of the hardest parts of writing this code is learning how the file types are stored on disk and how to go after the bits of interest. I had to laugh when I found the Microsoft spec on how Office Open XML files are stored. The document was almost 6000 pages long. I realize Microsoft is the king of bloat, but 6000 pages? Really?