Thursday, December 29, 2011

File Types - Got Some Of The Easy Ones Done

This has been a busy week for me. Christmas was awesome, and spending time with family and friends was even better. Still, I spent a fair amount of time working on my eDiscovery platform. I’ve been focused on individual file types this week and have knocked quite a few off my list. I think I mentioned in a previous post that I was planning to create an “add-in” architecture for my new File Handlers.

In a nutshell, each File Handler performs tasks on a specific type of file. For example, I created an OpenOfficeHandler that works on files from Office 2007 and later (Word, PowerPoint, Excel). I also created an ArchiveFileHandler that is responsible for traversing archives such as ZIP, RAR, TAR and other archive types. Each File Handler is responsible for a number of tasks, such as:

  • Extracting file-level metadata
  • Extracting and importing children or embedded files (recursively)
  • Extracting text
  • And more specific to the type of file
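
To make the add-in idea concrete, here is a minimal sketch in Python of how handlers like these could be registered and called. It is not the platform's actual code: the names FileHandler, register, HANDLERS, PlainTextHandler, ZipArchiveHandler and process are all made up for illustration, and looking handlers up by file extension is an assumption. Only ZIP is covered because it is the one archive format the standard library reads directly.

```python
from abc import ABC, abstractmethod
import io
import os
import zipfile


class FileHandler(ABC):
    """Hypothetical base class: one handler per family of file types."""

    extensions = ()  # file extensions this handler claims

    @abstractmethod
    def extract_metadata(self, data: bytes) -> dict:
        """Return file-level metadata."""

    @abstractmethod
    def extract_text(self, data: bytes) -> str:
        """Return the text of the file itself."""

    def extract_children(self, data: bytes) -> list:
        """Return (name, bytes) pairs for embedded files; none by default."""
        return []


HANDLERS = {}  # extension -> handler instance (hypothetical registry)


def register(handler_cls):
    """Class decorator that maps each claimed extension to one instance."""
    instance = handler_cls()
    for ext in handler_cls.extensions:
        HANDLERS[ext.lower()] = instance
    return handler_cls


@register
class PlainTextHandler(FileHandler):
    extensions = (".txt",)

    def extract_metadata(self, data):
        return {"size": len(data)}

    def extract_text(self, data):
        return data.decode("utf-8", errors="replace")


@register
class ZipArchiveHandler(FileHandler):
    extensions = (".zip",)

    def extract_metadata(self, data):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return {"size": len(data), "entries": len(zf.namelist())}

    def extract_text(self, data):
        return ""  # an archive has no text of its own

    def extract_children(self, data):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return [(info.filename, zf.read(info))
                    for info in zf.infolist() if not info.is_dir()]


def process(name, data, depth=0, max_depth=10):
    """Run the matching handler, then recurse into any children it yields."""
    handler = HANDLERS.get(os.path.splitext(name)[1].lower())
    if handler is None or depth > max_depth:
        return []
    results = [(name, handler.extract_metadata(data), handler.extract_text(data))]
    for child_name, child_data in handler.extract_children(data):
        results.extend(process(child_name, child_data, depth + 1, max_depth))
    return results
```

The max_depth guard is there because archives can nest archives; without some limit, a maliciously nested file could keep the recursion going far longer than intended.
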
I’m currently working on my OLE Container handler, which will be used for the many binary file types that adhere to the OLE storage standards. I’m also looking into AutoCAD files and other file types that are stored in a very similar way.
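
One practical way to route a file to a container handler is to look at its first few bytes rather than trusting the extension. The sketch below checks for the OLE compound file signature and for the ZIP local file header. The function name sniff_container and its return values are hypothetical, and a real platform would need many more signatures than these two.

```python
# Header signature of an OLE compound file (used by legacy .doc, .xls, .ppt, .msg).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(data: bytes) -> str:
    """Classify a file by its leading bytes (hypothetical helper)."""
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

Only the first eight bytes are needed, so a caller could read just the head of the file before deciding which handler to load.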

By the way, one of the hardest parts of writing this code is learning how each file type is stored on disk and how to go after the bits of interest. I had to laugh when I found the Microsoft spec on how Office Open XML files are stored. The document was almost 6000 pages long. I realize Microsoft is the king of bloat, but 6000 pages? Really?
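
For all those pages, the container itself is approachable: an Office Open XML file is a ZIP package of XML parts. As a rough sketch of going after the bits of interest, the Python below pulls a few core properties out of a document. The function name read_core_properties and the choice of fields are mine, and it assumes the properties live at docProps/core.xml, which is where Office conventionally writes them; strictly speaking, the package relationships decide the location.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> element to look for (an illustrative subset).
WANTED = {
    "title": "dc:title",
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(data: bytes) -> dict:
    """Return selected core properties, or {} if the part is missing."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        # Assumption: conventional part name, not resolved via _rels/.rels.
        if "docProps/core.xml" not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read("docProps/core.xml"))
    props = {}
    for key, path in WANTED.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            props[key] = node.text
    return props
```

The same function works for Word, Excel and PowerPoint files from Office 2007 onward, since they share the packaging layer and differ only in the parts that hold the content.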
