Friday, March 22, 2013

eDiscovery File Type Handlers


I stated in my last post that I got a bit sidetracked during the last several months, but I’m back at it at full speed.  At the heart of every eDiscovery processing engine is the ability to process file types.  By “process” I mean the ability to:

  • Extract Metadata
  • Extract Text
  • Convert to Intermediary Format (PDF in my case)
  • Convert to TIFF/JPG and Other Images
  • OCR Non-textual Layers
  • And more 

There are many ways to architect a system to handle the above list of tasks, and I have experimented with them all.  I have decided to go with a pluggable architecture, which will allow me to add new file handlers over time without having to recompile the entire engine.  In most cases recompiling is not an issue, especially since I plan to run this from a cloud environment.  However, if I ever decide to license my engine, I’ll want a way to distribute updates that causes as little disruption as possible.
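To make the pluggable idea concrete, here is a minimal sketch in Python. It is not the engine’s actual code: the names FileHandler, HandlerRegistry and PlainTextHandler are hypothetical, and only two of the tasks listed above are shown. It assumes handlers are looked up by file extension, which is a simplification.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileHandler(ABC):
    """Hypothetical contract that every pluggable handler implements."""

    # Lowercase extensions (without the dot) that this handler claims.
    extensions: tuple = ()

    @abstractmethod
    def extract_metadata(self, path: Path) -> dict:
        """Return a dictionary of metadata fields for the file."""

    @abstractmethod
    def extract_text(self, path: Path) -> str:
        """Return the textual content of the file."""


class HandlerRegistry:
    """Maps file extensions to handlers; the core engine only talks to this."""

    def __init__(self):
        self._by_ext = {}

    def register(self, handler: FileHandler) -> None:
        # A handler registered later replaces an earlier one for the same extension.
        for ext in handler.extensions:
            self._by_ext[ext.lower()] = handler

    def handler_for(self, path: Path):
        # Returns None when no handler claims the extension.
        return self._by_ext.get(path.suffix.lstrip(".").lower())


class PlainTextHandler(FileHandler):
    """Example plugin: the simplest possible handler, for TXT files."""

    extensions = ("txt",)

    def extract_metadata(self, path: Path) -> dict:
        return {"name": path.name, "size_bytes": path.stat().st_size}

    def extract_text(self, path: Path) -> str:
        return path.read_text(encoding="utf-8", errors="replace")
```

With this shape, supporting a new file type means shipping one more handler class and registering it; the registry and the rest of the engine stay untouched. A real system would probably also sniff file signatures rather than trust extensions alone.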

I have listed the file types that I have completed below.  I am using the term “completed” rather loosely here.  Every file handler below knows how to perform the above list of tasks with the exception of OCRing.  There are many occasions where I decided to write my own widget from scratch instead of paying someone licensing fees to use a widget that is already written.  OCR is not one of them.  I am testing multiple OCR solutions and will eventually pick one of them to work with.  I’ll update my findings in a future post. 

So, the list below represents all the file handlers currently supported by my eDiscovery engine.  A note on Excel and Word documents:  I decided to go back as far as Office 97 for these.  That means that, at present, I cannot process Word or Excel documents created in versions earlier than Office 97.  With limited resources and time, I felt I needed to draw a line somewhere, and this is where it was drawn.  Later, if the need arises, I will create Office handlers for pre-97 documents.  Again, it’s pluggable, so why not :)

 

File Handler List:

 



PPT, PPTX, PDF, PCL, XPS
 
JPG, PNG, BMP, EMF, GIF, WMF, ICON, TIFF, SVG, EXIF
 
DOC, DOCX, TXT, OOXML, ODT, RTF, HTML, XHTML, MHTML
 
PST, OST

MSG, EML, ICS, VCF

XLS, XLSX, XLSM, XLTX, XLTM

ZIP, RAR, 7Z, TAR, GZ, GZIP 
 
I’d say the above list includes most of the documents found in Windows-based collections in the eDiscovery world, but there are a few still missing that will be added in good time.  My short list contains AutoCAD, MBox, Microsoft Visio and Microsoft Project.  Those file handlers will finish my “must have” list, although I will never, ever be done adding file handlers.  So many new file types, so little time!
 
Also, I spent extra time on MSG files.  Everyone knows that MSG files are Outlook mail messages.  However, they can also be Tasks, vCards and iCal items.  When my system extracts items from a PST or OST, it determines the actual type (vCard, iCal, etc.) and saves the native in its true format.  Likewise, if the system comes across loose MSG files that were either saved by the user or extracted by another tool, they are still evaluated to determine the correct type.  This is just one of the small things I do up front to ensure high-quality exports and deliverables on the back end.  Of course, I need to finish this before anyone can actually use it :)
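As a rough illustration of that type check, here is a sketch in Python. It assumes the item’s Outlook message class string (values such as IPM.Note, IPM.Contact, IPM.Appointment and IPM.Task) has already been read from the MSG or PST item by some parser. The function name, the returned labels and the choice of output extensions are my own assumptions for illustration, not necessarily what the engine does.

```python
# Known message class prefixes mapped to (item kind, assumed native extension).
# Keeping tasks as .msg is an assumption made for this sketch.
MESSAGE_CLASS_PREFIXES = [
    ("IPM.Appointment", ("calendar", ".ics")),
    ("IPM.Schedule.Meeting", ("calendar", ".ics")),
    ("IPM.Contact", ("contact", ".vcf")),
    ("IPM.Task", ("task", ".msg")),
    ("IPM.Note", ("email", ".msg")),
]


def classify_message_class(message_class: str):
    """Return (kind, extension) for a message class string.

    Matching is case-insensitive here and accepts subclasses, so
    "IPM.Note.SMIME" still counts as an email, while an unrelated
    class such as "IPM.Notebook" does not.
    """
    normalized = message_class.strip().lower()
    for prefix, result in MESSAGE_CLASS_PREFIXES:
        p = prefix.lower()
        if normalized == p or normalized.startswith(p + "."):
            return result
    # Unrecognized classes are kept as plain MSG files.
    return ("unknown", ".msg")
```

For example, classify_message_class("IPM.Contact") returns ("contact", ".vcf"), so a loose MSG file that is really a contact can be exported as a vCard instead of being treated as an email.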
 
Now that my “must have” list is (almost) complete, it’s time to start working on the client (GUI) side.  My goal over the next several weeks is to get a complete processing engine up and running that supports the above list of files.  This includes the graphical user interface (GUI), the services to support it, and additional functionality in the core components such as TIFF, filtering, keyword searching, etc.
 
Stay tuned…

