I stated in my last post that I got a bit sidetracked during the last several months, but I’m back at it at full speed. At the heart of every eDiscovery processing engine is the ability to process file types. By “process” I mean the ability to:
- Extract Metadata
- Extract Text
- Convert to Intermediary Format (PDF in my case)
- Convert to TIFF/JPG and Other Images
- OCR Non-textual Layers
- And more
There are many ways to architect a system to handle the above list of tasks, and I have experimented with them all. I have decided to go with a pluggable architecture, which will allow me to add new file handlers over time without having to recompile the entire engine. In most cases recompiling is not an issue, especially since I plan to run this from a cloud environment. However, if I ever decide to license my engine, I’ll want a way to distribute updates that causes as little interference as possible.
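To make the pluggable idea concrete, here is a minimal Python sketch of a handler contract and a registry that routes a file to a handler by extension. It is an illustration only, not the engine's actual code: the names `FileHandler`, `HandlerRegistry`, and `PlainTextHandler` are hypothetical, and only two of the tasks listed above are modeled.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileHandler(ABC):
    """Hypothetical plug-in contract: one handler per family of file types."""

    # Extensions (without the dot) that this handler claims.
    extensions = ()

    @abstractmethod
    def extract_metadata(self, path):
        """Return a dict of metadata for the file at `path`."""

    @abstractmethod
    def extract_text(self, path):
        """Return the text content of the file at `path`."""


class HandlerRegistry:
    """Maps file extensions to handlers so the core engine never changes."""

    def __init__(self):
        self._by_ext = {}

    def register(self, handler):
        # A later registration for the same extension replaces the earlier one.
        for ext in handler.extensions:
            self._by_ext[ext.lower()] = handler

    def handler_for(self, path):
        # Returns None when no handler claims the extension.
        return self._by_ext.get(Path(path).suffix.lower().lstrip("."))


class PlainTextHandler(FileHandler):
    """Example plug-in for TXT files (assumes UTF-8 for simplicity)."""

    extensions = ("txt",)

    def extract_metadata(self, path):
        p = Path(path)
        return {"name": p.name, "size_bytes": p.stat().st_size}

    def extract_text(self, path):
        return Path(path).read_text(encoding="utf-8", errors="replace")


registry = HandlerRegistry()
text_handler = PlainTextHandler()
registry.register(text_handler)
```

With this shape, supporting a new file type means shipping one new handler class and registering it. The registry and the code that calls `handler_for` stay untouched, which is the property that matters when distributing updates.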
I have listed the file types that I have completed below. I am using the term “completed” rather loosely here. Every file handler below knows how to perform the above list of tasks with the exception of OCRing. There are many occasions where I decided to write my own widget from scratch instead of paying someone licensing fees to use a widget that is already written. OCR is not one of them. I am testing multiple OCR solutions and will eventually pick one of them to work with. I’ll update my findings in a future post.
So, the list below represents all the file handlers currently supported by my eDiscovery engine. A note on Excel and Word documents: I decided to go back as far as Office 97 for these. That means that, at present, I cannot process Word or Excel documents older than Office 97. With limited resources and time, I felt I needed to draw a line somewhere, and this is where it was drawn. Later, if the need arises, I will create Office handlers for pre-97 documents. Again, it’s pluggable, so why not :)
File Handler List:
PPT, PPTX, PDF, PCL, XPS
JPG, PNG, BMP, EMF, GIF, WMF, ICON, TIFF, SVG, EXIF
DOC, DOCX, TXT, OOXML, ODT, RTF, HTML, XHTML, MHTML
PST, OST
MSG, EML, ICS, VCF
XLS, XLSX, XLSM, XLTX, XLTM
ZIP, RAR, 7Z, TAR, GZ, GZIP
I’d say the above list includes most of the documents found in Windows-based collections in the eDiscovery world, but there are a few still missing that will be added in good time. My short list contains AutoCAD, MBox, Microsoft Visio and Microsoft Project. Those file handlers will finish my “must have” list. That said, I will never, ever be done adding file handlers. So many new file types, so little time!
Also, I spent extra time on MSG files. Everyone knows that MSG files are Outlook mail messages. However, they can also be Tasks, vCards and iCal items. When my system extracts items from a PST or OST, it determines the actual type (vCard, iCal, etc.) and saves the native in its true format. Loose MSG files, whether saved by the user or extracted by another tool, are evaluated the same way to determine the correct type. This is just one of the small things I do up front to ensure high-quality exports and deliverables on the backend. Of course, I need to finish this before anyone can actually use it :)
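The post does not say how the engine decides what an MSG really is. One plausible approach, sketched below in Python, is to look at the item's Outlook message class (values such as `IPM.Note`, `IPM.Contact`, `IPM.Appointment` and `IPM.Task`) and match on the prefix, since subclasses such as `IPM.Note.SMIME` extend the base class with a dot. The function name, the returned labels, and the assumption that the message class string has already been read from the file are all hypothetical.

```python
# Base Outlook message classes and the kind of item each one represents.
# The labels on the right are illustrative names, not part of any standard.
MESSAGE_CLASS_KINDS = (
    ("IPM.Contact", "contact"),       # candidate for vCard output
    ("IPM.Appointment", "calendar"),  # candidate for iCal output
    ("IPM.Task", "task"),
    ("IPM.Note", "mail"),             # ordinary e-mail, keep as MSG
)


def classify_message_class(message_class):
    """Return the item kind for an Outlook message class string.

    Matching is case-insensitive and accepts subclasses, so
    "IPM.Note.SMIME" is still treated as mail. Anything that is not
    recognized is reported as "unknown" rather than guessed.
    """
    normalized = message_class.strip().lower()
    for prefix, kind in MESSAGE_CLASS_KINDS:
        base = prefix.lower()
        # Require an exact match or a dot after the prefix, so that
        # "IPM.Notebook" is not mistaken for "IPM.Note".
        if normalized == base or normalized.startswith(base + "."):
            return kind
    return "unknown"
```

Returning "unknown" instead of defaulting to mail keeps unexpected item types visible, so they can be reviewed instead of being exported in the wrong format.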
Now that my “Must Have” list is (almost) complete, it’s time to start working on the client (GUI) side. My goal over the next several weeks is to get a completed processing engine up and running that will support the above list of files. This includes the Graphical User Interface (GUI), the services to support it, and additional functionality for the core components such as TIFF conversion, filtering, keyword searching, etc.
Stay tuned…