My eDiscovery Project: File Types and Metadata

Monday, December 19, 2011

File Types and Metadata

I now have what I call the “Import Tasks” completed. These are all the tasks needed when first importing data into the eDiscovery platform. In fact, I call the first phase the ImportTask phase. This phase accomplishes everything I’ve already written about – such as deNist, file identification, file hashing, copying files to the production environment and so on. The next phase is the actual discovery phase.

During the discovery phase, each file needs to have its metadata extracted and be checked for children. Some files are pure containers, such as ZIP, RAR, PST and MBOX, while others are not containers in the usual sense but can still contain other files. Think of a Word document that embeds an Excel spreadsheet, or an email that has one or more attachments. The discovery phase is responsible for examining the source document to determine whether it contains any child documents. For each child found, the system needs to create a new discovery task and add it to the queue for processing.
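To make the parent/child queueing concrete, here is a minimal sketch in Python. It is not the platform's actual code: the names DiscoveryTask, find_children and run_discovery are hypothetical, the "metadata" is reduced to name, size and nesting depth, and only ZIP containers are expanded (via the standard zipfile module). A real handler set would also cover PST, MBOX, embedded OLE objects and so on.

```python
import io
import zipfile
from collections import deque
from dataclasses import dataclass


@dataclass
class DiscoveryTask:
    """Hypothetical unit of work: one file waiting to be examined."""
    name: str        # display path, e.g. "root.zip/inner.zip/b.txt"
    data: bytes      # raw file content
    depth: int = 0   # nesting level below the originally imported file


def find_children(task):
    """Return one new task per child file. Only ZIP is handled in this sketch."""
    buffer = io.BytesIO(task.data)
    if not zipfile.is_zipfile(buffer):
        return []
    children = []
    with zipfile.ZipFile(buffer) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child_name = f"{task.name}/{info.filename}"
            children.append(
                DiscoveryTask(child_name, archive.read(info), task.depth + 1)
            )
    return children


def run_discovery(root):
    """Process the queue until empty; children are queued as they are found."""
    queue = deque([root])
    processed = []
    while queue:
        task = queue.popleft()
        # Stand-in for real metadata extraction: record name, size and depth.
        processed.append((task.name, len(task.data), task.depth))
        queue.extend(find_children(task))
    return processed


def make_zip(files):
    """Demo helper: build an in-memory ZIP from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


if __name__ == "__main__":
    inner = make_zip({"b.txt": b"world"})
    root = make_zip({"a.txt": b"hello", "inner.zip": inner})
    for name, size, depth in run_discovery(DiscoveryTask("root.zip", root)):
        print(depth, size, name)
```

Because every child goes back onto the same queue, a ZIP inside a ZIP needs no special handling: the inner archive is simply discovered when its own task comes up.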

This is where things start to get a bit tricky. Each application stores files in a different way. Some use proprietary binary formats, some use OLE Containers and others don’t fall into any category at all. My approach here is an “add-in” architecture that lets me build a new add-in whenever I need to support a new file type. For now, I am focusing on the top 100 file types that I see in the eDiscovery world. You’d be surprised how many applications use the same storage mechanism. For example, OLE Containers are used by lots of applications. This means that I can create one type of file handler and have it act on many different types of files.
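One way to picture the add-in idea is a small registry of handlers that are chosen by inspecting the first bytes of a file rather than its extension. The sketch below is an illustration only, written in Python; the class names, the register decorator and pick_handler are all hypothetical and not taken from the actual platform. The two signatures are the standard ones: OLE compound files start with the bytes D0 CF 11 E0 A1 B1 1A E1, and ZIP local file headers start with "PK\x03\x04".

```python
from typing import List, Optional

# Well-known file signatures ("magic numbers").
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header


class FileHandler:
    """Hypothetical add-in interface: one subclass per storage mechanism."""
    name = "base"

    def can_handle(self, header: bytes) -> bool:
        raise NotImplementedError


HANDLERS: List[FileHandler] = []


def register(cls):
    """Class decorator that adds one instance of the add-in to the registry."""
    HANDLERS.append(cls())
    return cls


@register
class OleContainerHandler(FileHandler):
    # A single handler serves every format built on OLE compound files,
    # for example legacy .doc, .xls and .ppt files and Outlook .msg files.
    name = "ole-container"

    def can_handle(self, header: bytes) -> bool:
        return header.startswith(OLE_MAGIC)


@register
class ZipContainerHandler(FileHandler):
    name = "zip-container"

    def can_handle(self, header: bytes) -> bool:
        return header.startswith(ZIP_MAGIC)


def pick_handler(header: bytes) -> Optional[FileHandler]:
    """Return the first registered add-in that claims the header, else None."""
    for handler in HANDLERS:
        if handler.can_handle(header):
            return handler
    return None
```

Supporting a new storage mechanism then means writing one more decorated class; the dispatch code in pick_handler does not change.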

I’m just getting started here, but so far so good. Some of this work is very tedious, but it has to be done. This is probably the main reason that most people buy off-the-shelf software for production. I will most likely write more about the files I am supporting later. I should also mention that within my 100 file types, I plan to support many Mac formats as well. In my opinion, the eDiscovery industry has tried to avoid Mac files. However, with Mac gaining more and more market share all the time, any eDiscovery platform worth its salt needs to process at least the popular Mac files.

4 comments:

Anonymous, December 28, 2011 at 11:19 AM
So why are you doing all this on your own when companies like Advanced Discovery, kCura, Law, Nuix and a bunch of other companies could use your skillset? Just seems like you could really write your own ticket.
-Coder, December 29, 2011 at 12:13 PM
To be quite honest, I'm just tired of being an employee. If I could pick my perfect scenario, it would be finding a small startup eDiscovery company where I would be part owner. I would help steer the technology side of the business while the real smart guys would steer the business side of the company. Basically, I want more skin in the game than just a paycheck. Do you know any eDiscovery startups? That would be fun :-)
Unknown, June 19, 2012 at 2:55 AM
How do you find metadata fields if you don't have the native application?
-Coder, March 19, 2013 at 10:04 AM
The short answer is you don't. You have to have the native file if you want to extract metadata.