Sunday, December 11, 2011

eDiscovery Processing Update

I’ve spent a lot of time building the QueueManager. In fact, I’ve spent more than twice as long on this part of my platform as I anticipated. However, I’m very happy with the results. I now have a fully functional, high-availability, massively parallel system to submit, process and log tasks of any kind. I was also able to build a test harness (client application) to exercise all the new features. At this point I have a test application that uses my core platform to:

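  • Hash files
  • Identify file types
  • De-NIST files (filter out known system and application files by matching their hashes against the NIST NSRL list)
  • Move files from import location to production storage
  • Update databases with file and storage info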
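
To make the first three steps concrete, here is a minimal Python sketch. It is not the platform's actual code: the function names, the choice of SHA-1, and the small signature table are all assumptions made for illustration. It hashes a file in chunks, guesses the type from the leading "magic" bytes rather than the extension, and checks the hash against a set of known hashes, which is the core of De-NISTing.

```python
import hashlib

# Leading-byte signatures for a few container formats (illustrative subset,
# not a complete file-type database).
SIGNATURES = {
    b"!BDN": "pst",          # Outlook personal storage file
    b"PK\x03\x04": "zip",    # ZIP local file header
    b"Rar!\x1a\x07": "rar",  # RAR archive
}


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


def identify_type(path):
    """Guess the file type from its first bytes, ignoring the extension."""
    with open(path, "rb") as f:
        header = f.read(8)
    for magic, kind in SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return "unknown"


def is_known_file(digest, known_hashes):
    """De-NIST check: True if the digest is in the known-file hash set.

    known_hashes is assumed to be a set of lowercase hex digests, for
    example loaded from a reference hash list.
    """
    return digest.lower() in known_hashes
```

A file for which `is_known_file` returns True would be dropped from further processing. Everything else moves on to storage and the database updates.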

Once each file has been identified and all the databases have been updated, I move the file to production storage. From there I submit a new task called DiscoveryTask that, depending on the file type, pulls embedded items out of their parent file and adds them to the system for further processing. This is the area I am currently working on, and it can be a bit tricky.

As an example of how this works, suppose you imported a Microsoft PST file (Outlook email storage file). The initial import would identify this as a true PST file and move it to production storage. From there, a new DiscoveryTask is created and dispatched to the QueueManager. Whichever process picks up that task is responsible for opening the PST file and carving it into more work items, which it submits to the QueueManager. In this case, each individual email (and all of its metadata) is extracted and converted to MSG format, keeping the parent-child relationships intact and updating the databases accordingly. Each MSG file is then checked for embedded items (by another process that picked up a work item from the queue), and the process repeats itself. When processing PST, ZIP, RAR and other container files, it’s not uncommon to traverse dozens of levels to find and process all the embedded items. This is a simplified version of what it takes to process a file like this, but you get the idea. With PST files, I won’t just process the emails. I’ll also be processing the contacts, calendar items, etc.
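
The "discover, extract, re-queue" loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single in-process `deque` stands in for the QueueManager, ZIP archives stand in for PST files (the standard library can open them), and the `DiscoveryTask` fields and the `discover_all` function are hypothetical names, not the platform's real API.

```python
import io
import zipfile
from collections import deque
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class DiscoveryTask:
    """One unit of work: inspect an item and extract anything embedded in it."""
    item_id: int
    parent_id: Optional[int]  # None for the originally imported file
    name: str
    data: bytes
    depth: int = 0


def discover_all(name, data):
    """Process an imported file and every item nested inside it.

    Returns (item_id, parent_id, name, depth) records, standing in for the
    database rows that preserve the parent-child relationships.
    """
    ids = count(1)
    queue = deque([DiscoveryTask(next(ids), None, name, data)])
    records = []
    while queue:
        task = queue.popleft()  # a worker picks up the next task
        records.append((task.item_id, task.parent_id, task.name, task.depth))
        if not zipfile.is_zipfile(io.BytesIO(task.data)):
            continue  # not a container, so nothing more to discover
        with zipfile.ZipFile(io.BytesIO(task.data)) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                # Each embedded item becomes a new task on the queue.
                queue.append(DiscoveryTask(next(ids), task.item_id,
                                           info.filename,
                                           archive.read(info),
                                           task.depth + 1))
    return records
```

Because every extracted child goes back onto the same queue, nesting depth needs no special handling: a ZIP inside a ZIP simply produces another task. In a real parallel system the loop body would run in many worker processes at once, and the records would be written to the databases instead of a list.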

The eDiscovery business is evolving like crazy. Workloads are becoming much bigger and harder to manage, and more and more users of eDiscovery want to bring this technology in-house. I believe more than ever that moving eDiscovery products and services to the cloud is the right way to go.

Back to coding…
