Thursday, December 29, 2011

File Types - Got Some Of The Easy Ones Done

This has been a busy week for me. Christmas was awesome, and spending time with family and friends was even better. However, I did spend a fair amount of time working on my eDiscovery platform. I’ve been focused on individual file types this week and have knocked quite a few off my list. I think I mentioned in a previous post that I was planning to create an “add-in” architecture for my new File Handlers.

In a nutshell, each File Handler is able to perform tasks on a specific type of file. For example, I created an OpenOfficeHandler that works on Office 2007 and later files (Word, PowerPoint, Excel). I also created an ArchiveFileHandler that is responsible for traversing archives such as ZIP, RAR and TAR. Now, each file handler is responsible for a number of tasks, such as the following (a rough sketch of this contract comes after the list):

  • Extracting file-level metadata
  • Extracting and importing children or embedded files (recursively)
  • Extracting text
  • And other tasks specific to the file type
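
To give a rough idea of the shape of that contract, here is an illustrative sketch in Python. This is not my actual platform code, and the class and method names are just placeholders; it only shows the kinds of responsibilities each handler owns.

# Illustrative sketch of the add-in contract each File Handler implements.
# Names and signatures are placeholders, not the real platform API.
from abc import ABC, abstractmethod

class FileHandler(ABC):
    """Base class every add-in implements for one family of file types."""

    # Extensions or signatures this handler claims, e.g. (".zip", ".rar", ".tar")
    handled_types = ()

    @abstractmethod
    def extract_metadata(self, path):
        """Return file-level metadata for the document at 'path'."""

    @abstractmethod
    def extract_children(self, path):
        """Yield any embedded or attached files so they can be imported recursively."""

    @abstractmethod
    def extract_text(self, path):
        """Return the document's text for search indexing."""

# OpenOfficeHandler, ArchiveFileHandler, the OLE Container handler and so on
# would each subclass this and declare the file types they know how to process.
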
I’m currently working on my OLE Container handler, which will be used for the many binary file types that adhere to the OLE storage standards. I’m also looking into AutoCAD files and other file types that are stored in a very similar way.

By the way, one of the hardest parts of writing this code is learning how the file types are stored on disk and how to go after the bits of interest. I had to laugh when I found the Microsoft spec on how Office Open XML files are stored. The document was almost 6000 pages long. I realize Microsoft is the king of bloat, but 6000 pages? Really?

Monday, December 19, 2011

File Types and Metadata

I now have what I call the “Import Tasks” completed. These are all the tasks needed when first importing data into the eDiscovery platform. In fact, I call the first phase the ImportTask phase. This phase accomplishes everything I’ve already written about – DeNisting, file identification, file hashing, copying files to the production environment and so on. The next phase is the actual discovery phase.

During the discovery phase, each file needs to have its metadata extracted and be checked for children. Some files are pure containers such as ZIP, RAR, PST, MBOX, etc., but some files are non-containers that still contain other files. Think of a Word document that contains an Excel spreadsheet, or an email that has one or more attachments. The discovery phase is responsible for examining the source document to determine whether any child documents are contained within. For each child found, the system needs to create a new discovery task and add it to the queue for processing.
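
Here is roughly what that discovery step looks like in spirit. This is just an illustrative Python sketch; DiscoveryTask, handler_for and task_queue are placeholders for the real pieces of my platform, not actual code from it.

# Sketch of the discovery step: examine one document, queue a task for each child.
# DiscoveryTask, handler_for and task_queue are placeholders, not real platform code.
from collections import namedtuple

DiscoveryTask = namedtuple("DiscoveryTask", ["path", "file_type", "parent_id"])

def discover(task, task_queue, handler_for):
    handler = handler_for(task.file_type)        # pick the right add-in for this type
    handler.extract_metadata(task.path)          # file-level metadata first
    for child_path, child_type in handler.extract_children(task.path):
        # Every child becomes its own work item and goes back into the queue,
        # so embedded files get processed recursively.
        task_queue.put(DiscoveryTask(child_path, child_type, parent_id=task.path))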

This is where things start to get a bit tricky. Each application stores files in different ways. Some are proprietary binary formats, some are OLE Containers and others don’t fall into any category at all. My approach here is to build an “add-in” architecture that allows me to build new add-ins whenever I need to add a new file type. For now, I am focusing on the top 100 file types that I see in the eDiscovery world. You’d be surprised how many applications use the same storage mechanism. For example, OLE Containers are used by lots of applications. This means that I can create one file handler and have it act on many different types of files.
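
To make the “one handler, many file types” point concrete, the lookup can be as simple as a registry that maps each type to the handler that claims it. Again, an illustrative Python sketch with placeholder class names and extension lists, not my real coverage list:

# Sketch of a handler registry: many file types can map to the same handler.
class HandlerRegistry:
    def __init__(self):
        self._by_type = {}

    def register(self, handler):
        for file_type in handler.handled_types:
            self._by_type[file_type] = handler

    def handler_for(self, file_type):
        return self._by_type.get(file_type)

# registry = HandlerRegistry()
# registry.register(OleContainerHandler())   # one handler, many OLE-based formats
# registry.register(ArchiveFileHandler())    # ZIP, RAR, TAR, ...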

I’m just getting started here, but so far so good. Some of this work is very tedious, but it has to be done. This is probably the main reason that most people buy off-the-shelf software for production. I will most likely discuss the files I am supporting in more detail later. I should also mention that within my 100 file types, I plan to support many Mac formats as well. In my opinion, the eDiscovery industry has tried to avoid Mac files. However, with Mac gaining more and more market share all the time, any eDiscovery platform worth its salt needs to process at least the popular Mac files.

Friday, December 16, 2011

Why Don't I Have Investors? Well...

From the time I decided I wanted to create eDiscovery Processing and Early Case Assessment software, I’ve been asked by friends and readers why I am doing this on my own and not out with my Executive Summary in hand trying to raise the capital needed to jump start this business. Well, to be totally honest, I went down that road and had nothing but failure after failure. The real problem was twofold:

  1. The “money guys” that I talked to were from angel investment groups or standard VC firms. They knew nothing about the eDiscovery industry, so it was difficult to explain why this is a good idea, why now is the perfect time, why I am the best one for the job, and so on. Along with that, most of these guys are non-technical, so they just couldn’t get their arms around my vision. I kept being told that they were looking for “simple” businesses they could invest in and help steer.
  2. The second issue, and probably the biggest as I look back at it, was that I didn’t have anything. I had ideas, a business plan, some cool graphics and the like, but I did not have anything tangible that these guys could lock up as collateral if they decided to invest. I was constantly being asked how much intellectual property I had. It was just too big of a risk for these guys since everything was still in my head. 
 
So, after a couple of months of writing Executive Summaries and being rejected, I started realizing how much time I was wasting by doing something that was clearly outside my skillset. I decided to start this business on my own – one line of code at a time. I’m not working as fast as I would like since I have to keep my day job, but I’m making progress and am happy with my decision. Another decision I made is that if and when I do find an investor, the investor will need to understand the eDiscovery industry completely. A good fit would probably be a big law firm that wants to get into DIY eDiscovery, or a professional review company that may want to expand its offerings. I’m getting ahead of myself here, but I guess what I am saying is that I am looking for money and knowledge this time around, not just money.
 
The trip down the funding road was not all bad. I learned a lot about how this part of the business works and I am sure this newfound knowledge will come in handy someday.

Monday, December 12, 2011

Scalability Change

I originally wanted to design my eDiscovery Processing platform using several self-hosted services. This would allow me to isolate frequently used tasks such as File Identification, DeNisting, Hashing, Text Extraction, etc., and also offer those services as SaaS to other users (customers, competitors, etc.). During some rudimentary benchmarking, I found that my services were not scaling like I needed them to. I was seeing too many “wait states” for the services listed above. It makes sense now that I think about it. I have many machines, each running between 8 and 16 threads, all trying to do work at the same time. Having every thread consume these services was overwhelming the servers running the services.

I have since changed this, and I now have each of these “services” included with each core process. This allows for better scalability since each new machine that gets added to the processing server pool now has its own suite of services. Once I made the change and benchmarked again, I saw an 8X increase – not too shabby.
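
In code terms, the change is roughly the difference between a worker calling out to a shared service host and a worker owning its own copies of those services. Here is a hedged sketch in Python; the class names are placeholders for my real components:

# After the change: each core process composes its own service instances,
# so there is no shared service host to overwhelm. Names are placeholders.
class CoreProcess:
    def __init__(self, identifier, hasher, denister, extractor):
        self.identifier = identifier     # file identification, local to this process
        self.hasher = hasher             # hashing, local to this process
        self.denister = denister         # DeNisting, local to this process
        self.extractor = extractor       # text extraction, local to this process

    def handle(self, task):
        file_type = self.identifier.identify(task.path)
        digest = self.hasher.hash_file(task.path)
        if self.denister.is_nist_file(digest):
            return None                  # known system file, nothing more to do
        return self.extractor.extract(task.path, file_type)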

As for the SaaS services that I plan to expose to the outside world, well, those will just have to be written and implemented separately. All the code is the same, so it’s not the end of the world.

Sunday, December 11, 2011

eDiscovery Processing Update

I’ve spent a lot of time building the QueueManager. In fact, I’ve spent more than twice as long on this part of my platform as I anticipated. However, I’m very happy with the results. I now have a fully functional, high-availability, massively parallel system to submit, process and log tasks of any kind. I was also able to build a test harness (client application) to use all the new features. At this point I have a test application that uses my core platform to do the following (a rough sketch of how these steps chain together comes after the list):

  • Hash Files
  • Identify file types 
  • DeNist Files 
  • Move files from import location to production storage 
  • Update databases with file and storage info
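
Here is that chain as an illustrative Python sketch. The hashing and file copy use real standard-library calls (hashlib, shutil, pathlib), but identify_type, is_nist_hash and the db object are placeholders standing in for my actual components.

# Rough sketch of the import chain the test harness drives.
# identify_type, is_nist_hash and db are placeholders for the real components.
import hashlib
import shutil
from pathlib import Path

def import_file(src, production_root, db):
    digest = hashlib.sha1(Path(src).read_bytes()).hexdigest()    # hash the file
    file_type = identify_type(src)                               # signature-based identification
    if is_nist_hash(digest):                                     # DeNist: drop known system files
        db.record_denisted(src, digest)
        return None
    dest = Path(production_root) / digest[:2] / digest           # content-addressed production path
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)                                      # copy into production storage
    db.record_file(src, str(dest), digest, file_type)            # update the databases
    return dest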

Once each file has been identified and all the databases have been updated, I move the file to production storage. From there I submit a new task called DiscoverTask that, depending on the file type, will pull embedded items out of their parent file and add them to the system for further processing. This is the area I am currently working on, and it can be a bit tricky.

As an example of how this works, suppose you imported a Microsoft PST file (Outlook email storage file). The initial import would identify this as a true PST file and move it to production storage. From there, a new DiscoverTask is created and dispatched to the QueueManager. Whichever process picks up that task is responsible for opening the PST file and carving out more work items for the QueueManager. In this case, each individual email (and all of its metadata) is extracted and converted to MSG format – keeping the parent and child relationships intact and updating the databases accordingly. Each MSG file is then checked (by another process that picked up a work item from the queue) for embedded items, and the process repeats itself. When processing PST, ZIP, RAR and other container files, it’s not uncommon to traverse dozens of levels in order to find and process all the embedded items. This is a simplified version of what it takes to process a file like this, but you get the idea. With PST files, I won’t just process the email messages; I’ll also be processing the Contacts, Calendar items, etc.
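
Stripped way down, the PST flow looks something like the sketch below. The PST calls (open_pst, walk_folders, save_as_msg) and the db and task_queue objects are placeholders, not a real library API; the point is the fan-out and the parent/child bookkeeping.

# Shape of the PST discovery flow; every name here is a placeholder.
def discover_pst(task, task_queue, db):
    pst = open_pst(task.path)                        # open the container (placeholder)
    for folder in pst.walk_folders():                # Inbox, Sent Items, Contacts, Calendar, ...
        for item in folder.items():
            msg_path = save_as_msg(item)             # each item is carved out as an MSG file
            child_id = db.record_child(parent_id=task.doc_id,
                                       path=msg_path,
                                       metadata=item.metadata)
            # The MSG may itself contain attachments, so it goes back on the queue
            # and whichever process grabs it repeats the discovery step.
            task_queue.put({"type": "discover", "doc_id": child_id, "path": msg_path})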

The eDiscovery business is evolving like crazy. Workloads are becoming much bigger and harder to manage, and more and more users of eDiscovery want to bring this technology in-house. I believe more than ever that moving eDiscovery products and services to the cloud is the right way to go.

Back to coding…

Monday, December 5, 2011

Why Am I Doing This?


I’ve been asked why I am doing this – in fact, I’ve asked myself that same question a few times at 3AM while working on this project. The fact is I love to create products and services that provide value. When it comes right down to it, that’s what drives me. Yes, I plan to make a ton of money at this too, but the money is not the driver.

I have said this before, but I see a lot of opportunity in the eDiscovery industry. The industry continues to evolve and get bigger and bigger. Not only do more companies need eDiscovery services, but the cases are getting bigger and more complicated as well. This excites me, because most of the current providers out there are failing to keep up with the ever-growing workloads and complexity, and I plan to take advantage of that.

Couple this with the fact that eDiscovery will be moving to the Cloud whether you like it or not, and you have the fire that keeps me up late at night working on this project. It’s a huge undertaking and it will take a lot of time and commitment to do it right, but THIS is the time to build this. My hope is to eventually be able to bring in some help – both at a technical level and a business level, but until I get a base infrastructure built, it’s just me – and I love every minute of it.

Friday, December 2, 2011

QueueManager - The Hub of eDiscovery Processing (the right way)

As I stated before, the QueueManager is the heart and soul of the processing platform. In my case, it will be the hub for more than just processing. I expect to use the queue to dispatch and distribute work items to all areas of my eDiscovery platform. For example, the Queue will be sent work items for file identification, text extraction, TIFFing, building search indexes and a whole slew of other tasks. In order for the queue to be able to process an ever-growing list of work items, it needs to be very robust and very fast.
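
To show the shape of the idea, here is a toy, in-memory stand-in written in Python. The real QueueManager is far more involved (it has to survive failures and be shared across many machines); this sketch only illustrates typed work items going in and any worker pulling the next one out.

# Toy stand-in for the QueueManager idea (illustrative only).
import itertools
import queue
from dataclasses import dataclass
from typing import Any

@dataclass
class WorkItem:
    task_type: str        # "identify", "extract_text", "tiff", "index", ...
    payload: Any          # whatever the task needs: paths, document ids, etc.
    priority: int = 5     # lower number = more urgent

class QueueManager:
    def __init__(self):
        self._q = queue.PriorityQueue()
        self._seq = itertools.count()   # tie-breaker keeps equal priorities FIFO

    def submit(self, item: WorkItem):
        self._q.put((item.priority, next(self._seq), item))

    def next_task(self) -> WorkItem:
        return self._q.get()[2]         # blocks until a work item is available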

I don’t want this blog to be too technical, but I need to go into some detail to explain why I do some of these things. First of all, most of my competitors struggle when ingesting data of any real size. They eventually get it done, but depending on the vendor and the size of the matter, this can take days. Truth be told, ingestion is the most CPU- and disk-I/O-intensive operation of the entire EDRM model, so it’s no wonder it can take so long. However, with the correct architecture and the correct software, this time can be reduced dramatically.

I’m tempted to let the company names of my competitors fly as I describe this process, but I won’t (at least for now). Here’s a very simple example of how this process works. Let’s assume we are ingesting just one file for this example – a Word document. Here’s what it takes to correctly process this one document (a rough sketch of the recursion follows the list):
  
  • Move document to work area 
  • Hash document for storage 
  • Verify document is really a Word Document 
  • Check to see if we can DeNist this file 
  • Check to see if this document has any children 
  •        If so, extract child to work area and kick off the process from the beginning
  •        If child has no text – OCR 
  • Extract all metadata from the Word Document 
  • Check for text layer within document 
  •        If so, extract text and add to search index
  •        If not, OCR and “find” text 
  • Persist all parent/child relationships 
  •  Update databases 
  •  Move to next document 
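
Tied together, those steps boil down to a recursive routine, sketched here in Python. Everything in it (handler_for, ocr, index, db) is a placeholder rather than my actual code; it just shows how children and the text/OCR branch feed back into the same process.

# Recursive sketch of the per-document flow above; all names are placeholders.
def process_document(path, db, index, handler_for, parent_id=None):
    doc_id = db.register(path, parent_id=parent_id)   # hashing, verification and DeNist checks sit around here
    handler = handler_for(path)

    # Children first: each embedded file runs through the very same routine.
    for child_path in handler.extract_children(path):
        process_document(child_path, db, index, handler_for, parent_id=doc_id)

    db.store_metadata(doc_id, handler.extract_metadata(path))

    text = handler.extract_text(path)
    if not text:
        text = ocr(path)                               # no text layer, so OCR to "find" it
    index.add(doc_id, text)                            # feed the search index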
  
The above list is very simple, and I have skipped over a lot of the smaller steps. Even so, you can see that a fair amount of work needs to be done with each file – and this was a very simple example. The same process holds true when discovering ZIP files containing thousands of other files, or PST files that contain hundreds of thousands of emails. It becomes a very recursive process and can take a long time to complete when using the wrong architecture.

So, now you are probably wondering what I am doing differently – glad you asked. First of all, I break down almost all of the steps above (and a lot more) into individual units of work – what I call Tasks. Each task gets added to the Queue. Every processing server in my infrastructure asks the queue for a new Task. That task is then handed off to a new thread to start working on it. Depending on the number of cores in my servers, each server will have between 8 and 24 threads all working on tasks simultaneously. This allows multiple threads on multiple machines to work on the same PST, for example – hundreds of threads swarming into the PST file and processing individual MSG files (and their attachments). This architecture allows me to scale by simply adding more hardware. The software and database infrastructure is being designed to handle an incredible workload, and this is one of the keys to keeping things hitting on all cylinders.
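
For a feel of the worker side, here is a hedged sketch of one processing server: size the thread count off the cores and let every thread loop on "ask the queue for a task, run it." Again, illustrative Python with placeholder objects (queue_manager, handlers), not the production code.

# Sketch of one processing server: a pool of threads that all pull from the queue.
import os
import threading

def worker_loop(queue_manager, handlers):
    while True:
        item = queue_manager.next_task()          # blocks until work is available
        handler = handlers[item.task_type]        # e.g. "discover" -> discovery code
        handler(item.payload)                     # any children found get re-submitted

def start_server(queue_manager, handlers):
    thread_count = min(24, max(8, os.cpu_count() or 8))   # roughly 8 to 24 threads per box
    for _ in range(thread_count):
        threading.Thread(target=worker_loop,
                         args=(queue_manager, handlers),
                         daemon=True).start()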

The reason for this post was to talk a little bit about my QueueManager, but I got a little distracted with why I have a QueueManager in the first place. I’ll be talking a lot about the queue and how it works as I continue development, but I am just about done with the first rough draft. Over the next couple of days, I should get a proof of concept running that ties together all the pieces I have built so far. I will also be able to get some benchmarks at the same time.

For now, it’s time to get back to coding!