As I stated before, the QueueManager is the heart and soul of the processing platform. In my case, it will be the hub for more than just processing. I expect to use the queue to dispatch and distribute work items to all areas of my eDiscovery platform. For example, the Queue will be sent work items for file identification, text extraction, TIFFing, building search indexes and a whole slew of other tasks. In order for the queue to be able to process an ever-growing list of work items, it needs to be very robust and very fast.
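To make that a bit more concrete, here is a minimal sketch of what a generic work item on the queue might look like. The names and fields (TaskType, WorkItem, and so on) are just illustrative placeholders for this post, not the actual classes in my code:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional
import uuid

# Illustrative task categories only; the real platform will have many more.
class TaskType(Enum):
    FILE_IDENTIFICATION = auto()
    TEXT_EXTRACTION = auto()
    TIFFING = auto()
    SEARCH_INDEXING = auto()
    OCR = auto()

@dataclass
class WorkItem:
    task_type: TaskType
    document_path: str
    parent_id: Optional[str] = None    # set when this item is a child (e.g. an attachment)
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```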
I don’t want this blog to be too technical, but I need to go into some detail to explain why I do some of these things. First of all, most of my competitors struggle when ingesting data of any size. They eventually get it done, but depending on the vendor and the size of the matter, it could take days. Truth be told, ingestion is the most CPU- and disk-I/O-intensive operation in the entire EDRM model, so it’s no wonder it can take so long. However, with the right architecture and the right software, that time can be reduced dramatically.
I’m tempted to name my competitors as I describe this process, but I won’t (at least for now). Here’s a very simple example of how this process works. Let’s assume we are ingesting just one file for this example – a Word document. Here’s what it takes to correctly process this one document:
- Move document to work area
- Hash document for storage
- Verify document is really a Word Document
- Check to see if we can DeNIST this file
- Check to see if this document has any children
- If so, extract each child to the work area and kick off the process from the beginning
- If child has no text – OCR
- Extract all metadata from the Word Document
- Check for text layer within document
- If so, extract text and add to search index
- If not, OCR and “find” text
- Persist all parent/child relationships
- Update databases
- Move to next document
The above list is very simple, and I have skipped over a lot of the smaller steps. Even so, you can see that a fair amount of work needs to be done for each file – and this was a very simple example. The same process holds true when discovering ZIP files that contain thousands of other files, or PST files that contain hundreds of thousands of emails. It becomes a very recursive process and can take a long time to complete when using the wrong architecture.
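To show what I mean by recursive, here is a very rough sketch of how one document can fan out into more work items. All of the helper functions are placeholders I have stubbed out for illustration, and the queue here is just a plain Python list; it’s a sketch of the idea, not my actual pipeline:

```python
import hashlib
from pathlib import Path

# Placeholder helpers standing in for real identification, extraction, OCR and database code.
def is_nist_listed(doc_hash): return False
def extract_children(path): return []              # e.g. attachments inside a ZIP/PST/MSG
def extract_text(path): return ""
def ocr(path): return "(ocr text)"
def extract_metadata(path): return {}
def store_document(path, doc_hash, parent_id): return doc_hash
def index_text(doc_id, text): pass
def persist_metadata(doc_id, metadata): pass

def ingest(path: Path, work_queue, parent_id=None):
    """Simplified stand-in for the per-document steps listed above."""
    doc_hash = hashlib.sha1(path.read_bytes()).hexdigest()    # hash the document for storage
    if is_nist_listed(doc_hash):                              # DeNIST: skip known system files
        return
    doc_id = store_document(path, doc_hash, parent_id)

    # Containers (ZIP, PST, MSG) queue a new ingest task for every child,
    # which is where the recursion and the fan-out come from.
    for child in extract_children(path):
        work_queue.append(lambda p=child: ingest(p, work_queue, parent_id=doc_id))

    text = extract_text(path) or ocr(path)                    # no text layer means we OCR it
    index_text(doc_id, text)
    persist_metadata(doc_id, extract_metadata(path))
```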
So, now you are probably wondering what I am doing differently – glad you asked. First of all, I break down almost all of the steps above (and a lot more) into individual units of work – what I call Tasks. Each task gets added to the Queue. Every processing server in my infrastructure asks the queue for a new Task, which is then handed off to a thread that starts working on it. Depending on the number of cores in my servers, each server will have between 8 and 24 threads working on tasks simultaneously. This allows multiple threads on multiple machines to work on the same PST, for example – hundreds of threads can swarm into the PST file and process individual MSG files (and their attachments). This architecture allows me to scale by simply adding more hardware. The software and database infrastructure is being designed to handle an incredible workload, and this is one of the keys to keeping things hitting on all cylinders.
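Here is a rough sketch of the worker side of that idea: a pool of threads on each processing server pulling tasks off a shared queue. This is just illustrative code using Python’s standard library queue, not my actual QueueManager:

```python
import queue
import threading

NUM_WORKERS = 16                      # in practice 8-24 per server, depending on cores
task_queue = queue.Queue()            # stand-in for the real QueueManager

def worker():
    while True:
        task = task_queue.get()       # each worker thread asks the queue for its next task
        try:
            task()                    # a task is just a small, self-contained unit of work
        finally:
            task_queue.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

# Producers (e.g. the ingestion code) enqueue callables; the workers swarm them in parallel.
task_queue.put(lambda: print("identify file"))
task_queue.put(lambda: print("extract text"))
task_queue.join()                     # block until every queued task has been processed
```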
The reason for this post was to talk a little bit about my QueueManager, but I got a little distracted with why I have a QueueManager in the first place. I’ll be talking a lot about the queue and how it works as I continue development, but I am just about done with the first rough draft. Over the next couple of days, I should have a proof of concept running that ties together all the pieces I have built so far. I will also be able to get some benchmarks at the same time.
For now, it’s time to get back to coding!