I’m approaching this project like every other project I’ve ever worked on - beginning with the end in mind. The three most important aspects of processing software in the ediscovery space are accuracy, speed and scalability – or as I like to say, ASS.
Accuracy is pretty self-explanatory – you just can’t be wrong when processing data. For example, identifying file types is one of the first things any good processing software needs to do. If a file is misidentified, all later processing done on that file could be useless. Another example is a file incorrectly identified as containing no text. In that case, the text never makes it into the search indexes – effectively eliminating any chance of the document being found during keyword searches in the Early Case Assessment (ECA) phase of the project.
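To make the file-identification point concrete, here is a minimal sketch of signature-based ("magic byte") detection – identifying a file by its leading bytes rather than trusting its extension. The signature table and labels here are illustrative assumptions; production tools check hundreds of formats.

```python
# Illustrative signature table (magic bytes -> format label).
# Real identification engines cover far more formats and edge cases.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip / office open xml"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2 (legacy office)"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]

def identify(path):
    """Return a format label based on the file's leading bytes,
    ignoring the (possibly misleading) file extension."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"
```

Note that a PDF renamed to `memo.doc` would still be identified correctly here, which is exactly why extension-based identification is not good enough for discovery work.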
Speed is critical in this business. It’s not uncommon for a service provider to process tens of millions of files in a single case. E-Discovery software must be optimized to get through all stages of the discovery process as fast as possible. I’ll be talking more about this in future posts, but know that there are many ways my competitors “fudge” the numbers to make it look as though they are discovering terabytes of data in mere minutes – not true, I tell you, and I will explain how this sleight of hand works later. For now, understand that I plan to process data the “right” way and will do it in a massively parallel fashion to keep my customers happy – after all, they are the ones I need to keep happy.
Scalability means many things to many people, so let me clarify what I mean and why I am going down this road. First, without getting into too much detail about how the latest generation of CPUs works, understand that all new servers these days have multiple cores per CPU, and a server can have multiple CPUs. Software written to take advantage of this hardware architecture is said to use parallel processing – a fancy way of saying that the software on a given machine can do multiple things at the same time, if written correctly. More specifically, the software can run as many simultaneous tasks as the server has cores. Pretty simple, right? My architecture will take this one step further by also scaling out – adding more servers to the mix. This turns out to be a very cost-effective way of growing the production environment, since I can take advantage of everything each server has to offer. A positive side effect is that it also builds redundancy into the production environment – something I have planned for another post as I get further along.
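The one-worker-per-core idea above can be sketched in a few lines. This is just an illustration of per-core parallelism on a single server – the `process_file` body is a stand-in for the real per-file work (identify, extract text, index), and the batch function is a hypothetical name, not part of any actual product.

```python
# A minimal sketch of per-core parallelism on one server:
# spawn as many workers as the machine has cores, and let each
# worker process files independently.
from multiprocessing import Pool, cpu_count

def process_file(path):
    # Stand-in for real per-file work (identify, extract, index).
    # Here we just return the path and its length as a dummy result.
    return (path, len(path))

def process_batch(paths):
    # One worker process per core: the server runs cpu_count()
    # tasks simultaneously instead of handling files one at a time.
    with Pool(processes=cpu_count()) as pool:
        return pool.map(process_file, paths)
```

Scaling out is then just running the same worker pool on more servers and feeding them from a shared queue – which is exactly where the processing queue below comes in.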
All three of these important pieces will be managed by a “processing queue” that I am just now starting to develop. This will be the heart and soul of the processing software and will most likely go through many iterations as the project matures. I know this because I have been writing software long enough to know that a piece this important never gets done right the first time. As more requirements are fleshed out, the queue will have more work to do. Fun stuff!
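As a rough sketch of the queue idea – not my actual design, just the basic shape – producers push file tasks onto a central queue and a pool of workers pulls from it until a shutdown sentinel arrives. The names and the task shape here are illustrative assumptions.

```python
# A minimal producer/consumer sketch of a central processing queue:
# tasks go in one end, a pool of workers drains the other.
import queue
import threading

NUM_WORKERS = 4
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        path = tasks.get()
        if path is None:                 # sentinel: shut this worker down
            tasks.task_done()
            break
        outcome = (path, "processed")    # stand-in for real processing
        with results_lock:
            results.append(outcome)
        tasks.task_done()

def run(paths):
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for p in paths:
        tasks.put(p)
    for _ in threads:                    # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return sorted(results)
```

The real queue will need much more than this – priorities, retries, and dispatching to workers on other servers – but the produce/consume loop is the core of it.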