As I stated before, the QueueManager is the heart and soul of the processing platform. In my case, it will be the hub for more than just processing. I expect to use the queue to dispatch and distribute work items to all areas of my eDiscovery platform. For example, the Queue will be sent work items for file identification, text extraction, TIFFing, building search indexes and a whole slew of other tasks. In order for the queue to be able to process an ever-growing list of work items, it needs to be very robust and very fast.
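To make that a bit more concrete, here is a minimal sketch of what a generic work item on the queue might look like. The names and fields (TaskType, WorkItem, and so on) are just illustrative placeholders for this post, not the actual classes in my code:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional
import uuid

# Illustrative task categories only; the real platform will have many more.
class TaskType(Enum):
    FILE_IDENTIFICATION = auto()
    TEXT_EXTRACTION = auto()
    TIFFING = auto()
    SEARCH_INDEXING = auto()
    OCR = auto()

@dataclass
class WorkItem:
    task_type: TaskType
    document_path: str
    parent_id: Optional[str] = None    # set when this item is a child (e.g. an attachment)
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```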
I don’t want this blog to be too technical, but I need to go into some detail to explain why I do some of these things. First of all, most of my competitors struggle when ingesting data of any size. They eventually get it done, but depending on the vendor and the size of the matter, it could take days. Truth be told, ingestion is the most CPU- and disk-I/O-intensive operation in the entire EDRM model, so it’s no wonder it can take so long. However, with the right architecture and the right software, that time can be reduced dramatically.
I’m tempted to name my competitors as I describe this process, but I won’t (at least for now). Here’s a very simple example of how this process works. Let’s assume we are ingesting just one file for this example – a Word document. Here’s what it takes to correctly process this one document:
- Move document to work area
- Hash document for storage
- Verify document is really a Word Document
- Check to see if we can DeNIST this file
- Check to see if this document has any children
- If so, extract each child to the work area and kick off the process from the beginning
- If child has no text – OCR
- Extract all metadata from the Word Document
- Check for text layer within document
- If so, extract text and add to search index
- If not, OCR and “find” text
- Persist all parent/child relationships
- Update databases
- Move to next document
The above list is very simple, and I have skipped over a lot of the smaller steps. Even so, you can see that a fair amount of work needs to be done for each file – and this was a very simple example. The same process holds true when discovering ZIP files that contain thousands of other files, or PST files that contain hundreds of thousands of emails. It becomes a very recursive process and can take a long time to complete when using the wrong architecture.
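To show what I mean by recursive, here is a very rough sketch of how one document can fan out into more work items. All of the helper functions are placeholders I have stubbed out for illustration, and the queue here is just a plain Python list; it’s a sketch of the idea, not my actual pipeline:

```python
import hashlib
from pathlib import Path

# Placeholder helpers standing in for real identification, extraction, OCR and database code.
def is_nist_listed(doc_hash): return False
def extract_children(path): return []              # e.g. attachments inside a ZIP/PST/MSG
def extract_text(path): return ""
def ocr(path): return "(ocr text)"
def extract_metadata(path): return {}
def store_document(path, doc_hash, parent_id): return doc_hash
def index_text(doc_id, text): pass
def persist_metadata(doc_id, metadata): pass

def ingest(path: Path, work_queue, parent_id=None):
    """Simplified stand-in for the per-document steps listed above."""
    doc_hash = hashlib.sha1(path.read_bytes()).hexdigest()    # hash the document for storage
    if is_nist_listed(doc_hash):                              # DeNIST: skip known system files
        return
    doc_id = store_document(path, doc_hash, parent_id)

    # Containers (ZIP, PST, MSG) queue a new ingest task for every child,
    # which is where the recursion and the fan-out come from.
    for child in extract_children(path):
        work_queue.append(lambda p=child: ingest(p, work_queue, parent_id=doc_id))

    text = extract_text(path) or ocr(path)                    # no text layer means we OCR it
    index_text(doc_id, text)
    persist_metadata(doc_id, extract_metadata(path))
```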
So, now you are probably wondering what I am doing differently – glad you asked. First of all, I break down almost all of the steps above (and a lot more) into individual units of work – what I call Tasks. Each task gets added to the Queue. Every processing server in my infrastructure asks the queue for a new Task, which is then handed off to a thread that starts working on it. Depending on the number of cores in my servers, each server will have between 8 and 24 threads working on tasks simultaneously. This allows multiple threads on multiple machines to work on the same PST, for example – hundreds of threads can swarm into the PST file and process individual MSG files (and their attachments). This architecture allows me to scale by simply adding more hardware. The software and database infrastructure is being designed to handle an incredible workload, and this is one of the keys to keeping things hitting on all cylinders.
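Here is a rough sketch of the worker side of that idea: a pool of threads on each processing server pulling tasks off a shared queue. This is just illustrative code using Python’s standard library queue, not my actual QueueManager:

```python
import queue
import threading

NUM_WORKERS = 16                      # in practice 8-24 per server, depending on cores
task_queue = queue.Queue()            # stand-in for the real QueueManager

def worker():
    while True:
        task = task_queue.get()       # each worker thread asks the queue for its next task
        try:
            task()                    # a task is just a small, self-contained unit of work
        finally:
            task_queue.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

# Producers (e.g. the ingestion code) enqueue callables; the workers swarm them in parallel.
task_queue.put(lambda: print("identify file"))
task_queue.put(lambda: print("extract text"))
task_queue.join()                     # block until every queued task has been processed
```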
The reason for this post was to talk a little bit about my QueueManager, but I got a little distracted with why I have a QueueManager in the first place. I’ll be talking a lot about the queue and how it works as I continue development, but I am just about done with the first rough draft. Over the next couple of days, I should have a proof of concept running that ties together all the pieces I have built so far. I will also be able to get some benchmarks at the same time.
For now, it’s time to get back to coding!