Wednesday, April 10, 2013

Fear Is Slowing Me Down


I’m still working on the UI portion of my eDiscovery engine.  However, it’s been a few days since I’ve made any real progress.  I’ve been busy with my day job and home life, but the more I think about it, the less I believe my current lack of progress has anything to do with how busy I’ve been lately.  Instead, I think fear is playing a huge role here.  Combined with what I call the “Development Dip”, it’s really become something I need to deal with in order to move on.

First off, the fear comes from all the what-ifs I keep asking myself.  What if nobody buys my software?  What if I finish and find that I need to target a different market?  What if it’s not as good as I think it is?  What if I have to tell my family that I spent all these months for nothing?  What if I have to give up because it’s too big for one person to pull off?  I could go on and on with what-ifs.

The “Development Dip”, as I like to call it, is part of the process for me.  When starting on a new project and outlining the requirements, it’s exciting.  When building out the architecture and figuring out what pieces are going to go where and how the system will need to scale over time, that’s a lot of fun too.  Once all of that stuff is out of the way, the coding starts.  This is where the rubber meets the road.  This is where all the late nights and weekends come into play.  I use an Agile development process and tweak it to work for me.  The idea is to work on the current “items” list and get them done in a certain period of time.  However, since there is no way to know 100% of all the requirements of an application like this up front, there are many times when things will not work as originally designed.  This is normal and part of the development process.  I just need to create a new backlog item and add it to the pool of items to be completed.  Anyway, the Development Dip, for me, comes when I start to realize my forward progress is slower than the growth of my backlog – in other words, each day (or night, in my case) I finish coding, I find myself with more work to do than when I started.  This sometimes gives me an overwhelmed feeling, or a feeling that I am moving backwards.

So, when I keep hearing myself say things like “what if nobody will buy my software?” and my TODO list continues to outgrow my DONE list, it does not take long to realize why my productivity is suffering.  I’m literally talking myself out of doing the work that needs to be done. I’m finding “reasons” not to complete my project.

When I look at this objectively, I know that these are the times when I need to keep pushing to get to my next milestone.  As each milestone gets completed, it helps fuel the mission to the next.  I need to reevaluate how I’m setting my milestones and make sure they are not too far apart.  I guess I just need more mini-victories along this journey to stay engaged and energized.

Onwards and upwards…

Monday, April 8, 2013

Excel Spreadsheets - Not Fun!


Excel Spreadsheets are by far the hardest files to work with in the eDiscovery industry.  People use Excel in ways Microsoft never intended.  The versatility of the document format combined with the ingenuity (I could use other adjectives here) of the users creates some pretty crazy documents that eventually find themselves in eDiscovery processing software.

Which brings me to the reason for this post.  During some mass testing last week, I found that in certain circumstances I was unable to find hidden sheets, rows or columns.  First of all, my processing engine is designed to find hidden areas in Excel sheets and documents.  I am currently able to identify hidden sheets, columns, rows and cells.  However, during my testing I found that when I had to extract text from Excel documents that were larger than 400,000 pages, I was missing anything hidden (those of you not in this industry would be shocked at how large some of these Excel documents can get).  Since I am not using Automation for Excel documents and instead relying on the binary structure of the document, I figured it would be an easy find.  Well, I did find the problem, but it took a very long time.  Turns out it was a bug in Microsoft’s file format on very large documents.  I was able to see the problem and work around it, but it has slowed my progress.
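
For anyone curious what “finding hidden areas” looks like in practice: my engine reads the binary structure directly, but here is a rough sketch of the same idea using Microsoft’s Open XML SDK against an .xlsx file (illustration only – this is not my actual parser, and the SDK only covers the newer format):

// Illustration only (not my binary parser): flag hidden sheets, rows and columns
// in an .xlsx file using the Open XML SDK (DocumentFormat.OpenXml NuGet package).
using System;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

static class HiddenContentScanner
{
    public static void Report(string path)
    {
        using (var doc = SpreadsheetDocument.Open(path, false))
        {
            var workbookPart = doc.WorkbookPart;

            // Hidden (or "very hidden") sheets are marked on the sheet's State attribute.
            foreach (Sheet sheet in workbookPart.Workbook.Descendants<Sheet>())
            {
                if (sheet.State != null && sheet.State.Value != SheetStateValues.Visible)
                    Console.WriteLine("Hidden sheet: " + sheet.Name);
            }

            foreach (var worksheetPart in workbookPart.WorksheetParts)
            {
                // Hidden rows and columns are flagged on the worksheet itself.
                foreach (Row row in worksheetPart.Worksheet.Descendants<Row>())
                    if (row.Hidden != null && row.Hidden.Value)
                        Console.WriteLine("Hidden row: " + row.RowIndex);

                foreach (Column col in worksheetPart.Worksheet.Descendants<Column>())
                    if (col.Hidden != null && col.Hidden.Value)
                        Console.WriteLine("Hidden columns: " + col.Min + "-" + col.Max);
            }
        }
    }
}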

As you know I started out last week working on the UI for my engine.  That came to a screeching halt once I found this issue.  As promised, I plan to share the good, bad and ugly during this process.  This was mostly bad with a pinch of ugly thrown in.  I’m back to testing the pieces of the UI that I have completed and am back on track. 

Saturday, March 30, 2013

User Interface (UI) - It Begins



Before I talk too much about the user interface, I wanted to take a step back and talk about how I have architected (and will continue to architect) the processing engine.  I’ve tried to keep the engine, or Core as I like to call it, very modular.  This allows me to create new file types, filters, etc. and add them easily to the system.  However, the Core is a bunch of code that will run on an Agent machine.  Agent machines are the workhorses that actually do the processing.  They will be controlled by a client application that does nothing more than dispatch work and allow the operator visibility into what is going on in the system at any given point.  Since the Agent machines are not interacted with directly, there is no reason to build a UI around Core.  In fact, Core will be running as a Windows Service on the Agent machines.  Each Agent machine will communicate with a couple of services in order to discover new work to perform as well as report statistical data – I will discuss how all this works in more detail in a later post.

This brings me to the reason for this post.  I started out building a simple UI that is responsible for the following:

  • Dispatch work (ingestion, text extraction, index build, convert native to intermediate, image, etc.) to Agent machines
  • Allow operator to configure “settings” for each task before submitting to Agents
  • Create new projects, custodians and tasks
  • Monitor the status of Agent machines
  • Manage workload by moving Agents from group to group
  • Create and apply filters (DeDupe, keyword, embedded items, RegEx, etc.)
  • And a whole lot more 

I used the above list (and a whole lot more) as my requirements document when I started building the UI.  One of the things I had always struggled with on this project was which technology and platform to use for the UI.  Since the UI is considered a “thin client” – all the real work is being done by the Agent machines and services – I was wide open to picking a client technology.  For example, I could run the entire system from one of the following:

  • Windows Forms application – the old-school standard desktop application
  • Windows WPF application – Windows Presentation Foundation.  Very similar to Forms, but newer and more agile
  • Web-based application – Silverlight or HTML5 browser based application – could be run from Intranet, Extranet, or even Internet if needed
  • Mobile Application – iPad, iPhone or Android based application

I quickly realized that I want to use all of the above client platforms and don’t want to be restricted if my needs change later on.  However, the thought of having to recreate the client framework every time I decided to add a new client application was a bit overwhelming.  The answer was an Application Programming Interface, or API.  By creating a Client API, I can write all the critical client-side code one time and have each client application use the API in order to “dispatch” the real work.  This way, I can focus on the look and feel of the client application instead of reinventing the wheel every time I create a new project, add a custodian, apply a filter, etc.  I can hear some of the devs out there saying, “But wait, you can’t use the same API for Windows and iOS – what gives?”  More on that later, but just know that part of my API is a service architecture using binary services.
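
To give a rough idea of what I mean by a Client API, here is a simplified sketch of the kind of surface each client would code against (the names below are placeholders for illustration, not my actual API):

// Hypothetical sketch of a client-side API surface: each UI (WPF, web, mobile)
// calls these methods instead of talking to the back-end services directly.
// Names and signatures are illustrative only.
using System;
using System.Collections.Generic;

public interface IClientApi
{
    Guid CreateProject(string projectName);
    Guid AddCustodian(Guid projectId, string custodianName);
    Guid SubmitTask(Guid projectId, string taskType, IDictionary<string, string> settings);
    void ApplyFilter(Guid projectId, string filterType, string expression);
    IReadOnlyList<AgentStatus> GetAgentStatus();
}

public sealed class AgentStatus
{
    public string MachineName { get; set; }
    public string CurrentTask { get; set; }
    public int ActiveThreads { get; set; }
}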

Unfortunately for me, I did not figure this out quickly enough, so I had a bunch of re-work to do.  Not a big deal, but if I had a dollar for every line of code I have ever replaced with another line of code over the years, I’d be writing to you from my own private island.

I am no longer restricted on what client technology I want to use.  For example, I can use WPF for the desktop application and I can create an Executive Dashboard on an iPad.  The possibilities are endless.  I’ve decided to create my primary UI using WPF for this project.  Creating a UI that requires an API is the right choice, but it will slow down the development a bit.  It complicates the design slightly, and for every non-UI function the client needs to perform, I have to write and wire up the matching method in the API – the good news is that once I get the first client completed, the second client will go much quicker since the API is already done!

The WPF UI is coming along pretty nicely.  I’m using very rich UI controls to keep the look and feel very fresh and intuitive.  I’m no artist, but I’m pretty good at creating an intuitive UI that you don’t need to go and get certified to use (not mentioning any names).

More to come…

Friday, March 22, 2013

eDiscovery File Type Handlers


I stated in my last post that I got a bit sidetracked during the last several months, but I’m back at it in full speed.  At the heart of all eDiscovery processing engines is the ability to process file types.  By “process” I am talking about the ability to:

  • Extract Metadata
  • Extract Text
  • Convert to Intermediary Format (PDF in my case)
  • Convert to TIFF/JPG and Other Images
  • OCR Non-textual Layers
  • And more 

There are many ways to architect a system to handle the above list of tasks, and I have experimented with them all.  I have decided to go with a pluggable architecture, which will allow me to add new file handlers over time without having to worry about recompiling the entire engine.  In most cases recompiling is never an issue, especially since I plan to run this from a cloud environment.  However, if I ever decide to license my engine, I’ll want a way to distribute updates that causes as little interference as possible.
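
To make “pluggable” a little more concrete, here is a simplified sketch of the general pattern – a common handler interface plus runtime discovery of handler assemblies (the names are placeholders, not my actual code):

// A minimal sketch (names are illustrative) of a pluggable file-handler model:
// each handler implements a common interface, and new handler assemblies can be
// dropped into a folder and discovered at runtime without recompiling the engine.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;

public interface IFileHandler
{
    IEnumerable<string> SupportedExtensions { get; }   // e.g. ".docx", ".xls"
    IDictionary<string, string> ExtractMetadata(string path);
    string ExtractText(string path);
    void ConvertToPdf(string path, string outputPath);
}

public static class HandlerCatalog
{
    // Load every IFileHandler implementation found in the plug-ins folder.
    public static List<IFileHandler> Load(string pluginFolder)
    {
        return Directory.GetFiles(pluginFolder, "*.dll")
            .Select(Assembly.LoadFrom)
            .SelectMany(asm => asm.GetTypes())
            .Where(t => typeof(IFileHandler).IsAssignableFrom(t) && !t.IsAbstract)
            .Select(t => (IFileHandler)Activator.CreateInstance(t))
            .ToList();
    }
}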

I have listed the file types that I have completed below.  I am using the term “completed” rather loosely here.  Every file handler below knows how to perform the above list of tasks with the exception of OCRing.  There are many occasions where I decided to write my own widget from scratch instead of paying someone licensing fees to use a widget that is already written.  OCR is not one of them.  I am testing multiple OCR solutions and will eventually pick one of them to work with.  I’ll update my findings in a future post. 

So, the list below represents all the file handlers currently supported by my eDiscovery engine.  A note on Excel and Word documents:  I decided to go back as far as Office 97 for these.  That means that, at present, I cannot process Word or Excel documents created by versions earlier than Office 97.  With limited resources and time, I felt I needed to draw a line somewhere, and this is where it was drawn.  Later, if the need arises, I will create Office handlers for pre-97 documents – again, it’s pluggable, so why not? :)

 

File Handler List:

  • PPT, PPTX, PDF, PCL, XPS
  • JPG, PNG, BMP, EMF, GIF, WMF, ICON, TIFF, SVG, EXIF
  • DOC, DOCX, TXT, OOXML, ODT, RTF, HTML, XHTML, MHTML
  • PST, OST
  • MSG, EML, ICS, VCF
  • XLS, XLSX, XLSM, XLTX, XLTM
  • ZIP, RAR, 7Z, TAR, GZ, GZIP

I’d say the above list includes most of the documents found in Windows-based collections in the eDiscovery world, but there are a few still missing that will be added in good time.  My short list contains AutoCAD, MBox, Microsoft Visio and Microsoft Project.  Those file handlers will finish my “must have” list.  Of course, I will never, ever be done adding file handlers.  So many new file types, so little time!
 
Also, I spent extra time on MSG files.  Everyone knows that MSG files are Outlook mail messages.  However, they can also be Tasks, vCards and iCal items.  When my system extracts items from a PST or OST, it determines the actual type (vCard, iCal, etc.) and saves the native in its true format.  However, if the system comes across loose MSG files that were either saved by the user or extracted by another tool, they are still evaluated to determine the correct type.  This is just one of the small things I do up front to ensure high-quality exports and deliverables on the back end – of course, I need to finish this before anyone can actually use it :)
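
For the curious, the “actual type” of an Outlook item is carried in its MAPI message class (IPM.Note for mail, IPM.Contact for contacts, IPM.Appointment for calendar items, and so on).  Here is a simplified sketch of the kind of mapping involved, assuming the message class string has already been read by the MSG parser – illustration only, my real rules are more involved:

// Illustrative only: map a MAPI message class (read elsewhere by the MSG parser)
// to the format the item should be saved as. The mapping below covers the common
// classes; the real rules may differ.
using System;

public static class OutlookItemClassifier
{
    public static string ToNativeExtension(string messageClass)
    {
        if (string.IsNullOrEmpty(messageClass)) return ".msg";

        if (messageClass.StartsWith("IPM.Contact", StringComparison.OrdinalIgnoreCase))
            return ".vcf";   // contact -> vCard
        if (messageClass.StartsWith("IPM.Appointment", StringComparison.OrdinalIgnoreCase))
            return ".ics";   // calendar item -> iCalendar
        if (messageClass.StartsWith("IPM.Task", StringComparison.OrdinalIgnoreCase))
            return ".msg";   // task -> keep as MSG
        if (messageClass.StartsWith("IPM.Note", StringComparison.OrdinalIgnoreCase))
            return ".msg";   // standard mail message

        return ".msg";       // unknown classes fall back to MSG
    }
}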
 
Now that my “Must Have” list is (almost) complete, it’s time to start working on the client (GUI) side.  My goal over the next several weeks is to get a completed processing engine up and running that will support the above list of files.  This includes the Graphical User Interface (GUI), the services to support it, and additional functionality in the core components such as TIFFing, filtering, keyword searching, etc.
 
Stay tuned…


Tuesday, March 19, 2013

I'm Back!


Yes, it has been more than a full year since my last post – pretty disappointed about that, but it is what it is.  My personal schedule was crazy last year, and since I can only work on this project during my own time, it suffered quite a bit.

Where am I now?  As of the beginning of February, I have been working hard to get the core components completed.  I’ve made some significant progress and will spend some time on that during my next post.  I wanted to update this blog for a couple of reasons.  1.  I have been getting a few emails asking where I went and whether I gave up.  2.  I do this to keep myself motivated.  It is sometimes overwhelming when I look at how much work I still have to do, and journaling my progress seems to help alleviate some of the stress.

Thursday, December 29, 2011

File Types - Got Some Of The Easy Ones Done

This week has been a busy week for me. Christmas was awesome and spending time with family and friends was even better. However, I did spend a fair amount of time working on my eDiscovery platform. I’ve been focused on individual file types this week and have knocked quite a few off my list. I think I mentioned in a previous post that I was planning on creating an “add-in” architecture for my new File Handlers.

In a nutshell, each File Handler is able to perform tasks on a specific type of file. For example, I created an OpenOfficeHandler that will work on Office 2007 and later files (Word, PowerPoint, Excel). I also created an ArchiveFileHandler that is responsible for traversing archives such as ZIP, RAR, TAR and other archive types. Now, each file handler is responsible for a number of tasks, such as:

  • Extracting file-level metadata
  • Extracting and importing children or embedded files (recursively)
  • Extracting text
  • And more specific to the type of file

I’m currently working on my OLE Container handler that will be used for many binary file types that adhere to the OLE storage standards. I’m also looking into AutoCAD files and other file types that are stored in a similar way.

By the way, one of the hardest parts of writing this code is learning how the file types are stored on disk and how to go after the bits of interest. I had to laugh when I found the Microsoft spec on how Office Open XML files are stored. The document was almost 6,000 pages long. I realize Microsoft is the king of bloat, but 6,000 pages? Really?

Monday, December 19, 2011

File Types and Metadata

I now have what I call the “Import Tasks” completed. These are all the tasks needed when first importing data into the eDiscovery platform. In fact, I call the first phase the ImportTask phase. This phase accomplishes everything I’ve already written about – such as deNist, file identification, file hashing, copying files to the production environment and so on. The next phase is the actual discovery phase. 

During the discovery phase, each file needs to have its metadata extracted and be checked for children. Some files are pure containers such as ZIP, RAR, PST, MBOX, etc., while others are not containers themselves but can still contain other files. Think of a Word document that contains an Excel spreadsheet, or an email that has one or more attachments. The discovery phase is responsible for examining the source document to determine if there are any child documents contained within. For each child found, the system needs to create a new discovery task and add it to the queue for processing.

This is where things start to get a bit tricky. Each application stores files in different ways. Some are proprietary binary formats, some are OLE Containers and others don’t fall into any category at all. My approach here is to build an “add-in” architecture that allows me to build new add-ins when I need to add a new file type. For now, I am focusing on the top 100 file types that I see in the eDiscovery world. You’d be surprised how many applications use the same storage mechanism. For example, OLE Containers are used by lots of applications. This means that I can create one type of file handler and have it act on many other types of files.

I’m just getting started here, but so far so good. Some of this work is very tedious, but it has to be done. This is probably the main reason that most people buy off-the-shelf software for production. I will most likely discuss more about the files I am supporting later. I should also mention that within my 100 file types, I plan to support many Mac formats as well. In my opinion, the eDiscovery industry has tried to avoid Mac files. However, with Mac gaining more and more market share all the time, any good eDiscovery platform worth its salt needs to process at least the popular Mac files.

Friday, December 16, 2011

Why Don't I Have Investors? Well...

From the time I decided I wanted to create eDiscovery Processing and Early Case Assessment software, I’ve been asked by friends and readers why I am doing this on my own and not out with my Executive Summary in hand trying to raise the capital needed to jump start this business. Well, to be totally honest, I went down that road and had nothing but failure after failure. The real problem was twofold:

  1. The “money guys” that I talked to were from Angel Investment groups or standard VC firms. They knew nothing about the eDiscovery industry, so it was difficult to explain why this is a good idea, why now is the perfect time, why I am the best one for the job, etc. Along with that, most of these guys are non-technical, so they just couldn’t get their arms around my vision. I kept being told that they were looking for “simple” businesses they could invest in and help steer.
  2. The second issue, and probably the biggest as I look back at it, was that I didn’t have anything. I had ideas, a business plan, some cool graphics and the like, but I did not have anything tangible that these guys could lock up as collateral if they decided to invest. I was constantly being asked how much intellectual property I had. It was just too big of a risk for these guys since everything was still in my head. 
 
So, after a couple of months of writing Executive Summaries and being rejected, I started realizing how much time I was wasting by doing something that was clearly outside my skillset. I decided to start this business on my own – one line of code at a time. I’m not working as fast as I would like since I have to keep my day job, but I’m making progress and am happy with my decision. Another decision I made is that if and when the day comes when I do find an investor, the investor will need to understand the eDiscovery industry completely. A good fit would probably be a big law firm that wants to get into DIY eDiscovery, or a professional review company that may want to expand their offerings. I’m getting ahead of myself here, but I guess what I am saying is that I am looking for money and knowledge this time around, not just money.
 
The trip down the funding road was not all bad. I learned a lot about how this part of the business works and I am sure this newfound knowledge will come in handy someday.
 
 

Monday, December 12, 2011

Scalability Change

I originally wanted to design my eDiscovery Processing platform using several self-hosted services. This would allow me to isolate frequently used tasks such as File Identification, DeNisting, Hashing, Text Extraction, etc., and also offer them as a SaaS offering to other users (customers, competitors, etc.). During some rudimentary benchmarking, I found that my services were not scaling like I needed them to. I was seeing too many “wait states” for the above listed services. It makes sense now that I think about it. I have many machines all running between 8 and 16 threads, all trying to do work at the same time. Having each thread consume these services was overwhelming the servers running the services.

I have since changed this and I now have each of these “services” included with each core process. This allows for better scalability since each new machine that gets added to the processing server pool will now have its own suite of services. Once changed and benchmarked, I saw an increase of 8X – not too shabby.

As for the SaaS services that I plan to expose to the outside world, well, those will just have to be written and implemented separately. All the code is the same, so it’s not the end of the world.

Sunday, December 11, 2011

eDiscovery Processing Update

I’ve spent a lot of time building the QueueManager. In fact, I’ve spent more than twice as long on this part of my platform as I anticipated. However, I’m very happy with the results. I now have a fully functional, high-availability, massively parallel system to submit, process and log tasks of any kind. I was also able to build a test harness (client application) to use all the new features. At this point I now have a test application that uses my core platform to:

  • Hash Files
  • Identify file types 
  • DeNist Files 
  • Move files from import location to production storage 
  • Update databases with file and storage info

Once each file has been identified and all the databases have been updated, I move the file to production storage. From there I submit a new task called DiscoverTask that, depending on the file type, will pull embedded items out of their parent file and add them to the system for further processing. This is the area that I am currently working on, and it can be a bit tricky. As an example of how this works, suppose you imported a Microsoft PST file (Outlook email storage file). The initial import would identify this as a true PST file and move it to production storage. From there, a new DiscoverTask is created and dispatched to the QueueManager. Whichever process picks up that task is responsible for opening the PST file and carving out more work items for the QueueManager. In this case, each individual email (and all of its metadata) is extracted and converted to MSG format – keeping the child and parent relationships intact and updating the databases accordingly. Each MSG file is then checked (by another process that picked up a work item from the Queue) for embedded items, and the process repeats itself. When processing PST, ZIP, RAR and other container files, it’s not uncommon to traverse dozens of levels in order to find and process all the embedded items. This is a simplified version of what it takes to process a file like this, but you get the idea. With PST files, I won’t just process the email files. I‘ll also be processing the Contacts, Calendar items, etc.
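
The recursive pattern is easier to see in code. Here is a stripped-down sketch of it – one DiscoverTask extracts its children and queues a new task for each one, and the same logic keeps drilling down until nothing is left (all names are placeholders, not my production code):

// Sketch (all names hypothetical) of the recursive discovery pattern described above:
// processing one DiscoverTask can enqueue many more, one per embedded child.
using System;
using System.Collections.Generic;

public sealed class DiscoverTask
{
    public string FilePath { get; set; }
    public Guid? ParentId { get; set; }
    public Guid Id { get; } = Guid.NewGuid();
}

public interface IWorkQueue
{
    void Enqueue(DiscoverTask task);
}

public interface IChildExtractor
{
    // Pulls embedded items (emails in a PST, attachments in an MSG, entries in a ZIP)
    // out to disk and returns their paths.
    IEnumerable<string> ExtractChildren(string path);
}

public sealed class DiscoveryWorker
{
    private readonly IWorkQueue _queue;
    private readonly IChildExtractor _extractor;

    public DiscoveryWorker(IWorkQueue queue, IChildExtractor extractor)
    {
        _queue = queue;
        _extractor = extractor;
    }

    public void Process(DiscoverTask task)
    {
        foreach (var childPath in _extractor.ExtractChildren(task.FilePath))
        {
            // Each child becomes its own task; the parent/child link rides along via ParentId.
            _queue.Enqueue(new DiscoverTask { FilePath = childPath, ParentId = task.Id });
        }
    }
}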

The eDiscovery business is evolving like crazy. Workloads are becoming much bigger and harder to manage and more and more users of eDiscovery want to bring this technology in house. I believe more than ever that moving eDiscovery products and services to the cloud is the right way to go.

Back to coding…

Monday, December 5, 2011

Why Am I Doing This?


I’ve been asked why I am doing this – in fact, I’ve asked myself that same question a few times at 3AM while working on this project. The fact is I love to create products and services that provide value. When it comes right down to it, that’s what drives me. Yes, I plan to make a ton of money at this too, but the money is not the driver.

I have said this before, but I see a lot of opportunity in the eDiscovery industry. The industry continues to evolve and get bigger and bigger in size. Not only are more companies needing eDiscovery services, but the cases are getting bigger and more complicated as well. This is one of the things that excite me because most of the current providers out there are failing to keep up with ever-growing workloads and complexity and I plan to take advantage of that.

Couple this with the fact that eDiscovery will be moving to the Cloud whether you like it or not, and you have the fire that keeps me up late at night working on this project. It’s a huge undertaking and it will take a lot of time and commitment to do it right, but THIS is the time to build this. My hope is to eventually be able to bring in some help – both at a technical level and a business level, but until I get a base infrastructure built, it’s just me – and I love every minute of it.

Friday, December 2, 2011

QueueManager - The Hub of eDiscovery Processing (the right way)

As I stated before, the QueueManager is the heart and soul of the processing platform. In my case, it will be the hub for more than just processing. I expect to use the queue to dispatch and distribute work items to all areas of my eDiscovery platform. For example, the Queue will be sent work items for file identification, text extraction, TIFFing, building search indexes and a whole slew of other tasks. In order for the queue to be able to process an ever-growing list of work items, it needs to be very robust and very fast.

I don’t want this blog to be too technical, but I need to go into some detail to explain why I do some of these things. First of all, most of my competitors struggle when ingesting data of any size. They eventually get it done, but depending on the vendor and the size of the matter, this could take days. Truth be told, ingestion is the most CPU- and disk-I/O-intensive operation of the entire EDRM model, so it’s no wonder it can take so long. However, when employing the correct architecture with the correct software, this time can be reduced dramatically.

I’m tempted to let company names of my competition fly as I describe this process, but I won’t (at least for now). Here’s a very simple example of how this process works.  Let’s assume we are ingesting just one file for this example – a Word document. Here’s what it takes to correctly process this one document:
  
  • Move document to work area
  • Hash document for storage
  • Verify document is really a Word document
  • Check to see if we can DeNist this file
  • Check to see if this document has any children
      • If so, extract each child to the work area and kick off the process from the beginning
      • If a child has no text – OCR it
  • Extract all metadata from the Word document
  • Check for a text layer within the document
      • If so, extract the text and add it to the search index
      • If not, OCR and “find” the text
  • Persist all parent/child relationships
  • Update databases
  • Move to next document
  
The above list is very simple and I have skipped over a lot of the smaller steps. Even so, you can see that a fair amount of work needs to be done with each file – and this was a very simple example. The above process holds true when discovering ZIP files with thousands of other files, or PST files that contain hundreds of thousands of emails. It becomes a very recursive process and can take a long time to complete when using the wrong architecture.

So, now you are probably wondering what I am doing different – glad you asked. First of all, I break down almost all of the steps above (and a lot more) into individual units of work – or what I call Tasks. Each task gets added to the Queue. Every processing server in my infrastructure asks the queue for a new Task. That task is then handed off to a new thread to start working on it. Depending on the number of cores in my servers, each server will have between 8 and 24 threads all working on tasks simultaneously. This allows multiple threads in multiple machines to work on the same PST for example – allowing hundreds of threads to swarm into the PST file and process individual MSG files (and their attachments). This architecture allows me to scale by simply adding more hardware. The software and database infrastructure is being designed to handle an incredible workload and this is one of the keys to keep things hitting on all cylinders.
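
Here is a stripped-down sketch of the per-server worker pool I just described – ask the queue for a task, hand it to a worker, and keep one worker per core busy. A simple delegate stands in for the real QueueManager call, and the names are placeholders:

// Sketch only: the shape of a per-server worker pool. The real QueueManager is a
// remote service; here a delegate stands in for "ask the queue for work".
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class WorkerPool
{
    private readonly Func<string> _dequeue;          // returns next task id, or null if the queue is empty
    private readonly Action<string> _processTask;    // does the actual work for one task

    public WorkerPool(Func<string> dequeue, Action<string> processTask)
    {
        _dequeue = dequeue;
        _processTask = processTask;
    }

    public Task Run(CancellationToken token)
    {
        // One worker per core; the same code scales from a laptop to a 24-core server.
        int workerCount = Environment.ProcessorCount;
        var workers = new Task[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Run(() =>
            {
                while (!token.IsCancellationRequested)
                {
                    var taskId = _dequeue();
                    if (taskId == null)
                    {
                        Thread.Sleep(500);          // nothing to do; back off briefly
                        continue;
                    }
                    _processTask(taskId);
                }
            }, token);
        }

        return Task.WhenAll(workers);
    }
}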

The reason for this post was to talk a little bit about my QueueManager, but I got a little distracted with why I have a QueueManager in the first place. I’ll be talking a lot about the queue and how it works as I continue the development, but I am just about done with the first rough draft. Over the next couple of days, I should get a proof of concept running to tie all the pieces I have built so far. I will also be able to get some benchmarks at the same time.

For now, it’s time to get back to coding!

Wednesday, November 30, 2011

File Hashing Service

In order to use my new NIST service, I have to submit a hash value for the file I want to identify. That’s where my new File Hash service comes into play. I added a bunch of convenience methods to this service so I can hash new files, look up cached hashes and compare hashes. At its core, I submit a file path that I wish to hash, and the service then builds a SHA1 hash of that file by opening a file stream and reading in blocks of bytes until it reaches the end of the file. The hash created is then saved in my Storage table along with all the other particulars of the file. Because I now have a unique signature for this file, I can also be sure I only save this file to my working project one time.

Opening up a file stream is a bit slower than just reading all bytes of the file into memory at one time, but the latter does not scale well. During the discovery process, my processing platform will encounter files ranging from one byte up to many gigabytes. In a multithreaded automated process, reading all that data into memory could have disastrous consequences. So, I sacrifice a bit of speed for stability and scalability.
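
For reference, here is roughly what the streaming approach looks like in C# – the framework does the block-by-block reading, so memory use stays flat no matter how big the file is (simplified sketch, not my production code):

// Sketch of the streaming approach described above: hashing through a FileStream
// keeps memory use flat whether the file is one byte or many gigabytes.
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    public static string Sha1Hex(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, bufferSize: 64 * 1024))
        {
            // ComputeHash reads the stream in blocks; the whole file is never in memory at once.
            byte[] hash = sha1.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}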

At this point, I have a File Identification Service, a De NIST Service and a File Hashing Service. It feels good to make progress and cross items off my whiteboard. Next on the list: Queue Manager Service. This service will be more involved and have a lot more moving parts.

To Cloud Or Not To Cloud?


...that is the question so many service providers are asking themselves now.

Many of my friends know that I am working on this project, but none of them know anything about eDiscovery or what it takes to write great processing and Early Case Assessment (ECA) software. However, most of my buddies are techies like me and understand business. It’s funny to see their faces and hear their comments when I say something like “eDiscovery is headed to the cloud”. See, most other industries are either already in the cloud, or at least taking the elevator to get there. It’s foreign to them when they find out that eDiscovery is dragging its feet, kicking and screaming. That’s where this blog and project come in. I believe there is a huge opportunity to capture the business that is already available for a cloud-based solution. I also believe it’s just a matter of time before corporations and law firms are expecting most, if not all, of the eDiscovery phases to be conducted in the cloud. Speed and price are two of the factors that will win the war to move this industry to the cloud.

The next 24 to 36 months will be an interesting transition in this space. I expect to see more people like me who believe the cloud is the future of eDiscovery creating competing products and services. I also expect to see the “big players” of the industry figure out that they are behind the curve and scramble to come up with a way to straddle the line between traditional and cloud-based infrastructure. It should make for interesting times!

De NIST Service

I just finished implementing my De NIST service. This was really easy since most of the work is being done by the National Institute of Standards and Technology. They maintain a database of more than 20 million unique SHA1 hashes. Each hash corresponds to a common file. In the eDiscovery industry, files that are identified in the NIST database have no value in an eDiscovery case. These files could be anything from standard system files to vendor print drivers to just about anything you can think of. Once identified by NIST, they have no value to us. Finding these files and weeding them out saves a lot of resources down the road.

My eDiscovery platform hashes each file it discovers (more on this in a future post) and then compares those hashes to the NIST database. Anything found gets marked as “deNisted” and moved off so that no other work is performed on that file. Files that are not found in the NIST database will be available for further discovery and processing.
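
Conceptually the service boils down to a set lookup. Here is a simplified sketch, assuming the NIST hashes have already been flattened to one SHA1 per line (the real NSRL data ships as large delimited files, so treat this as illustration only):

// Sketch only: the list format assumed here (one SHA-1 per line) is a simplification
// of the real NIST/NSRL distribution.
using System;
using System.Collections.Generic;
using System.IO;

public sealed class DeNistService
{
    private readonly HashSet<string> _nistHashes;

    public DeNistService(string hashListPath)
    {
        // Load the hash list once; 20+ million entries fit comfortably in memory on a server.
        _nistHashes = new HashSet<string>(
            File.ReadLines(hashListPath), StringComparer.OrdinalIgnoreCase);
    }

    // True when the file's hash appears in the NIST list and the file can be set aside.
    public bool IsDeNisted(string sha1Hex) => _nistHashes.Contains(sha1Hex);
}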

Another service scratched off the TODO list. I now need to create my Hash service, which will allow me to hash any file. I will use it to compare file hashes against the NIST database and, more importantly, to ensure I only save each file (unique hash) one time. This will be used in the DeDuplication process that I will get to later.

Thursday, November 24, 2011

eDiscovery File Identification

I just finished writing my File Identification service that my processing platform will use to identify binary files. This is one of the first steps in e-discovery processing since you have to know what the file type is before you know what (if anything) you can do with it. For example, some files have a text layer that needs to be extracted and indexed, some files contain vector graphics that need processing and still other files may be containers that contain other files. Correctly identifying the file is critical.

Too bad it’s not as easy as just looking up the file extension to determine the correct file type. My discovery processing engine needs to be able to tell what type of file it is dealing with even if it has been renamed or is part of a crash dump. There are many ways to accomplish this, but the one I chose is the “magic number system”.

The disadvantage to this system is that it’s designed to be used on binary files only – html, xml and text files be damned! It’s really pretty simple: I open up a byte stream, load in a bunch of bytes, and then compare several byte groups at predefined offsets and look them up in my “known file types” database. File types that are found are returned and the system moves on to the next task. File types that are not found are marked as such and stored in a special exception location. This allows me to go back, find files that are not being identified and update my file type database manually – yes, it can be a bit tedious, but it’s the best and most accurate way I have found so far.
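
Here is a simplified sketch of the magic-number idea, with a handful of well-known signatures standing in for my “known file types” database (offsets other than zero are left out to keep it short):

// Sketch of magic-number identification. The tiny signature table below stands in
// for the "known file types" database described above.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class MagicNumberIdentifier
{
    private static readonly Dictionary<string, byte[]> Signatures = new Dictionary<string, byte[]>
    {
        { "PDF",             new byte[] { 0x25, 0x50, 0x44, 0x46 } },                          // %PDF
        { "ZIP (OOXML too)", new byte[] { 0x50, 0x4B, 0x03, 0x04 } },                          // PK..
        { "OLE2 container",  new byte[] { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 } },  // DOC/XLS/MSG...
        { "PNG",             new byte[] { 0x89, 0x50, 0x4E, 0x47 } },
    };

    public static string Identify(string path)
    {
        // Only the first few bytes are needed; tiny files simply leave the buffer short.
        byte[] header = new byte[16];
        using (var stream = File.OpenRead(path))
        {
            stream.Read(header, 0, header.Length);
        }

        foreach (var entry in Signatures)
        {
            if (header.Take(entry.Value.Length).SequenceEqual(entry.Value))
                return entry.Key;
        }
        return "Unknown";    // goes to the exception location for manual review
    }
}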

Again, this was implemented as a WCF service and will be located in my “Core Services” server farm in my private cloud. Currently my cloud is my development environment, but once I get a bit further along, I’ll start shopping for a colo facility where I can house all my own equipment. I’ll have more on my cloud infrastructure later.

Well, the first service is done and I am moving on to the next – which will probably be my DeNist service.

Wednesday, November 23, 2011

Services, Services And More Services

I’ve identified a number of core services that will need to be exposed to the rest of the platform. Many of these services could also be used outside the platform as well. Because of this, I’ve decided to write these services as WCF Binary Services. Utilizing this model will allow me to consume these services from the platform as well as any other client I create that may be able to use the functionality these services provide. It also gives me a lot of flexibility on how these services are hosted and implemented. For example, my guess is that these will be self-hosted services running within a Windows Service that I create. One of the benefits of this is that I can easily load-balance these services across multiple servers within my infrastructure. However, if speed becomes an issue, I can always just include these services directly in the client. Again, it’s about being flexible and scalable.
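
For those wondering what a self-hosted WCF binary service looks like, here is a bare-bones sketch using NetTcpBinding (which uses WCF’s binary message encoder). The service and contract names are placeholders, and this needs a reference to System.ServiceModel:

// Sketch (service/contract names are illustrative) of self-hosting a WCF service
// over NetTcpBinding, which carries messages with WCF's binary encoding.
using System;
using System.ServiceModel;

[ServiceContract]
public interface IFileIdentificationService
{
    [OperationContract]
    string IdentifyFile(string path);
}

public class FileIdentificationService : IFileIdentificationService
{
    public string IdentifyFile(string path) => "Unknown"; // real logic lives in the Core
}

public static class ServiceBootstrapper
{
    public static void Main()
    {
        var host = new ServiceHost(typeof(FileIdentificationService),
                                   new Uri("net.tcp://localhost:9000/CoreServices"));
        host.AddServiceEndpoint(typeof(IFileIdentificationService),
                                new NetTcpBinding(), "FileIdentification");
        host.Open();   // in production this call would sit inside a Windows Service's OnStart

        Console.WriteLine("File identification service listening...");
        Console.ReadLine();
        host.Close();
    }
}
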
Here is a list of the services I will be working on first:  
  • File Identification Service – Identify the type of a given file
  • File Hashing Service – Generating hashes and comparing hashes for individual files
  • NIST Service – Identify files that are found in NIST
  • Queue Manager Service – Add, remove and process tasks within the queue

Tuesday, November 22, 2011

Picking The Right Tool For The Job

My goal is to write the entire e-discovery processing platform in C#. From the initial set of requirements I have created so far, I should be able to accomplish this. I’ve been writing C# since it was first released as a Beta many moons ago, but being proficient in the language is just one reason for me to use it. I am also looking ahead and know that I will not be the only one creating and maintaining this code if things go right. It will be much easier for me to find C# developers than it will be to find C++ (my other core language) developers. Also, when you get right down to the differences between managed and unmanaged code, I argue that some managed code runs faster than native code. I can hear the C++ fans booing me already, but here are a couple of examples:
  • When running managed code, the JIT compiler compiles it just before it is needed on the hosting machine. When running native code, the code is compiled ahead of time and needs to be compiled down to very generic machine code so it will run on a variety of systems. Having the JIT compiler do the work only when it is needed allows for optimizations in the code that you would not otherwise have. In other words, managed code can be optimized on the fly where native code cannot.
  • Multiple threads. By using managed code, I can write code that utilizes multiple threads and let the runtime figure out how many threads to use based on how many cores are available. This allows me to write code once and target a huge variety of systems. This goes back to my scalability discussion. Since I am processing everything in parallel, I may have a few massive machines pulling work items from my queue at the same time I have some not-so-fast machines pulling work items from the queue. The optimal number of threads running on any machine can be decided at runtime instead of compile time – a big win!
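
A tiny sketch of that last point – no thread count is baked in at compile time, so the same code spreads the work across however many cores the machine happens to have (illustration only, not my production code):

// Illustration only: the runtime sizes the worker pool from the machine it is
// running on, so this adapts to an 8-core or a 24-core server without a recompile.
using System;
using System.Threading.Tasks;

public static class ParallelWork
{
    public static void ProcessAll(string[] filePaths, Action<string> processOne)
    {
        Parallel.ForEach(filePaths, processOne);
    }
}
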
Now, I’m writing a modular platform, so if it makes sense to write native code for a specific plug-in or feature, then I won’t hesitate to do it. For now though, I’m fairly certain I can get most of this project done using managed code.
Enough theory – it’s time to roll up my sleeves and get started. I’ve identified several core services that my processing platform will need to expose, so I will be working on those over the next few days. Once I get a few of the services out of the way, I will write my first Queue Manager so I can actually do something interesting with what I have written up to that point.

Sunday, November 20, 2011

The Big Three

I’m approaching this project like every other project I’ve ever worked on - beginning with the end in mind. The three most important aspects of processing software in the ediscovery space are accuracy, speed and scalability – or as I like to say, ASS.

Accuracy is pretty self-explanatory – you just can’t be wrong when processing data. For example, identifying file types is one of the first things any good processing software needs to do. If the file is misidentified, all later processing done on that file could be useless. Another example may be a file that was identified to contain no text. In this case, the text would never make it to the search indexes – effectively eliminating any chance of the document being found during keyword searches in the Early Case Assessment (ECA) phase of the project.

Speed is critical in this business. It’s not uncommon for a service provider to process tens of millions of files in a single case. E-Discovery software must be optimized to get through all stages of the discovery process as fast as possible. I’ll be talking more about this in future posts, but know there are many ways my competition “fudge” the numbers to make it look as though they are discovering terabytes of data in mere minutes – not true, I tell you, and I will explain how this sleight of hand works later. For now, understand that I plan to process data the “right” way and will do it in a massively parallel way to keep my customers happy. After all, they are who I need to keep happy.

Scalability means many things to many people, so let me clarify what I mean and why I am going down this road. First of all, without getting into too much detail about how the latest generation of computer CPUs work, understand that all new servers nowadays have multiple cores per CPU, and you can have multiple CPUs per server. Software that is written to take advantage of this type of hardware architecture is said to take advantage of parallel processing – this is just a fancy way of saying that the software on a given machine can do multiple things at the same time if written correctly. More specifically, software can do as many simultaneous tasks as the server has cores. Pretty simple, right? My architecture will take this one step further by allowing for scalability through adding more servers to the mix. As it turns out, this is a very cost-effective way of scaling out the production environment since I can take advantage of everything each server has to offer. A positive side-effect is that it also builds redundancy into my production environment – something I have planned for another post as I get further along.

All three of these important pieces will be managed by a “processing Queue” that I am just now starting to develop. This will be the heart and soul of the processing software and will most likely go through many iterations as the project matures. I know this because I have been writing software long enough to know that an important piece like this will never get done right the first time. As more requirements are fleshed out, the more work the queue will need to do. Fun stuff!

Friday, November 18, 2011

And So It Begins...

I think it was about 18 months ago when I decided I was going to create this e-discovery business. It’s something that I understand and have had success with in the past (albeit working for the man). However, it was not until I started flowcharting some of the design processes over the last couple of weeks that I realized just how BIG this project was going to be. I will admit as I took a step back from my whiteboard and took it all in, I was a bit overwhelmed – well, more than a bit, but I’m trying to stay calm as I write this post.

I’m an object-oriented guy and have started breaking these “objects” into manageable pieces. Over the next few days/weeks, I’ll be outlining these pieces and how they fit into the bigger picture. Since my aim is to build this business around the SaaS model and run everything in the cloud, I will be creating modules that handle specific tasks. More to come…

Tuesday, November 15, 2011

E-Discovery - What's That?!?

Anyone stumbling upon my blog will most likely already know what E-Discovery is, but here is the Reader’s Digest version for those that may not know…

E-Discovery is the process of taking mass amounts of Electronically Stored Information (ESI) and making it available in a common format for lawyers, paralegals and other litigation support specialists that wish to work with the data. By using software we can “convert” just about every file type on a person’s computer (that person is also known as a custodian) to a common format so users of the software can then run text searches, skin-tone analysis, predictive models and a whole slew of other processes that I will cover in this blog. I’m not a lawyer (and I won’t be playing one in this blog), but all the user is really doing is gaining a better understanding of how good or bad the case may be for them – allowing them to make educated decisions on how (or if) to pursue the case. This is very over-simplified, so here is more information.

I have ventured down the path of writing the software that does the heavy lifting. My focus (at least for now) will be in the following three areas:
  • Processing – the act of getting data in a standard format described above
  • Early Case Assessment – client facing software allowing quick culling and review of a case or project
  • Cloud-based everything! – This is what sets me apart from my competition. Everything will be designed to run in the cloud. Crazy, I know…