Wednesday, April 10, 2013

Fear Is Slowing Me Down


I’m still working on the UI portion of my eDiscovery engine.  However, it’s been a few days since I’ve made any real progress.  I’ve been busy with my day job and home life, but the more I think about it, the less I believe my current lack of progress has anything to do with how busy I have been lately.  Instead, I think fear is playing a huge role here.  Combined with what I call the “Development Dip”, it’s really become something I need to deal with in order to move on.

First off, the fear comes from all the what-ifs I keep asking myself.  What if nobody buys my software?  What if I finish and find that I need to target a different market?  What if it’s not as good as I think it is?  What if I have to tell my family that I spent all these months for nothing?  What if I have to give up because it’s too big for one person to pull off?  I could go on and on with what-ifs.

The “Development Dip”, as I like to call it, is part of the process for me.  When starting on a new project and outlining the requirements, it’s exciting.  When building out the architecture and figuring out what pieces are going to go where and how the system will need to scale over time, that’s a lot of fun too.  Once all of that stuff is out of the way, the coding starts.  This is where the rubber meets the road.  This is where all the late nights and weekends come into play.  I use an Agile development process and tweak it to work for me.  The idea is to work on the current “items” list and get those items done in a certain period of time.  However, since there is no way to know 100% of an application’s requirements up front, there are many times when things will not work as originally designed.  This is normal and part of the development process.  I just need to create a new backlog item and add it to the pool of items to be completed.  Anyway, the Development Dip, for me, comes when I start to realize my forward progress is slower than the growth of my backlog – in other words, each day (or night, in my case) that I finish coding, I find myself with more work to do than when I started.  This sometimes leaves me feeling overwhelmed, or feeling like I am moving backwards.

So, when I keep hearing myself say things like “what if nobody will buy my software?” and my TODO list continues to outgrow my DONE list, it does not take long to realize why my productivity is suffering.  I’m literally talking myself out of doing the work that needs to be done. I’m finding “reasons” not to complete my project.

When I look at this objectively, I know that these are the times when I need to keep pushing to get to my next milestone.  As each milestone gets completed, it helps fuel the mission to the next.  I need to reevaluate how I’m setting my milestones and make sure they are not too far apart.  I guess I just need more mini-victories along this journey to stay engaged and energized.

Onwards and upwards…

Monday, April 8, 2013

Excel Spreadsheets - Not Fun!


Excel Spreadsheets are by far the hardest files to work with in the eDiscovery industry.  People use Excel in ways Microsoft never intended.  The versatility of the document format combined with the ingenuity (I could use other adjectives here) of the users creates some pretty crazy documents that eventually find themselves in eDiscovery processing software.

Which brings me to the reason for this post.  During some mass testing last week, I found that in certain circumstances I was unable to find hidden sheets, rows or columns.  My processing engine is designed to find hidden areas in Excel documents, and I am currently able to identify hidden sheets, columns, rows and cells.  However, during my testing I found that when extracting text from Excel documents larger than 400,000 pages, I was missing anything hidden (those of you not in this industry would be shocked at how large some of these Excel documents can get).  Since I am not using Automation for Excel documents and instead rely on the binary structure of the document, I figured it would be an easy find.  Well, I did find the problem, but it took a very long time.  It turned out to be a bug in Microsoft’s file format on very large documents.  I was able to see the problem and work around it, but it has slowed my progress.
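
To give a feel for what “hidden areas” means in practice, here is a rough sketch of the idea.  It is not the binary-level code in my engine – it uses Microsoft’s Open XML SDK against the newer XLSX format – but it shows the three kinds of hidden content being checked: sheets, rows and columns.

using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

static class HiddenAreaDemo
{
    // Illustrative only: report hidden sheets, rows and columns in an XLSX file.
    public static void Report(string path)
    {
        using (var doc = SpreadsheetDocument.Open(path, false))
        {
            var workbook = doc.WorkbookPart;

            // Hidden (and "very hidden") sheets are flagged on the workbook's sheet list.
            foreach (var sheet in workbook.Workbook.Sheets.Elements<Sheet>())
            {
                if (sheet.State != null && sheet.State.Value != SheetStateValues.Visible)
                    Console.WriteLine("Hidden sheet: " + sheet.Name);
            }

            // Hidden rows and columns are flagged on each individual worksheet.
            foreach (var wsPart in workbook.WorksheetParts)
            {
                int hiddenRows = wsPart.Worksheet.Descendants<Row>()
                    .Count(r => r.Hidden != null && r.Hidden.Value);
                int hiddenCols = wsPart.Worksheet.Descendants<Column>()
                    .Count(c => c.Hidden != null && c.Hidden.Value);
                Console.WriteLine(wsPart.Uri + ": " + hiddenRows + " hidden rows, " + hiddenCols + " hidden columns");
            }
        }
    }
}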

As you know, I started out last week working on the UI for my engine.  That came to a screeching halt once I found this issue.  As promised, I plan to share the good, the bad and the ugly during this process – this was mostly bad with a pinch of ugly thrown in.  I’m now back to testing the pieces of the UI that I have completed and back on track.

Saturday, March 30, 2013

User Interface (UI) - It Begins



Before I talk too much about the user interface, I wanted to take a step back and talk about how I have architected (and will continue to architect) the processing engine.  I’ve tried to keep the engine, or Core as I like to call it, very modular.  This allows me to create new file types, filters, etc. and add them easily to the system.  However, the Core is a bunch of code that will run on an Agent machine.  Agent machines are the workhorses that actually do the processing.  They will be controlled by a client application that does nothing more than dispatch work and give the operator visibility into what is going on in the system at any given point.  Since the Agent machines are not interacted with directly, there is no reason to build a UI around Core.  In fact, Core will be running as a Windows Service on the Agent machines.  Each Agent machine will communicate with a couple of services in order to discover new work to perform as well as report statistical data – I will discuss how all this works in more detail in a later post.
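
As a rough sketch of what the Agent side boils down to (the names here are made up for illustration – this is not the actual service code), think of a Windows Service that hosts Core, polls for work, and reports back:

using System.ServiceProcess;
using System.Timers;

// Illustrative sketch only - class and method names are hypothetical.
public class AgentService : ServiceBase
{
    private readonly Timer _pollTimer = new Timer(5000); // check for new work every 5 seconds

    protected override void OnStart(string[] args)
    {
        _pollTimer.Elapsed += (sender, e) => PollForWork();
        _pollTimer.Start();
    }

    protected override void OnStop()
    {
        _pollTimer.Stop();
    }

    private void PollForWork()
    {
        // 1. Ask the dispatch service whether there is a task queued for this Agent.
        // 2. Hand the task to Core (ingestion, text extraction, imaging, etc.).
        // 3. Report progress and statistics back to the reporting service.
    }
}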

This brings me to the reason for this post.  I started out building a simple UI that is responsible for the following:

  • Dispatch work (ingestion, text extraction, index build, convert native to intermediate, image, etc.) to Agent machines
  • Allow operator to configure “settings” for each task before submitting to Agents
  • Create new projects, custodians and tasks
  • Monitor the status of Agent machines
  • Manage workload by moving Agents from group to group
  • Create and apply filters (DeDupe, keyword, embedded items, RegEx, etc.)
  • And a whole lot more 

I used the above list (and a whole lot more) as my requirements document when I started building the UI.  One of the things I had always struggled with on this project was which technology and platform I should use for the UI.  Since the UI is a “thin client” – all the real work is being done by the Agent machines and services – I was wide open to picking a client technology.  For example, I could run the entire system from one of the following:

  • Windows Forms application – the old-school standard desktop application
  • Windows WPF application – Windows Presentation Foundation.  Similar to Forms in purpose, but newer and more flexible
  • Web-based application – a Silverlight or HTML5 browser-based application – could be run from an Intranet, Extranet, or even the Internet if needed
  • Mobile Application – iPad, iPhone or Android based application

I quickly realized that I want to use all of the above client platforms and don’t want to be restricted if my needs change later on.  However, the thought of having to recreate the client framework every time I decided to add a new client application was a bit overwhelming.  The answer was an Application Programming Interface, or API.  By creating a Client API, I can write all the critical client-side code one time and have each client application use the API to “dispatch” the real work.  This way, I can focus on the look and feel of the client application instead of reinventing the wheel every time I create a new project, add a custodian, apply a filter, etc.  I can hear some of the devs out there saying, “But wait, you can’t use the same API for Windows and iOS – what gives?”  More on that later, but just know that part of my API is a service architecture using binary services.
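
To make that a little more concrete, here is roughly the shape I have in mind for the Client API.  The names below are placeholders for illustration – not the actual interface – but the point is that every front end (WPF, web, mobile) calls the same methods and never talks to the Agents or the database directly.

using System;
using System.Collections.Generic;

// Hypothetical shape of the Client API - type and method names are placeholders.
public enum TaskKind { Ingestion, TextExtraction, IndexBuild, ConvertToIntermediate, Image }

public class AgentStatus
{
    public string MachineName { get; set; }
    public string CurrentTask { get; set; }
    public bool IsBusy { get; set; }
}

public interface IEngineClient
{
    Guid CreateProject(string projectName);
    Guid AddCustodian(Guid projectId, string custodianName);

    // Dispatch a unit of work (ingestion, text extraction, imaging, etc.) to the
    // Agent pool.  The settings dictionary carries the operator-configured options.
    Guid SubmitTask(Guid projectId, TaskKind kind, IDictionary<string, string> settings);

    // Visibility into the system: what is each Agent doing right now?
    IList<AgentStatus> GetAgentStatuses();
}

Whether the caller is the desktop app, a browser, or an iPad, the work gets dispatched the same way.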

Unfortunately for me, I did not figure this out quickly enough, so I had a bunch of re-work to do.  Not a big deal, but if I had a dollar for every line of code I have ever replaced with another line of code over the years, I’d be writing to you from my own private island.

I am no longer restricted in what client technology I want to use.  For example, I can use WPF for the desktop application and I can create an Executive Dashboard on an iPad.  The possibilities are endless.  I’ve decided to create my primary UI using WPF for this project.  Creating a UI that requires an API is the right choice, but it will slow down the development a bit.  It complicates the design slightly, and for every non-UI function the client needs to perform, I have to write the matching method in the API and wire the client up to it – the good news is that once I get the first client completed, the second client will go much quicker since the API is already done!

The WPF UI is coming along pretty nicely.  I’m using rich UI controls to keep the look and feel fresh and intuitive.  I’m no artist, but I’m pretty good at creating an intuitive UI that you don’t need to go and get certified to use (not mentioning any names).

More to come…

Friday, March 22, 2013

eDiscovery File Type Handlers


I stated in my last post that I got a bit sidetracked during the last several months, but I’m back at it at full speed.  At the heart of all eDiscovery processing engines is the ability to process file types.  By “process” I am talking about the ability to:

  • Extract Metadata
  • Extract Text
  • Convert to Intermediary Format (PDF in my case)
  • Convert to TIFF/JPG and Other Images
  • OCR Non-textual Layers
  • And more 

There are many ways to architect a system to handle the above list of tasks, and I have experimented with them all.  I have decided to go with a pluggable architecture, which will allow me to add new file handlers over time without having to recompile the entire engine.  In most cases this isn’t an issue, especially since I plan to run this from a cloud environment.  However, if I ever decide to license my engine, I’ll want a way to distribute updates that causes as little interference as possible.
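
To illustrate what I mean by “pluggable” (the names here are simplified stand-ins, not my actual interfaces), each file handler implements a common contract covering the task list above, and the engine discovers handlers at runtime by scanning a plugins folder:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;

// Simplified stand-in for the real handler contract.
public interface IFileHandler
{
    IEnumerable<string> SupportedExtensions { get; }    // e.g. ".xls", ".xlsx"
    IDictionary<string, string> ExtractMetadata(Stream file);
    string ExtractText(Stream file);
    Stream ConvertToPdf(Stream file);                    // intermediary format
    Stream ConvertToImage(Stream file, string format);   // TIFF, JPG, etc.
}

public static class HandlerCatalog
{
    // Load every IFileHandler implementation found in the plugins folder, so new
    // handlers can be dropped in without recompiling the engine.
    public static IList<IFileHandler> Load(string pluginFolder)
    {
        return Directory.GetFiles(pluginFolder, "*.dll")
            .Select(Assembly.LoadFrom)
            .SelectMany(assembly => assembly.GetTypes())
            .Where(t => typeof(IFileHandler).IsAssignableFrom(t) && !t.IsAbstract && !t.IsInterface)
            .Select(t => (IFileHandler)Activator.CreateInstance(t))
            .ToList();
    }
}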

I have listed the file types that I have completed below.  I am using the term “completed” rather loosely here.  Every file handler below knows how to perform the above list of tasks with the exception of OCRing.  There are many occasions where I decided to write my own widget from scratch instead of paying someone licensing fees to use a widget that is already written.  OCR is not one of them.  I am testing multiple OCR solutions and will eventually pick one of them to work with.  I’ll update my findings in a future post. 

So, the list below represents all the file handlers currently supported by my eDiscovery engine.  A note on Excel and Word documents: I decided to go back as far as Office 97 for these.  That means that, at present, I cannot process Word or Excel documents older than the Office 97 format.  With limited resources and time, I felt I needed to draw a line somewhere, and this is where it was drawn.  Later, if the need arises, I will create Office handlers for pre-97 documents – again, it’s pluggable, so why not? :)

File Handler List:

  • Presentations and page-layout formats: PPT, PPTX, PDF, PCL, XPS
  • Images: JPG, PNG, BMP, EMF, GIF, WMF, ICON, TIFF, SVG, EXIF
  • Word processing and text: DOC, DOCX, TXT, OOXML, ODT, RTF, HTML, XHTML, MHTML
  • Mail stores: PST, OST
  • Mail items: MSG, EML, ICS, VCF
  • Spreadsheets: XLS, XLSX, XLSM, XLTX, XLTM
  • Archives: ZIP, RAR, 7Z, TAR, GZ, GZIP

I’d say the above list covers most of the documents found in a Windows-based collection in the eDiscovery world, but there are a few still missing that will be added in good time.  My short list contains AutoCAD, MBox, Microsoft Visio and Microsoft Project.  Those file handlers will finish my “must have” list – although I will never, ever be done adding file handlers.  So many new file types, so little time!
 
Also, I spent extra time on MSG files.  Everyone knows that MSG files are Outlook mail messages.  However, they can also be Tasks, vCards and iCal items.  When my system extracts items from a PST or OST, it determines the actual type (vCard, iCal, etc.) and saves the native in its true format.  However, if the system comes across loose MSG files that were either saved by the user or extracted by another tool, they are still evaluated to determine the correct type.  This is just one of the small things I do up front to ensure high-quality exports and deliverables on the backend – of course, I need to finish this before anyone can actually use it :)
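
Without going into my actual implementation, the usual clue is the MAPI message class (the PR_MESSAGE_CLASS property) stored inside the MSG file.  Below is a simplified sketch of that kind of mapping – how the property gets read out of the MSG structure is omitted here.

using System;

// Simplified sketch: map the MAPI message class found inside an MSG file to the
// item type it really is.  Reading the property out of the MSG structure is omitted.
public enum MsgItemKind { Mail, Contact, Appointment, Task, Unknown }

public static class MsgClassifier
{
    public static MsgItemKind Classify(string messageClass)
    {
        if (string.IsNullOrEmpty(messageClass))
            return MsgItemKind.Unknown;

        if (messageClass.StartsWith("IPM.Note", StringComparison.OrdinalIgnoreCase))
            return MsgItemKind.Mail;         // regular mail message
        if (messageClass.StartsWith("IPM.Contact", StringComparison.OrdinalIgnoreCase))
            return MsgItemKind.Contact;      // save as vCard
        if (messageClass.StartsWith("IPM.Appointment", StringComparison.OrdinalIgnoreCase))
            return MsgItemKind.Appointment;  // save as iCal
        if (messageClass.StartsWith("IPM.Task", StringComparison.OrdinalIgnoreCase))
            return MsgItemKind.Task;

        return MsgItemKind.Unknown;
    }
}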
 
Now that my “Must Have” list is (almost) complete, it’s time to start working on the client (GUI) side.  My goal over the next several weeks is to get a completed processing engine up and running that supports the above list of files.  This includes the Graphical User Interface (GUI), the services to support it, and additional functionality in the core components such as TIFF conversion, filtering, keyword searching, etc.
 
Stay tuned…


Tuesday, March 19, 2013

I'm Back!


Yes, it has been more than a full year since my last post – pretty disappointed about that, but it is what it is.  My personal schedule was crazy last year, and since I can only work on this project during my own time, it suffered quite a bit.

Where am I now?  As of the beginning of February, I have been working hard to get the core components completed.  I’ve made some significant progress and will spend some time on that during my next post.  I wanted to update this blog for a couple of reasons.  First, I have been getting a few emails asking where I went and whether I gave up.  Second, I do this to keep myself motivated.  It is sometimes overwhelming when I look at how much work I still have to do, and journaling my progress seems to help alleviate some of the stress.