Saturday, March 30, 2013

User Interface (UI) - It Begins



Before I talk too much about the user interface, I wanted to take a step back and talk about how I have architected (and will continue to architect) the processing engine.  I’ve tried to keep the engine, or Core as I like to call it, very modular.  This allows me to create new file types, filters, etc. and add them easily to the system.  However, the Core is a bunch of code that will run on an Agent machine.  Agent machines are the workhorses that actually do the processing.  They will be controlled by a client application that does nothing more than dispatch work and allow the operator visibility into what is going on in the system at any given point.  Since the Agent machines are not interacted with directly, there is no reason to build a UI around Core.  In fact, Core will be running as a Windows Service on the Agent machines.  Each Agent machine will communicate with a couple of services in order to discover new work to perform as well as report statistical data – I will discuss how all this works in more detail in a later post.
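
To make that a little more concrete, here is a minimal sketch of the kind of loop the Core service might run on each Agent.  To be clear, IWorkService, WorkItem and friends are placeholder names I made up for this post, not my actual design:

    using System;
    using System.Threading;

    // Placeholder contract the Agent uses to talk to the dispatch service.
    public interface IWorkService
    {
        WorkItem GetNextWorkItem(string agentId);       // null when the queue is empty
        void ReportStatus(string agentId, string status);
    }

    public class WorkItem { /* task type, settings, file references, ... */ }

    public static class Core
    {
        // Stand-in for the modular engine described above.
        public static void Process(WorkItem item) { /* run handlers, filters, ... */ }
    }

    public class AgentWorker
    {
        private readonly IWorkService _dispatch;
        private readonly string _agentId;

        public AgentWorker(IWorkService dispatch, string agentId)
        {
            _dispatch = dispatch;
            _agentId = agentId;
        }

        // Called from the Windows Service's worker thread.
        public void Run(CancellationToken token)
        {
            while (!token.IsCancellationRequested)
            {
                WorkItem item = _dispatch.GetNextWorkItem(_agentId);
                if (item == null)
                {
                    Thread.Sleep(TimeSpan.FromSeconds(5)); // idle until work shows up
                    continue;
                }

                // Error handling and retries omitted for brevity.
                _dispatch.ReportStatus(_agentId, "Processing");
                Core.Process(item);
                _dispatch.ReportStatus(_agentId, "Completed");
            }
        }
    }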

This brings me to the reason for this post.  I started out building a simple UI that is responsible for the following:

  • Dispatch work (ingestion, text extraction, index build, convert native to intermediate, image, etc.) to Agent machines
  • Allow operator to configure “settings” for each task before submitting to Agents
  • Create new projects, custodians and tasks
  • Monitor the status of Agent machines
  • Manage workload by moving Agents from group to group
  • Create and apply filters (DeDupe, keyword, embedded items, RegEx, etc.)
  • And a whole lot more 

I used the above list (and a whole lot more) as my requirements document when I started building the UI.  One of the things I had always struggled with on this project was which technology and platform to use for the UI.  Since the UI is considered a “thin client” – all the real work is done by the Agent machines and services – I was wide open when it came to picking a client technology.  For example, I could run the entire system from any of the following:

  • Windows Forms application – the old-school standard desktop application
  • Windows WPF application – Windows Presentation Foundation; similar to Forms, but newer and more flexible
  • Web-based application – a Silverlight or HTML5 browser-based application that could run from an intranet, an extranet, or even the Internet if needed
  • Mobile application – an iPad, iPhone or Android-based application

I quickly realized that I want to use all of the above client platforms and don’t want to be restricted if my needs change later on.  However, the thought of having to recreate the client framework every time I decided to add a new client application was a bit overwhelming.  The answer was an Application Programming Interface, or API.  By creating a Client API, I can write all the critical client-side code one time and have each client application use the API to “dispatch” the real work.  This way, I can focus on the look and feel of the client application instead of reinventing the wheel every time I create a new project, add a custodian, apply a filter, etc.  I can hear some of the devs out there saying: but wait, you can’t use the same API for Windows and iOS – what gives?  More on that later, but just know that part of my API is a service architecture built on binary services.
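
As a rough illustration, here is the shape I have in mind for the Client API.  The names below are hypothetical stand-ins for this post, not my final interface – the point is that WPF, web and mobile clients all code against the same surface, and the binary service layer sits behind it:

    using System;
    using System.Collections.Generic;

    // Every client application (WPF, web, mobile) calls this one surface.
    // Behind the scenes the calls travel over the binary service layer.
    public interface IClientApi
    {
        Guid CreateProject(string projectName);
        Guid AddCustodian(Guid projectId, string custodianName);
        void ApplyFilter(Guid projectId, string filterDefinition);  // DeDupe, keyword, RegEx, ...
        void DispatchTask(Guid projectId, string taskDefinition);   // ingestion, text extraction, ...
        IEnumerable<string> GetAgentStatuses();                     // for the monitoring screens
    }

So a button click in the WPF client and a refresh on an iPad dashboard would both boil down to the same call – say, api.GetAgentStatuses() – and the only work left per platform is presentation.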

Unfortunately for me, I did not figure this out quickly enough, so I had a bunch of re-work to do.  Not a big deal, but if I had a dollar for every line of code I have ever replaced with another line of code over the years, I’d be writing to you from my own private island.

I am no longer restricted in which client technology I use.  For example, I can use WPF for the desktop application and create an Executive Dashboard on an iPad.  The possibilities are endless.  I’ve decided to create my primary UI for this project using WPF.  Building the UI against an API is the right choice, but it will slow down development a bit.  It complicates the design slightly, and for every non-UI function the client needs to perform, I have to write the matching method in the API and wire the client to it – the good news is that once the first client is completed, the second client will go much quicker since the API is already done!

The WPF UI is coming along pretty nicely.  I’m using very rich UI controls to keep the look and feel fresh and modern.  I’m no artist, but I’m pretty good at creating an intuitive UI that you don’t need to go and get certified to use (not mentioning any names).

More to come…

Friday, March 22, 2013

eDiscovery File Type Handlers


I stated in my last post that I got a bit sidetracked during the last several months, but I’m back at it at full speed.  At the heart of all eDiscovery processing engines is the ability to process file types.  By “process” I am talking about the ability to:

  • Extract Metadata
  • Extract Text
  • Convert to Intermediary Format (PDF in my case)
  • Convert to TIFF/JPG and Other Images
  • OCR Non-textual Layers
  • And more 

There are many ways to architect a system to handle the above list of tasks, and I have experimented with them all.  I have decided to go with a pluggable architecture, which will allow me to add new file handlers over time without having to recompile the entire engine.  In most cases recompiling would never be an issue, especially since I plan to run this from a cloud environment.  However, if I ever decide to license my engine, I’ll want a way to distribute updates with as little interference as possible.
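
For the curious, here is roughly what I mean by pluggable.  The IFileHandler contract and the loader below are simplified sketches for this post, not my actual interfaces, but the mechanics are the same: drop a new handler DLL into a folder and the engine picks it up without a recompile:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Reflection;

    // The contract every file handler plug-in implements.
    public interface IFileHandler
    {
        IEnumerable<string> SupportedExtensions { get; }    // e.g. ".doc", ".docx"
        IDictionary<string, string> ExtractMetadata(Stream file);
        string ExtractText(Stream file);
        Stream ConvertToPdf(Stream file);                   // my intermediary format
        Stream ConvertToImage(Stream file, string format);  // TIFF, JPG, ...
    }

    public static class HandlerLoader
    {
        // Scan the plug-in folder so new handlers can be added
        // without recompiling the engine.
        public static IEnumerable<IFileHandler> Load(string pluginFolder)
        {
            return Directory.GetFiles(pluginFolder, "*.dll")
                .Select(Assembly.LoadFrom)
                .SelectMany(assembly => assembly.GetTypes())
                .Where(t => typeof(IFileHandler).IsAssignableFrom(t)
                            && t.IsClass && !t.IsAbstract)
                .Select(t => (IFileHandler)Activator.CreateInstance(t));
        }
    }

Something like MEF could handle the discovery for me, but plain reflection keeps the sketch dependency-free.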

I have listed the file types that I have completed below.  I am using the term “completed” rather loosely here.  Every file handler below knows how to perform the above list of tasks with the exception of OCRing.  There are many occasions where I decided to write my own widget from scratch instead of paying someone licensing fees to use a widget that is already written.  OCR is not one of them.  I am testing multiple OCR solutions and will eventually pick one of them to work with.  I’ll update my findings in a future post. 

So, the list below represents all the file handlers currently supported by my eDiscovery engine.  A note on Excel and Word documents:  I decided to go back as far as Office 97 for these.  That means that, at present, I cannot process Word or Excel documents earlier than 97.  With limited resources and time, I felt I needed to draw a line somewhere, and this is where it was drawn.  Later, if the need arises, I will create Office handlers for pre-97 documents – again, it’s pluggable, so why not :)

 

File Handler List:

  • PPT, PPTX, PDF, PCL, XPS
  • JPG, PNG, BMP, EMF, GIF, WMF, ICON, TIFF, SVG, EXIF
  • DOC, DOCX, TXT, OOXML, ODT, RTF, HTML, XHTML, MHTML
  • PST, OST
  • MSG, EML, ICS, VCF
  • XLS, XLSX, XLSM, XLTX, XLTM
  • ZIP, RAR, 7Z, TAR, GZ, GZIP

I’d say the above list covers most of the documents found in a Windows-based collection in the eDiscovery world, but there are a few still missing that will be added in good time.  My short list contains AutoCAD, MBox, Microsoft Visio and Microsoft Project.  Those file handlers will finish my “must have” list.  Then again, I will never, ever be done adding file handlers.  So many new file types, so little time!
 
Also, I spent extra time on MSG files.  Everyone knows that MSG files are Outlook mail messages.  However, they can also be Tasks, vCards and iCal items.  When my system extracts items from a PST or OST, it determines the actual type (vCard, iCal, etc.) and saves the native in its true format.  And if the system comes across loose MSG files that were either saved by the user or extracted by another tool, they are still evaluated to determine the correct type.  This is just one of the small things I do up front to ensure high-quality exports and deliverables on the backend – of course, I need to finish this before anyone can actually use it :)
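
The detection itself is conceptually simple: every MSG file carries a MAPI “message class” property (PR_MESSAGE_CLASS), and that string gives away what the item really is.  Here is a sketch of the mapping – reading the property out of the compound file is left to a helper, and the names below are mine, not a real library’s:

    using System;

    public enum MsgItemType { Mail, Task, Contact, Appointment, Unknown }

    public static class MsgClassifier
    {
        // Map the MAPI message class string to the item's true type
        // so the native can be saved in its correct format.
        public static MsgItemType Classify(string messageClass)
        {
            if (string.IsNullOrEmpty(messageClass))
                return MsgItemType.Unknown;
            if (messageClass.StartsWith("IPM.Note", StringComparison.OrdinalIgnoreCase))
                return MsgItemType.Mail;
            if (messageClass.StartsWith("IPM.Task", StringComparison.OrdinalIgnoreCase))
                return MsgItemType.Task;
            if (messageClass.StartsWith("IPM.Contact", StringComparison.OrdinalIgnoreCase))
                return MsgItemType.Contact;       // save as vCard
            if (messageClass.StartsWith("IPM.Appointment", StringComparison.OrdinalIgnoreCase))
                return MsgItemType.Appointment;   // save as iCal
            return MsgItemType.Unknown;
        }
    }

Loose MSG files run through the same check as items pulled out of a PST or OST, which is how they end up in their true format either way.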
 
Now that my “Must Have” list is (almost) complete, it’s time to start working on the client (GUI) side.  My goal over the next several weeks is to get a complete processing engine up and running that supports the above list of files.  This includes the Graphical User Interface (GUI), the services to support it, and additional functionality in the core components such as TIFF conversion, filtering, keyword searching, etc.
 
Stay tuned…


Tuesday, March 19, 2013

I'm Back!


Yes, it has been more than a full year since my last post – pretty disappointed about that, but it is what it is.  My personal schedule was crazy last year, and since I can only work on this project during my own time, it suffered quite a bit.

Where am I now?  As of the beginning of February, I have been working hard to get the core components completed.  I’ve made some significant progress and will spend some time on that in my next post.  I wanted to update this blog for a couple of reasons.  First, I have been getting a few emails asking where I went and whether I gave up.  Second, I do this to keep myself motivated.  It is sometimes overwhelming when I look at how much work I still have to do, and journaling my progress seems to help alleviate some of the stress.