In order to use my new NIST service, I have to submit a hash value for the file I want to identify. That’s where my new File Hash service comes into play. I added a bunch of convenience methods to this service so I can hash new files, look up cached hashes and compare hashes. At its core, I submit a file path that I wish to hash, and the service builds a SHA1 hash of that file by opening a file stream and reading in blocks of bytes until it reaches the end of the file. The hash is then saved in my Storage table along with all the other particulars of the file. Because I now have a unique signature for the file, I can also be sure I save it to my working project only once.
Opening up a file stream is a bit slower than just reading all bytes of the file into memory at one time, but the latter does not scale well. During the discovery process, my processing platform will encounter files ranging from one byte up to many gigabytes. In a multithreaded automated process, reading all that data into memory could have disastrous consequences. So, I sacrifice a bit of speed for stability and scalability.
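For the curious, here is a minimal sketch of the streaming approach, assuming a plain static helper rather than my actual WCF contract; the 64 KB buffer size and the FileHasher/ComputeSha1 names are just illustrative:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    // Reads the file through a stream in blocks, so memory use stays flat
    // whether the file is one byte or many gigabytes.
    public static string ComputeSha1(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 64 * 1024))
        {
            byte[] hash = sha1.ComputeHash(stream); // ComputeHash reads the stream block by block
            return BitConverter.ToString(hash).Replace("-", ""); // hex string, e.g. "DA39A3EE..."
        }
    }
}
```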
At this point, I have a File Identification Service, a De NIST Service and a File Hashing Service. It feels good to make progress and cross items off my whiteboard. Next on the list: Queue Manager Service. This service will be more involved and have a lot more moving parts.
Wednesday, November 30, 2011
To Cloud Or Not To Cloud?
...that is the question so many service providers are asking themselves now.
Many of my friends know that I am working on this project, but none of them know anything about eDiscovery or what it takes to write great processing and Early Case Assessment (ECA) software. However, most of my buddies are techies like me and understand business. It’s funny to see their faces and hear their comments when I say something like “eDiscovery is headed to the cloud”. See, most other industries are either already in the cloud, or at least taking the elevator to get there. It’s foreign to them when they find out that eDiscovery is dragging its feet, kicking and screaming. That’s where this blog and project come in. I believe there is a huge opportunity to capture business that is already available to a cloud-based solution, and I believe it’s just a matter of time before corporations and law firms expect most, if not all, of the eDiscovery phases to be conducted in the cloud. Speed and price are two of the factors that will win the war to move this industry to the cloud.
The next 24 to 36 months will be an interesting transition in this space. I expect to see more people like me who believe the cloud is the future of eDiscovery creating competing products and services. I also expect to see the “big players” of the industry figure out that they are behind the curve and scramble to come up with a way to straddle the line of traditional infrastructure and cloud-based infrastructure. It should make for interesting times!
De NIST Service
I just finished implementing my De NIST service. This was really easy since most of the work is being done by the National Institute of Standards and Technology. They maintain a database of more than 20 million unique SHA1 hashes, each corresponding to a common file. In the eDiscovery industry, files that are identified in the NIST database have no value in an eDiscovery case. These files could be anything from standard system files to vendor print drivers to just about anything else you can think of. Once identified by NIST, they have no value to us. Finding these files and weeding them out saves a lot of resources down the road.
My eDiscovery platform hashes each file it discovers (more on this in a future post) and then compares those hashes to the NIST database. Anything found gets marked as “deNisted” and moved off so that no other work is performed on that file. Files that are not found in the NIST database will be available for further discovery and processing.
Another service scratched off the TODO list. I now need to create my Hash service which will allow me to hash any file. I will use this to compare it against the NIST database and, more importantly, I will use it to ensure I only save each file (unique hash) one time. This will be used in the De Duplication process that I will get to later.
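To make the comparison step concrete, here is a rough sketch of the lookup side, assuming the NIST hashes have already been exported to a simple one-hash-per-line file; the real NSRL data ships as CSV and my hashes actually live in a database, so treat the names and the file format here as hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public class DeNistChecker
{
    private readonly HashSet<string> _nistHashes;

    // nistHashFile is assumed to contain one SHA1 hash per line.
    public DeNistChecker(string nistHashFile)
    {
        _nistHashes = new HashSet<string>(
            File.ReadLines(nistHashFile).Select(line => line.Trim().ToUpperInvariant()),
            StringComparer.Ordinal);
    }

    // True when the hash appears in the NIST list, meaning the file can be
    // marked "deNisted" and skipped by every later stage.
    public bool IsKnownNistFile(string sha1Hash)
    {
        return _nistHashes.Contains(sha1Hash.Trim().ToUpperInvariant());
    }
}
```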
Thursday, November 24, 2011
eDiscovery File Identification
I just finished writing my File Identification service that my processing platform will use to identify binary files. This is one of the first steps in e-discovery processing since you have to know what the file type is before you know what (if anything) you can do with it. For example, some files have a text layer that needs to be extracted and indexed, some files contain vector graphics that need processing and still other files may be containers that contain other files. Correctly identifying the file is critical.
Too bad it’s not as easy as just looking up the file extension to determine the correct file type. My discovery processing engine needs to be able to tell what type of file it is dealing with even if it has been renamed or is part of a crash dump. There are many ways to accomplish this, but the one I chose is the “magic number system”.
The disadvantage to this system is that it’s designed to be used on binary files only – html, xml and text files be damned! It’s really pretty simple: I open a byte stream and load in a bunch of bytes, then compare several byte groups at predefined offsets against my “known file types” database. File types that are found are returned, and the system moves on to the next task. File types that are not found are marked as such and stored in a special exception location. This allows me to go back, find files that are not being identified and update my file type database manually – yes, it can be a bit tedious, but it’s the best and most accurate way I have found so far.
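Here is a stripped-down sketch of that lookup with a few well-known signatures hard-coded in place of my database; the %PDF, PK and OLE2 signatures are real, but the structure and names are illustrative only:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class MagicNumberIdentifier
{
    // (offset, signature bytes, file type); in the real service these rows
    // come from the "known file types" database.
    private static readonly List<Tuple<int, byte[], string>> Signatures =
        new List<Tuple<int, byte[], string>>
        {
            Tuple.Create(0, new byte[] { 0x25, 0x50, 0x44, 0x46 }, "PDF"),           // "%PDF"
            Tuple.Create(0, new byte[] { 0x50, 0x4B, 0x03, 0x04 }, "ZIP container"), // "PK" + 03 04
            Tuple.Create(0, new byte[] { 0xD0, 0xCF, 0x11, 0xE0 }, "OLE2 (legacy Office)"),
        };

    public static string Identify(string path)
    {
        var header = new byte[512]; // plenty for the offsets checked here
        using (var fs = File.OpenRead(path))
        {
            int read = fs.Read(header, 0, header.Length);
            foreach (var sig in Signatures)
            {
                if (sig.Item1 + sig.Item2.Length <= read &&
                    header.Skip(sig.Item1).Take(sig.Item2.Length).SequenceEqual(sig.Item2))
                {
                    return sig.Item3;
                }
            }
        }
        return null; // unknown; the real service routes these to the exception location
    }
}
```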
Again, this was implemented as a WCF service and will be located in my “Core Services” server farm in my private cloud. Currently my cloud is my development environment, but once I get a bit further along, I’ll start shopping for a colo facility where I can house all my own equipment. I’ll have more on my cloud infrastructure later.
Well, the first service is done and I am moving on to the next – which will probably be my DeNist service.
Wednesday, November 23, 2011
Services, Services And More Services
I’ve identified a number of core services that will need to be exposed to the rest of the platform. Many of these services could also be used outside the platform as well. Because of this, I’ve decided to write these services as WCF Binary Services. Utilizing this model will allow me to consume these services from the platform as well as from any other client I create that may be able to use the functionality these services provide. It also gives me a lot of flexibility in how these services are hosted and implemented. For example, my guess is that these will be self-hosted services running within a Windows Service that I create. One of the benefits of this is that I can easily load-balance these services across multiple servers within my infrastructure. However, if speed becomes an issue, I can always just include these services directly in the client. Again, it’s about being flexible and scalable.
Here is a list of the services I will be working on first:
- File Identification Service – Identify the type of a given file
- File Hashing Service – Generating hashes and comparing hashes for individual files
- NIST Service – Identify files that are found in NIST
- Queue Manager Service – Add, remove and process tasks within the queue
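To give an idea of the hosting model, here is a rough sketch of what self-hosting one of these services inside a Windows Service might look like; the contract, class names, port and address are all placeholders and not my final design:

```csharp
using System.ServiceModel;
using System.ServiceProcess;

[ServiceContract]
public interface IFileIdentificationService
{
    [OperationContract]
    string IdentifyFile(string path);
}

public class FileIdentificationService : IFileIdentificationService
{
    public string IdentifyFile(string path)
    {
        // Placeholder; the real implementation does the magic-number lookup.
        return "unknown";
    }
}

public class CoreServicesHost : ServiceBase
{
    private ServiceHost _host;

    public static void Main()
    {
        ServiceBase.Run(new CoreServicesHost());
    }

    protected override void OnStart(string[] args)
    {
        _host = new ServiceHost(typeof(FileIdentificationService));

        // NetTcpBinding uses a binary message encoding over TCP by default,
        // which is the "WCF Binary Services" model described above.
        _host.AddServiceEndpoint(
            typeof(IFileIdentificationService),
            new NetTcpBinding(),
            "net.tcp://localhost:9000/FileIdentification");

        _host.Open();
    }

    protected override void OnStop()
    {
        if (_host != null)
        {
            _host.Close();
            _host = null;
        }
    }
}
```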
Tuesday, November 22, 2011
Picking The Right Tool For The Job
My goal is to write the entire e-discovery processing platform in C#. From the initial set of requirements I have created so far, I should be able to accomplish this. I’ve been writing C# since it was first released as a beta many moons ago, but being proficient in the language is just one reason for me to use it. I am also looking ahead and know that I will not be the only one creating and maintaining this code if things go right. It will be much easier for me to find C# developers than it will be to find C++ (my other core language) developers. Also, when you get right down to the differences between managed and unmanaged code, I argue that some managed code runs faster than native code. I can hear the C++ fans booing me already, but here are a couple of examples:
- When running managed code, the JIT compiler compiles it just before it is needed on the hosting machine. When running native code, the code is compiled at design time and needs to be compiled down to very generic machine code so it will run on a variety of systems. Having the JIT compiler do the work only when it is needed allows for optimization in the code that you would not otherwise have. In other words, managed code can be optimized on the fly where native code cannot.
- Multiple threads. By using managed code, I can write code that utilizes multiple threads and let the runtime figure out how many threads to use based on how many cores are available. This allows me to write code once and target a huge variety of systems. This goes back to my scalability discussion. Since I am processing everything in parallel, I may have a few massive machines pulling work items from my queue at the same time I have some not-so-fast machines pulling work items from the queue. The optimal number of threads running on any machine can be decided at runtime instead of compile time – a big win!
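As a rough illustration of that second point, the Task Parallel Library in .NET 4 picks a degree of parallelism suited to whatever machine the code lands on; the method names below are made up for the example:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class WorkItemProcessor
{
    // The runtime decides how many items run concurrently based on the cores
    // available on this particular machine; no per-machine tuning is needed.
    public static void ProcessAll(IEnumerable<string> workItemPaths)
    {
        Parallel.ForEach(workItemPaths, path =>
        {
            ProcessSingleItem(path); // hash, identify, deNist, etc. would happen here
        });
    }

    private static void ProcessSingleItem(string path)
    {
        // Placeholder for the real per-file pipeline.
    }
}
```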
Now, I’m writing a modular platform, so if it makes sense to write native code for a specific plug-in or feature, then I won’t hesitate to do it. For now though, I’m fairly certain I can get most of this project done using managed code.
Enough theory – it’s time to roll up my sleeves and get started. I’ve identified several core services that my processing platform will need to expose, so I will be working on those over the next few days. Once I get a few of the services out of the way, I will write my first Queue Manager so I can actually do something interesting with what I have written up to that point.
Sunday, November 20, 2011
The Big Three
I’m approaching this project like every other project I’ve ever worked on - beginning with the end in mind. The three most important aspects of processing software in the ediscovery space are accuracy, speed and scalability – or as I like to say, ASS.
Accuracy is pretty self-explanatory – you just can’t be wrong when processing data. For example, identifying file types is one of the first things any good processing software needs to do. If a file is misidentified, all later processing done on that file could be useless. Another example is a file incorrectly identified as containing no text. In that case, the text would never make it into the search indexes – effectively eliminating any chance of the document being found during keyword searches in the Early Case Assessment (ECA) phase of the project.
Speed is critical in this business. It’s not uncommon for a service provider to process tens of millions of files in a single case. E-Discovery software must be optimized to get through all stages of the discovery process as fast as possible. I’ll be talking more about this in future posts, but know that there are many ways my competition “fudges” the numbers to make it look as though they are discovering terabytes of data in mere minutes – not true, I tell you, and I will explain how this sleight of hand works later. For now, understand that I plan to process data the “right” way and will do it in a massively parallel fashion to keep my customers happy. After all, they are the ones I need to keep happy.
Scalability means many things to many people, so let me clarify what I mean and why I am going down this road. First of all, without getting into too much detail about how the latest generation of CPUs work, understand that all new servers nowadays have multiple cores per CPU, and you can have multiple CPUs per server. Software written to take advantage of this type of hardware architecture is said to use parallel processing – a fancy way of saying that the software on a given machine can do multiple things at the same time if written correctly. More specifically, the software can run as many simultaneous tasks as the server has cores. Pretty simple, right? My architecture will take this one step further by allowing me to scale out by adding more servers to the mix. As it turns out, this is a very cost-effective way of growing the production environment since I can take advantage of everything each server has to offer. A positive side effect is that it also builds redundancy into my production environment – something I have planned for another post as I get further along.
All three of these important pieces will be managed by a “processing queue” that I am just now starting to develop. This will be the heart and soul of the processing software and will most likely go through many iterations as the project matures. I know this because I have been writing software long enough to know that an important piece like this never gets done right the first time. The more the requirements get fleshed out, the more work the queue will need to do. Fun stuff!
Friday, November 18, 2011
And So It Begins...
I think it was about 18 months ago when I decided I was going to create this e-discovery business. It’s something that I understand and have had success with in the past (albeit working for the man). However, it was not until I started flowcharting some of the design processes over the last couple of weeks that I realized just how BIG this project was going to be. I will admit as I took a step back from my whiteboard and took it all in, I was a bit overwhelmed – well, more than a bit, but I’m trying to stay calm as I write this post.
I’m an object-oriented guy and have started breaking these “objects” into manageable pieces. Over the next few days/weeks, I’ll be outlining these pieces and how they fit into the bigger picture. Since my aim is to build this business around the SaaS model and run everything in the cloud, I will be creating modules that handle specific tasks. More to come…
Tuesday, November 15, 2011
E-Discovery - What's That?!?
Anyone stumbling upon my blog will most likely already know what E-Discovery is, but here is the Reader’s Digest version for those that may not know…
E-Discovery is the process of taking massive amounts of Electronically Stored Information (ESI) and making it available in a common format for lawyers, paralegals and other litigation support specialists who wish to work with the data. Using software, we can “convert” just about every file type on a person’s computer (that person is known as a custodian) to a common format so users of the software can then run text searches, skin-tone analysis, predictive models and a whole slew of other processes that I will cover in this blog. I’m not a lawyer (and I won’t be playing one in this blog), but all the user is really doing is gaining a better understanding of how good or bad the case may be for them – allowing them to make educated decisions on how (or if) to pursue the case. This is very over-simplified, so here is more information.
I have ventured down the path of writing the software that does the heavy lifting. My focus (at least for now) will be on the following three areas:
- Processing – the act of getting data into the standard format described above
- Early Case Assessment – client facing software allowing quick culling and review of a case or project
- Cloud-based everything! – This is what sets me apart from my competition. Everything will be designed to run in the cloud. Crazy, I know…