My eDiscovery Project: File Hashing Service

Wednesday, November 30, 2011

File Hashing Service

In order to use my new NIST service, I have to submit a hash value for the file I want to identify. That’s where my new File Hash service comes into play. I added a bunch of convenience methods to this service so I can hash new files, look up cached hashes, and compare hashes. At its core, I submit the path of the file I wish to hash, and the service builds a SHA1 hash of that file by opening a file stream and reading in blocks of bytes until it reaches the end of the file. The resulting hash is then saved in my Storage table along with all the other particulars of the file. Because I now have an effectively unique signature for the file's contents, I can also be sure I only save a given file to my working project one time.
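As a rough illustration of the block-by-block hashing described above, here is a minimal sketch in Python (chosen only for illustration; it is not necessarily the platform the service is built on). The function names `sha1_of_file` and `is_new_file`, the 1 MiB block size, and the in-memory `seen` set standing in for the Storage table are all assumptions made for this example.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # assumed block size (1 MiB); not from the original post


def sha1_of_file(path, block_size=BLOCK_SIZE):
    """Return the SHA-1 hex digest of a file, reading it one block at a time."""
    digest = hashlib.sha1()
    with open(path, "rb") as stream:
        # read() returns b"" at end of file, which stops the iteration.
        for block in iter(lambda: stream.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_new_file(path, seen):
    """Hypothetical duplicate check; `seen` is a set standing in for the Storage table."""
    file_hash = sha1_of_file(path)
    if file_hash in seen:
        return False  # same content was already saved once
    seen.add(file_hash)
    return True
```

Calling `is_new_file(path, seen)` twice on the same file returns `True` the first time and `False` the second, which is the "save it only once" behavior in miniature.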

Opening up a file stream is a bit slower than just reading all bytes of the file into memory at one time, but the latter does not scale well. During the discovery process, my processing platform will encounter files ranging from one byte up to many gigabytes. In a multithreaded automated process, several threads each reading a multi-gigabyte file fully into memory could exhaust available RAM, with disastrous consequences. So, I sacrifice a bit of speed for stability and scalability.
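To make that trade-off concrete, here is a rough Python sketch of hashing several files on a thread pool. Each worker holds only one block in its read buffer at a time, so buffer memory stays roughly proportional to the number of workers times the block size, no matter how large the files are. The names `hash_file` and `hash_many`, the worker count, and the 64 KiB block size are illustrative assumptions, not details of the actual service.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # assumed block size (64 KiB)


def hash_file(path, block_size=BLOCK_SIZE):
    """Stream a file through SHA-1 without loading it fully into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as stream:
        while True:
            block = stream.read(block_size)
            if not block:  # empty bytes means end of file
                break
            digest.update(block)
    return digest.hexdigest()


def hash_many(paths, workers=4):
    """Hash files concurrently; returns a dict mapping each path to its hex digest."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its hash.
        return dict(zip(paths, pool.map(hash_file, paths)))
```

By contrast, a version that called `stream.read()` with no size argument would pull each whole file into memory, which is exactly the behavior that does not scale.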

At this point, I have a File Identification Service, a De-NIST Service and a File Hashing Service. It feels good to make progress and cross items off my whiteboard. Next on the list: Queue Manager Service. This service will be more involved and have a lot more moving parts.

