I just finished writing my File Identification service, which my processing platform will use to identify binary files. This is one of the first steps in e-discovery processing, since you have to know what the file type is before you know what (if anything) you can do with it. For example, some files have a text layer that needs to be extracted and indexed, some contain vector graphics that need processing, and still others are containers that hold other files. Correctly identifying the file is critical.
Too bad it’s not as easy as just looking up the file extension to determine the correct file type. My discovery processing engine needs to be able to tell what type of file it is dealing with even if it has been renamed or is part of a crash dump. There are many ways to accomplish this, but the one I chose is the “magic number system”.
The disadvantage to this system is that it’s designed to be used on binary files only: html, xml and text files be damned! It’s really pretty simple. I open up a byte stream and load in a bunch of bytes, then compare several byte groups at predefined offsets against my “known file types” database. File types that are found are returned and the system moves on to the next task. File types that are not found are marked as such and stored in a special exception location. This allows me to go back, find the files that are not being identified, and update my file type database manually. Yes, it can be a bit tedious, but it’s the best and most accurate way I have found so far.
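To make the idea concrete, here is a minimal sketch of a magic number lookup in Python. It is not the code from my service (that one is a WCF service backed by a database); the `SIGNATURES` table, the labels and the function names are made up for illustration, and a real table would hold far more entries. The byte signatures themselves are the well-known ones for those formats.

```python
from typing import Optional

# Hypothetical, tiny "known file types" table: (offset, magic bytes, label).
# A real table would be much larger and live in a database.
SIGNATURES = [
    (0, b"%PDF-", "pdf"),
    (0, b"PK\x03\x04", "zip"),  # also the container for many office formats
    (0, b"\x89PNG\r\n\x1a\n", "png"),
    (0, b"\xff\xd8\xff", "jpeg"),
    (0, b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
    (257, b"ustar", "tar"),  # signature at a non-zero offset
]

# Read only as many bytes as the deepest signature requires.
HEADER_SIZE = max(offset + len(magic) for offset, magic, _ in SIGNATURES)


def identify_bytes(header: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None  # unknown: the caller can route the file to an exception location


def identify_file(path: str) -> Optional[str]:
    """Load the leading bytes of a file and look them up in the table."""
    with open(path, "rb") as stream:
        return identify_bytes(stream.read(HEADER_SIZE))
```

A file that comes back as `None` is the kind that would land in the exception location, waiting for a new entry in the table.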
Again, this was implemented as a WCF service and will be located in my “Core Services” server farm in my private cloud. Currently my cloud is my development environment, but once I get a bit further along, I’ll start shopping for a colo facility where I can house all my own equipment. I’ll have more on my cloud infrastructure later.
Well, the first service is done and I am moving on to the next – which will probably be my DeNist service.