Monday, April 8, 2013

Excel Spreadsheets - Not Fun!


Excel spreadsheets are by far the hardest files to work with in the eDiscovery industry.  People use Excel in ways Microsoft never intended.  The versatility of the document format, combined with the ingenuity (I could use other adjectives here) of its users, creates some pretty crazy documents that eventually find their way into eDiscovery processing software.

Which brings me to the reason for this post.  My processing engine is designed to find hidden areas in Excel documents, and it can currently identify hidden sheets, columns, rows and cells.  During some mass testing last week, however, I found that in certain circumstances it was missing them.  Specifically, when I had to extract text from Excel documents that were larger than 400,000 pages, I was missing anything hidden (those of you not in this industry would be shocked at how large some of these Excel documents can get).  Since I am not using Automation for Excel documents and am instead relying on the binary structure of the document, I figured it would be an easy find.  Well, I did find the problem, but it took a very long time.  It turns out it was a bug in Microsoft’s file format on very large documents.  I was able to see the problem and work around it, but it has slowed my progress.
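
To give a feel for what reading hidden flags straight from the binary structure can look like, here is a minimal Python sketch. It is not the engine's actual code. It assumes the legacy .xls (BIFF8) format and assumes the raw workbook stream has already been pulled out of the OLE container as bytes. The function names (`iter_records`, `find_hidden`) are made up for illustration, and the sketch does not track which sheet a row or column belongs to, nor does it deal with the very-large-document problem described above.

```python
import struct

# Record type ids from the BIFF8 (.xls) format.
BOUNDSHEET = 0x0085  # one per sheet; carries the sheet's visibility state
COLINFO = 0x007D     # formatting for a range of columns
ROW = 0x0208         # properties of a single row


def iter_records(stream):
    """Yield (record_type, payload) pairs from a raw BIFF record stream."""
    pos = 0
    while pos + 4 <= len(stream):
        # Every record starts with a 2-byte type and a 2-byte payload size.
        rec_type, size = struct.unpack_from("<HH", stream, pos)
        yield rec_type, stream[pos + 4 : pos + 4 + size]
        pos += 4 + size


def find_hidden(stream):
    """Collect hidden sheet names, row indexes and column ranges (hypothetical helper)."""
    found = {"sheets": [], "rows": [], "columns": []}
    for rec_type, data in iter_records(stream):
        if rec_type == BOUNDSHEET and len(data) >= 8:
            # Low two bits of byte 4: 0 = visible, 1 = hidden, 2 = very hidden.
            if data[4] & 0x03:
                cch, str_flags = data[6], data[7]
                if str_flags & 0x01:  # name stored as 2-byte characters
                    name = data[8 : 8 + 2 * cch].decode("utf-16-le")
                else:                 # name stored as 1-byte characters
                    name = data[8 : 8 + cch].decode("latin-1")
                found["sheets"].append(name)
        elif rec_type == ROW and len(data) >= 14:
            (row_index,) = struct.unpack_from("<H", data, 0)
            (row_flags,) = struct.unpack_from("<H", data, 12)
            if row_flags & 0x0020:  # zero-height bit: the row is hidden
                found["rows"].append(row_index)
        elif rec_type == COLINFO and len(data) >= 10:
            first, last = struct.unpack_from("<HH", data, 0)
            (col_flags,) = struct.unpack_from("<H", data, 8)
            if col_flags & 0x0001:  # hidden bit for the whole column range
                found["columns"].append((first, last))
    return found
```

Called as `find_hidden(workbook_bytes)`, this returns a dictionary of hidden sheet names, zero-based row indexes and (first, last) column ranges. A real engine would also need to follow each sheet's own substream and handle hidden cells, which this sketch leaves out.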

As you know, I started out last week working on the UI for my engine.  That came to a screeching halt once I found this issue.  As promised, I plan to share the good, the bad and the ugly during this process.  This was mostly bad with a pinch of ugly thrown in.  I’m now testing the pieces of the UI that I have completed, and I am back on track.
