Workflows for Semantic Analysis of Large Document Collections

Matthew Rapaport

April 13, 2008


Automated and semi-automated semantic analysis of legal documents, or ‘e-discovery’, is today a market worth hundreds of millions of dollars a year, and will soon be worth billions. The reasons are straightforward. First, lawyers and paralegals are expensive. Large litigations often involve the discovery and submission of thousands, even tens of thousands, of documents from within document collections sometimes numbering in the millions. A hundred lawyers sitting in a room can take a long time to review such collections. Second, studies have shown that even trained professionals identify only about 50% of the documents relevant to a particular litigation. Not every lawyer in the room can be an expert in the subject area of the litigation, and even when there are experts, fatigue reduces the accuracy of manual search. A 1-in-2 chance of missing a critical document that might save, or cost, a plaintiff or defendant hundreds of millions of dollars is a terrible gamble. Given the stakes involved in large litigations today, some help in improving the odds of finding what’s needed can be worth money.


Enter the art and science of semantic analysis: applying computer algorithms to text to identify meaning, then finding relevant documents and rejecting irrelevant ones based on that meaning. Pure computer-based approaches can find the same 50% of relevant documents for a fraction of the cost of a room full of attorneys. Hybrid approaches using semantic analysis coupled with expert (attorney) attention find as much as 90% of the relevant material while rejecting that which isn’t relevant. Companies in this field take some care to protect their proprietary algorithms. Whatever the algorithms used, they all begin with some canonical text form serving as input to the analytical engine. Once a “relevant subset” of the collection has been identified, some agreed-upon format is used to deliver documents back to the client. This last step is usually, though not always, necessary because the source material often exists in collections from which it is difficult for the client to extract individual documents given only a list of relevant items.


The need to produce canonical input, and some consistent delivery format, establishes a need for some data flow from the client to the analysis engine and from the analyzer back to the client. Fig 1 illustrates one possible overall workflow. The potential number of documents, and the number of possible source formats requiring canonization, are the two engineering challenges of the process. Volume has an impact on every step of the workflow, while source format issues are concentrated in the canonization step.


Input from client collections


Companies engaged in this market must prepare to store hundreds of gigabytes, sometimes even terabytes, of data for each client, though only temporarily. Once analysis is complete the data can (usually must) be deleted. Each client should have its own document domain, and each document type within that domain should be segregated in its own space. A large company will typically employ a third party to store its historical documents. Such collections may consist of individual files (possibly in compressed archives), and binary collections of e-mail folders, images, and document sets kept by various large-scale document storage and management products. Every different document format and collection should be identified and provided a file-space that contains nothing but that document or collection type. For a large corporation engaged in a substantial litigation, this might include a dozen or more different collection types. Almost all of them will be more or less binary. Little is stored as plain text these days.


Physically getting what may be terabytes of data into the local file system is often inconvenient, but not a great engineering challenge in this era of high-volume primary storage. Because the client relationship is temporary, there is usually no need to set up high-capacity pipelines between seller and client. Often it may be easier for the storage vendor to ship document collections as sets of high-capacity CDs! This avoids not only the technical issues involved in setting up data pipelines, but also the attendant security issues. E-mail and ftp (secured or unsecured), the easiest exchange protocols to establish, are often unsatisfactory for reasons of available bandwidth, while higher-bandwidth options, like sockets, need significant development. Yet secure ftp sessions can run overnight, and this, or CD exchange, is usually enough to get the target collection in-house.


Following acquisition of the raw document collections, files are prepared for format conversion. Preparation consists of three steps:


  1. Opening of individual document collections and discovery of their type from a type-specific “binary signature”.
  2. Division of binary collections into individual documents (where necessary), each with its own filename.
  3. Tagging of each document giving name, binary type, tag-date, original collection name and location.


Each document-authoring program wraps the text it produces in an envelope of binary and text data characteristic of that program. As in the format conversion step that follows, each of these characteristic signatures must be identified by the workflow software whose job it is to open the collections and discover their type. As novel types are encountered, each is added to the “domain knowledge” of the classification modules, which eventually account for all the common types. Once the type is known, it becomes possible to split individual documents from their collections. At this point, some companies may elect to store these otherwise “raw documents” in a database or file management system. Formal storage provides identification tags which may, otherwise, be prepended or appended to the raw file before format conversion.
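The signature discovery and tagging steps can be sketched in a few lines. This is a minimal illustration in Python, not any vendor's actual classifier: the signature table lists only a handful of well-known leading-byte patterns, and the names `detect_type`, `DocumentTag` and `tag_document` are hypothetical. A production table would grow as novel types are encountered.

```python
from dataclasses import dataclass
from datetime import date

# A few well-known leading-byte signatures (illustrative, far from exhaustive).
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # zip archive and zip-based formats
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),   # legacy compound office files
    (b"{\\rtf", "rtf"),
    (b"II*\x00", "tiff"),                            # little-endian TIFF image
    (b"MM\x00*", "tiff"),                            # big-endian TIFF image
]


def detect_type(head: bytes) -> str:
    """Return a type name for the first bytes of a file, or 'unknown'."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"


@dataclass
class DocumentTag:
    """The tag fields named in the preparation steps (hypothetical layout)."""
    name: str
    binary_type: str
    tag_date: str
    collection: str
    location: str


def tag_document(name, data, collection, location, today=None):
    """Build a tag for one document split out of a collection."""
    today = today or date.today()
    return DocumentTag(name, detect_type(data[:16]), today.isoformat(),
                       collection, location)
```

An "unknown" result is the signal that a new signature has to be analyzed and added to the table before the collection can be split.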


Format Conversion


This is the core of the work (Fig 2), the canonization process. It consists of the following components, not all of which will be necessary for every document:


  1. OCR of documents stored as images.
  2. Stripping binary information from various document types.
  3. Checking the conversion to ensure text is readable.
  4. Reformatting of text to match the requirements of the semantic analysis engine.


OCR technology has been mature for some years now, and many legal documents are still stored as images. Most other documents will be in binary form, and for each of the possible binary encoding schemes the engineer creates a function that strips the binary (formatting) data, leaving the text behind. ‘Function’ in this case is not necessarily a function in the normal coding sense, though it might be. It might also be a more complex program, but once created, it can be applied over and over to the same binary data type. It may turn out that a series of functions is created for each document type, each stripping a different binary overlay: for example, one function for the forms data, one for drawings, and a last one for the text format data.
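One way to organize such a series of functions is a registry keyed by document type, with each type mapped to an ordered chain of stripping functions. The sketch below is in Python and uses deliberately simple, hypothetical strippers (control characters, angle-bracket markup, runs of whitespace). Real formats need far more analysis, and the type names "markup" and "plain" are assumptions for illustration only.

```python
import re
from typing import Callable, Dict, List

Stripper = Callable[[str], str]


def strip_control_chars(text: str) -> str:
    """Drop non-printing control characters, keeping tab, LF and CR."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)


def strip_markup_tags(text: str) -> str:
    """Remove anything that looks like an angle-bracket tag."""
    return re.sub(r"<[^>]+>", "", text)


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of spaces and tabs, and trim both ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


# Each document type gets its own ordered series of stripping functions.
STRIPPERS: Dict[str, List[Stripper]] = {
    "markup": [strip_control_chars, strip_markup_tags, collapse_whitespace],
    "plain": [strip_control_chars, collapse_whitespace],
}


def canonize(doc_type: str, raw: str) -> str:
    """Apply every stripping function registered for doc_type, in order."""
    if doc_type not in STRIPPERS:
        raise ValueError(f"no stripping functions registered for {doc_type!r}")
    for strip in STRIPPERS[doc_type]:
        raw = strip(raw)
    return raw
```

Supporting a new format then means adding one entry to the registry, usually reusing some of the existing functions.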


Developing this function set is the key task of those engineers responsible for massaging the raw data into a form acceptable to the semantic analysis engine. A good pattern-matching and substitution language (for example, PERL; see Text Box 1) is the best overall tool for this work. The range of formats faced in the market numbers in the dozens, not hundreds or thousands, which is well within the powers of a small staff to maintain and extend as the occasional new format comes along. Initial development is another matter. The data analysis and algorithm design needed at the outset are not trivial.


Conversion and format checking help to ensure the quality of the finished product passed to the analysis engine. Comparing the frequency and types of spelling errors (for example) in the original documents (or images after OCR) to those in the same documents after formatting provides a check on the accuracy of the formatting process. Other tests may help to improve the accuracy of the subsequent automated analysis.
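The spelling-error comparison can be approximated very roughly by measuring the share of words that fall outside a known vocabulary before and after formatting. This Python sketch assumes a caller-supplied vocabulary set and an arbitrary tolerance of 0.05. Both are placeholders, and the function names are hypothetical.

```python
import re


def unknown_word_rate(text: str, vocabulary: set) -> float:
    """Fraction of alphabetic words in text that are not in the vocabulary."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in vocabulary)
    return unknown / len(words)


def conversion_degraded(before: str, after: str, vocabulary: set,
                        tolerance: float = 0.05) -> bool:
    """True when formatting raised the unknown-word rate by more than tolerance."""
    rise = unknown_word_rate(after, vocabulary) - unknown_word_rate(before, vocabulary)
    return rise > tolerance
```

A document flagged this way would be routed back for inspection of the stripping functions applied to its type, rather than passed on to the analysis engine.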




This step of the workflow moves the formatted documents into some database for hand-off to the semantic engine, and later (possibly) extraction of the discovered set for re-formatting and delivery to the client. The challenge of this stage stems from the volume of the data. The entire document set still numbers in the hundreds of thousands, possibly millions, of individual documents. After canonization, the prepared file set can be smaller than the original documents (in total bytes) by as much as 50%, but a million documents, even at less than their original size, is still not a trivial dataset.


What happens in this part of the data flow depends on the database involved. There are now several mature DBMS technologies and hardware capable of handling tens of millions of large records. Presumably each individual company will have chosen something adequate to its needs at this stage. Oracle, for example, supports a high-speed tool to facilitate large data transfers, especially productive with homogeneous datasets such as the canonized documents. Some companies will elect to develop their own high-speed file-handling systems, and there are other tools for managing very large document collections. Sterling Corporation’s “Connect Mailbox”, despite its name, is a powerful document storage and retrieval system. Though not strictly a DBMS, it is much faster than Oracle while protecting document integrity and simplifying identification and retrieval. There are still other tools, such as DBMS optimized for bibliographic collections.
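Whatever the storage product, the loading pattern is usually the same: insert canonized documents in batches, one transaction per batch, rather than one row at a time. The Python sketch below uses the standard-library `sqlite3` module purely as a stand-in for whichever DBMS a company has chosen. The table layout, the function name `load_documents` and the batch size are assumptions.

```python
import sqlite3


def load_documents(conn, docs, batch_size=1000):
    """Insert (name, doc_type, body) rows in batches; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(name TEXT PRIMARY KEY, doc_type TEXT, body TEXT)"
    )
    insert = "INSERT INTO documents VALUES (?, ?, ?)"
    total = 0
    batch = []
    for row in docs:
        batch.append(row)
        if len(batch) >= batch_size:
            with conn:  # one transaction per batch, committed on exit
                conn.executemany(insert, batch)
            total += len(batch)
            batch = []
    if batch:  # final partial batch
        with conn:
            conn.executemany(insert, batch)
        total += len(batch)
    return total
```

Because `docs` can be any iterable, including a generator that reads files from disk, the full collection never has to sit in memory at once.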


Because the work of loading text into the file or database is so dependent on individual environments, I won’t spend more time on it here. A suitably chosen workflow language will work smoothly with any DBMS or file scheme, as well as having the features to handle the necessary data transformations. There are companies who try to do this work in Java! I think this is a mistake. Script languages are better suited to this work. They perform almost as well as Java in most environments, and have a 10 to 1 advantage in development time. Most importantly, small changes in data formats (and new formats) can be accommodated quickly with small changes in scripts, compared with the build and release cycle of more formal languages.


Delivery Format Conversion


The result of semantic analysis is a list of documents that are, purportedly, the set relevant to the discovery objective. Normally, this will be a small subset of the documents handled at the front end of the process. What happens in this part of the workflow depends on individual arrangements with clients. Possibilities include:


  1. Extraction of the canonized text documents from the database or file collection followed by re-formatting of these documents into some form suitable for delivery to the client.
  2. Collection of the original, individual documents (after tagging, but before any format conversion) delivered to the client with or without tags, or possibly an alternative tagging scheme that aids client identification, sorting, viewing or printing as necessary.
  3. The document list alone is prepared for delivery, the client being able to recover the individual documents from their own collections.


This data flow step is relatively easy compared with the initial formatting of the various documents for semantic analysis. At this stage, the documents exist in canonized form, so there is only one input to any necessary conversion. Simply delivering documents in their original form (possibly minus tagging information) satisfies any client requirement for the original documents. The distinction between this step and the canonization of raw data is that in canonization the document type, not the individual client, drives process design. In this step, the document is uniform and client requirements drive the design of the data flow. The first is input-driven; this one is output-driven.
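Because this step is output-driven, it reduces to selecting the relevant documents and then branching on what the client has asked for. The Python sketch below mirrors the three delivery possibilities listed above. The `Doc` record and the mode names are hypothetical, and a real delivery step would write files rather than return lists.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Doc:
    name: str
    original_path: str    # where the tagged original document is kept
    canonical_text: str   # the canonized form used for analysis


def deliver(docs: Iterable[Doc], relevant: Set[str], mode: str) -> List:
    """Build the delivery set for one client, according to its agreed mode."""
    selected = [d for d in docs if d.name in relevant]
    if mode == "canonical_text":   # option 1: re-formatted canonized text
        return [(d.name, d.canonical_text) for d in selected]
    if mode == "originals":        # option 2: the original documents
        return [d.original_path for d in selected]
    if mode == "list_only":        # option 3: the document list alone
        return [d.name for d in selected]
    raise ValueError(f"unknown delivery mode {mode!r}")
```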


Client I/O


This step is the inverse of the first step at the beginning of the workflow and usually involves a much smaller document set. Delivery by CD, or by some other commonly available path (for example, secure ftp, e-mail, or client-initiated capture over the Web), may be a reasonable alternative depending on the size of the final document set.




Federal regulations about document archives for “civil procedures” encourage corporations to take a closer look at managing large document archives, while the litigations themselves are driving an expanding market for “e-discovery”. Automating the process of finding documents relevant to a particular litigation from massive archives has, potentially, a high payoff. E-discovery software vendors promise some cost savings over rooms filled with attorneys poring over document collections. But pure software solutions, like the manual processes that preceded them, are prone to error. This creates a market for companies that run e-discovery software engines (of their own design or from third parties), and make a business of enhancing the results with attorneys specializing in the discovery process itself.


Combining software’s reliability and repeatability with the human ability to discover associations requires movement and transformation of data into the discovery vendor, through its software, and back to the client. This, in turn, creates a need for work or data flow, and someone has to create and manage the software, the executing programs, that performs that work. Workflow in the e-discovery business is unique in a couple of ways:


  1. It involves large volumes of data delivered and processed at a point in time. The volume poses some serious (though these days not insurmountable) issues for both hardware managers and software engineers involved in making all of this perform for any given e-discovery service provider. The punctuated nature of data delivery requires the utmost performance from storage, hardware and software.
  2. Every individual client-dataflow is potentially unique (requires some “exception handling”) in one or more of its steps. Most corporate workflows run regularly, with demand, data and output requirements changing slowly. In this case, novel data is not unexpected, and output format variations are more the rule than the exception.


These two factors militate against the use of large-scale EAI software. Today this software is capable of handling the volumes of individual client data, including large numbers of very large individual files. But the performance of such packages when processing so many large files is poor. E-services are often sold close to result-set delivery deadlines set by courts! The service provider does not have the luxury of allowing a client’s data to load for a few days! Second, the EAI packages are oriented toward database-to-database (application-to-application) data transformations. There is little support for format alteration of the character representation of the data, which is exactly the kind of transformation that is the stock-in-trade of e-discovery data flows. Finally, modern EAI software doesn’t lend itself to quick adjustments of the transformation algorithms suited to a particular client’s data.


Script languages with powerful data-transformation features are much better suited to e-discovery data flow tasks. An entire script-system consisting of functionally separated layers can be changed quickly at any layer to suit the needs of a client without impinging on the data flows of other clients. If designed correctly, scripts extend easily. The goal of data flow designers should be to minimize and isolate the elements of the workflow that are often extended to meet the special needs of individual clients. Once accomplished, changing one step, for one client, without altering processes preceding or succeeding it, is not difficult. Fully optimizing the preceding and succeeding steps for maximum performance is then possible.
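A minimal sketch of this layering, in Python, keeps a default function for each step and lets an individual client override only the steps that differ. The step names, the toy transformations and the client identifier `client_b` are all hypothetical.

```python
from typing import Callable, Dict, List

Step = Callable[[List[str]], List[str]]

STEP_ORDER = ["prepare", "canonize", "deliver"]


def drop_empty(docs: List[str]) -> List[str]:
    """Default 'prepare' step: discard blank documents."""
    return [d for d in docs if d.strip()]


def normalize_spacing(docs: List[str]) -> List[str]:
    """Default 'canonize' step: squeeze all whitespace runs."""
    return [" ".join(d.split()) for d in docs]


def deliver_as_is(docs: List[str]) -> List[str]:
    """Default 'deliver' step: hand back the canonized text unchanged."""
    return docs


def deliver_uppercase(docs: List[str]) -> List[str]:
    """One client's special delivery format (purely illustrative)."""
    return [d.upper() for d in docs]


DEFAULT_STEPS: Dict[str, Step] = {
    "prepare": drop_empty,
    "canonize": normalize_spacing,
    "deliver": deliver_as_is,
}

# Only the steps that differ for a client are listed here.
CLIENT_OVERRIDES: Dict[str, Dict[str, Step]] = {
    "client_b": {"deliver": deliver_uppercase},
}


def run_workflow(client: str, docs: List[str]) -> List[str]:
    """Run every step in order, using the client's override where one exists."""
    steps = {**DEFAULT_STEPS, **CLIENT_OVERRIDES.get(client, {})}
    for name in STEP_ORDER:
        docs = steps[name](docs)
    return docs
```

Changing the delivery step for `client_b` touches neither the shared steps nor any other client's flow, which is the isolation the paragraph above argues for.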


File storage too can be optimized for what amounts to bibliographic collections. Purely with respect to data, e-discovery has much in common with other very large-scale bibliographic storage applications. General purpose DBMS such as Oracle, Sybase, or SQL Station may not be the best choice in this domain. Even with product support for large bibliographic collections, their performance suffers compared with software more specialized for the task, a subject for another article.