Leveraging machine learning to speed up fact finding in litigations



With ongoing developments in digitization and automation, the working environment is flooded with information: Digital documents and information pile up in our inboxes, converge into the cloud and accumulate on employees’ devices.

Whenever a company is involved in legal proceedings or facing litigation, potentially relevant information must be identified, collected, and evaluated. This often results in large volumes of data: even a single matter can require a review of tens of thousands of documents to classify and narrow down the pool of items that are relevant to it. And not only is the size of the typical matter growing, but so is the average number of matters, resulting in massive yearly spending on litigation.

Document review and analysis…

Document review and analysis is a crucial part of a litigation. Given the ever-increasing volume of data, finding the needle in the haystack has long since ceased to be a manual task. Classic linear reviews, which split documents into random batches to structure the review population into manageable packages, fall short in terms of efficiency and the speed at which they generate results.

Today, the help of technology is required to review the documents in an efficient and timely manner. Technology-assisted review features such as Active Learning use machine learning algorithms that can be trained individually to recognize which data could be potentially relevant to the particular matter at hand. The overall goal is to streamline the review process and increase efficiency as much as possible: to find relevant information quickly while reducing human effort. In many cases, Active Learning enables the team working on the matter to stop the manual review at an earlier stage, without having to review the complete universe of documents. The decision to stop is based on the result of the model and a statistically backed test. Hence, a well-trained model also minimizes the time and cost of such a review.

…and how it is done today: Active learning

How does this “magic” work? The algorithm builds a mathematical model based on the manual decisions of the reviewers. For each document in the review scope, a likelihood of relevancy with a respective scoring is calculated. All additional human decisions on new documents are used to continuously update the model while the review is in progress, thereby improving the results throughout the overall review process.

At a technical level, the algorithm applies a so-called support vector machine. For each document, a position in a space — a so-called vector — is calculated based on the textual content of the review scope. The algorithm then calculates a dividing plane within that space that classifies documents as relevant on the one side and not relevant on the other. With each update of the model, the position of the separating plane is further adjusted, making the classification more accurate. The scoring ultimately results from the distance of each document to the dividing plane.
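The idea can be sketched in a few lines of Python with scikit-learn. This is a minimal illustration of the general technique, not the implementation of any specific review platform; the sample texts and labels below are entirely hypothetical.

```python
# Sketch: text vectors + a linear support vector machine, where the
# signed distance to the dividing plane serves as the relevance score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Documents already reviewed by humans (1 = relevant, 0 = not relevant).
reviewed_texts = [
    "contract amendment signed by both parties",
    "quarterly cafeteria menu and parking notice",
    "email discussing the disputed delivery terms",
    "office birthday party invitation",
]
labels = [1, 0, 1, 0]

# Each document becomes a vector in a high-dimensional text space.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviewed_texts)

# The SVM fits the dividing plane between the two classes.
model = LinearSVC()
model.fit(X, labels)

# For unreviewed documents, the signed distance to the plane is the
# score: the farther on the positive side, the more likely relevant.
unreviewed = ["memo on the delivery dispute settlement"]
scores = model.decision_function(vectorizer.transform(unreviewed))
print(scores)
```

As more human decisions come in, the model is simply refit on the enlarged training set, which is what moves the dividing plane over the course of the review.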

What is important for the overall quality assurance is that the human reviewer always stays in control of the results. While Active Learning makes suggestions for the classification of documents via relevance scoring, the documents are not tagged as relevant by the algorithm. The main purpose of this workflow is to streamline the review and queue the documents for manual classification in the specific order that is most useful for the current matter. Depending on the specific requirements of the matter, Active Learning offers diverse ways for queuing in which the documents are provided to the reviewer:

Prioritized review

If the focus is to identify relevant documents as quickly as possible, a prioritized review is the tool of choice. With this approach, documents with a high score are provided to the review team first. This workflow will enable the review team to identify potentially relevant documents in a short time, but the model will need to build precision over a longer time.

Coverage review

The alternative is a coverage review. This approach is intended to train the model as quickly as possible. Therefore, the documents the system is least certain about (scores close to 50%) are provided to the review team first. With this workflow, the model gains precision faster and provides better-quality results in terms of the suggested relevance scoring.

The two approaches can also be used in combination. For example, a matter can be started in coverage review during an initial review phase, and, as soon as the model reaches the desired accuracy, the workflow can be switched to a prioritized review.
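The difference between the two queuing strategies comes down to how the unreviewed documents are ordered. A small sketch, assuming each document already carries a relevance score between 0 and 1 (the scores below are made up for illustration):

```python
# Hypothetical relevance scores for five unreviewed documents.
docs = {"doc_a": 0.92, "doc_b": 0.51, "doc_c": 0.07, "doc_d": 0.48, "doc_e": 0.75}

# Prioritized review: highest scores first, to surface likely-relevant
# documents as early as possible.
prioritized = sorted(docs, key=lambda d: docs[d], reverse=True)

# Coverage review: scores closest to 50% first -- the documents the
# model is least certain about, so each human decision teaches it the most.
coverage = sorted(docs, key=lambda d: abs(docs[d] - 0.5))

print(prioritized)  # ['doc_a', 'doc_e', 'doc_b', 'doc_d', 'doc_c']
print(coverage)     # ['doc_b', 'doc_d', 'doc_e', 'doc_a', 'doc_c']
```

Switching from coverage to prioritized review mid-matter simply means changing the sort key once the model is accurate enough.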

To allow the review to finish before the entire document population has been reviewed, a dedicated test (Elusion Test) can support the decision whether the remaining document population still has a high probability of containing relevant documents. The test is based on a sample drawn from the scope of unreviewed documents classified as not relevant by the algorithm. This sample is reviewed manually by the review team. To ensure a meaningful result, either a statistically representative sample size can be calculated by the system or a minimum fixed number of documents to be reviewed can be chosen. Based on the results of the review of the sample set, an estimate of the number of relevant documents remaining in the review scope is calculated. If the probability of missing relevant documents (Elusion Rate) is low enough, the decision can be made to end the review at the current state.
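The arithmetic behind the point estimate is straightforward. The sketch below illustrates the general idea, not the exact statistics of any particular review platform, and all the numbers are hypothetical:

```python
# Review a random sample of the documents the model classified as not
# relevant, then project the sample's error rate onto the remainder.
remaining_not_relevant = 120_000  # unreviewed docs scored "not relevant"
sample_size = 2_401               # ~ 1.96**2 * 0.25 / 0.02**2
                                  # (95% confidence, 2% margin of error)
relevant_found_in_sample = 12     # flagged as relevant by human reviewers

# Point estimate of the elusion rate and of the missed documents.
elusion_rate = relevant_found_in_sample / sample_size
estimated_missed = elusion_rate * remaining_not_relevant

print(f"elusion rate: {elusion_rate:.2%}")            # 0.50%
print(f"estimated missed documents: {estimated_missed:.0f}")  # 600
```

If the resulting rate falls below the threshold agreed for the matter, the review can be defensibly stopped; otherwise the manual review continues and the test can be repeated later.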

The review can be restarted at any time after the termination of an Active Learning review. Likewise, any number of Elusion Tests can be performed.

The review should be accompanied by constant monitoring and reporting of the review process. Illustrative graphs visualize the progress and allow transparent reporting and control over the individual phases of the review.

For example, the chart above shows the scoring over a document population of more than 400,000 documents after a successfully completed Active Learning review. While the scoring of most documents hovers in the 50% range at the beginning of a matter, this model shows clear peaks in the high and low percentage ranges. This U-shaped ("parabolic") distribution is one of the indicators for deciding whether an Elusion Test is feasible and an early review stop can be justified.


The application of Active Learning in document reviews is evolving into a well-accepted approach to conducting litigation. With the advantages described above, it can support a wide range of use cases, from small to large matters, and does not require additional setup effort compared with traditional linear, batch-based reviews. Active Learning identifies relevant documents faster, increases review speed, and reduces the effort and overall spend of manual review.




