Contemporary Challenges of Electronic Data Analysis


Times have changed since the Enron case, whose data set remains to this day the standard demonstration data set for most eDiscovery tools. Today, we are no longer dealing with a mere 500,000 American English emails and their attachments.

Setting the scene

More than 90% of the world’s electronic data was generated in the last 3 years. 2.5 quintillion bytes of electronic data are generated around the world every day, and the global volume of data is predicted to increase to 175 zettabytes by 2025.

A quintillion is a 1 followed by 18 zeros. A zettabyte is a 1 followed by 21 zeros, counted in bytes.

That is 2,500,000,000,000,000,000 bytes per day in 2020, and 175,000,000,000,000,000,000,000 bytes in total in five years’ time.

Despite all the known, and as yet unknown, challenges of electronic discovery, eDiscovery can no longer be ignored. Courts and regulatory authorities have accepted the associated risk, which, where doubts arise, concerns the accuracy with which the data is quantified and collected.

In the early years of eDiscovery we still had to go through unstructured data, with considerable manual effort, in order to identify the relevant data. This was obviously very time-consuming.

A few years later, data was converted into electronic form so that document review platforms, search functions, and CMML (Continuous Multi-Modal Learning) could be used to reduce the reliance on human “reviewers”. In addition, the number of emails increased enormously, and these could by no means be ignored in data investigations.

In the 21st century, artificial intelligence helps us to segment, filter and recognize patterns. Predictive coding and scalable cloud services are used to significantly reduce costs.

Legal and compliance issues must always be considered and are always challenging in an international environment. How can I quickly achieve initial results using Early Case Assessment? What does the atomization of information mean and how can I deal with it through de-duplication, and with which technologies?

Legal and Compliance

In the long run, GDPR will become a huge competitive advantage for Electronic Data Analysis, provided compliance does not make it more complicated and expensive. Stakeholder management from a legal and compliance perspective is vital. The use of cloud services is still seen as problematic.


The EU General Data Protection Regulation (GDPR) obviously imposes a host of enhanced obligations on anyone (a corporation, legal team, etc.) who collects and processes data, and it also provides data subjects with greater control over the use of their personal data.

In cross-border cases a simple “Just go ahead and collect the data, send it over to us, and we’ll look after it” doesn’t work.

There are countries with a comparable level of protection, like Ireland or the Netherlands, but in Germany we mostly rely on “in-country” processing and hosting along with pre-filtering, to definitively exclude any transfer to a third country that would violate data privacy. The most straightforward way to do this is to engage local resources to process local data in line with local data privacy restrictions. Gone are the days when international legal advisors could put the free movement of data around the globe, arranged to best fit their strategies and resources, ahead of current compliance regulations on the very movement of that data.

Unfortunately, even today, European enterprises continue to allow the unfiltered and unprocessed transfer of data for proceedings in jurisdictions outside the scope of the GDPR, despite the penalties that are in place.

In addition to protecting personal data governed by the GDPR, as well as data that falls within the realm of export control, banking secrecy, and intellectual property protection, eDiscovery service providers are being asked more than ever to advise their clients and their legal counsel on the meaningful, robust, and secure methodologies available today.

Stakeholder Management

In this context, it is a question of leveraging the right approach, the right workflows, and the appropriate tools, with each playing a very important role in tackling the data privacy challenge. Depending on the procedure and technology, communication challenges must also be considered.

Who informs whom, by when, and what role do employees have?

Which stakeholders need to be considered? For example:

  • Data Protection
  • Data Security and Risk
  • IT
  • Legal Department
  • Employee representatives and – not to forget –
  • the Works Council (as the shop-floor organization in many European countries that represents workers and employees and functions as a company or firm-level complement to trade unions)

Cloud Usage

The legal market players in Germany have remained extremely cloud-critical in the recent past, both on the part of law firms and clients. The emergence of virtual marketplaces (Reynen Court or Digitale Ökosystem Recht (DIKE) based on GAIA-X under the consortium leadership of the Liquid Legal Institute), the emergence of de facto industry standards (such as Litera Transact for checklist management or HighQ, the Swiss army knife among collaboration tools) and the progressive consolidation of providers (Litera or Thomson Reuters) as well as the fundamental virtualisation of the working world (Microsoft Teams or Slack for chat, audio and video conferencing, screen and file sharing) are taking place almost entirely in the cloud. Market participants will have to swallow the (perceived) bitter pill if they want to participate in the upheavals and innovation. Everything can be designed securely in terms of IT technology, data protection and professional secrecy, including §203 StGB and §43e BRAO. Even with the largest providers of standard software, it is tedious, cumbersome, elaborate and takes longer than expected to find a common denominator to adapt their processes and agree on the paperwork – but it is feasible.

Early Case Assessment (ECA)

Tell me what’s there! Is there a risk? How big is the risk?

Early Case Assessment refers to estimating the risk, time, and money involved in prosecuting or defending a legal case. The benefit of a thorough ECA is saving money: it can help organizations reduce costs across the entire Electronic Discovery Reference Model process, or in light of pending litigation, for example.

Discovering and acknowledging the critical data before others do may even prevent reputational damage.

Nowadays, no eDiscovery consultancy, vendor, or law firm providing eDiscovery services, litigation, and antitrust support can expect to “lock themselves up in a room for weeks” searching, e.g., 4 TB of data before giving an indication of how critical the case actually is.

This means that you need different tools:

  • Classic eDiscovery tools help to find where specific information sits
  • ECA tools help to understand what information is there

The expectation is that the service provider will quickly enable the defendant or client to understand the risk, scope, and size of the problem, ideally by following the ECA lifecycle, which typically includes all of the following:

  • Performing a risk-benefit analysis
  • Placing and managing a legal hold on potential documents (paper and ESI) in the appropriate countries
  • Preserving information abroad
  • Gathering relevant information for attorney and expert document review
  • Processing potentially relevant information for the purposes of filtering, search terms, or data analytics
  • Hosting information for attorney and expert document review, commenting, and redaction
  • Producing documents to the parties in the case
  • Reusing information in future cases

ECA is a comprehensive and holistic evaluation of legal liability and potential costs at the outset of a case. In addition to looking at the relevant data, ECA comprises comparing the matter against similar past matters, deciding which counsel to retain, and looking at previous court rulings to assess the viability of the matter.

Atomization of Information

Do you use beA or XNP? Have you ever answered a call with a text message? Does your corporate IP telephony system send voicemail messages via email? Does your employer try to manage email overload with collaboration platforms like Microsoft Teams or Slack? Has a client ever asked you to communicate via WhatsApp, WeChat, Signal, Threema, Instagram, Discord, Line, Facebook, or iMessage? Do you use virtual data rooms such as Ansarada, Datasite (formerly Merrill), Drooms, Imprima or any of the myriad others? Or transaction platforms like HighQ or Litera Transact? What about electronic signatures like Adobe Sign or DocuSign?

If your answer to all these questions is always “no”, then you do not need to read any further.

Communication no longer happens in a continuous stream, and documents are not stored in one or a few systems. Over their life cycle, documents sit in a wide variety of internal and external storage and archive systems and travel through various communication channels in between. There are different versions of documents, of which in turn there may be several renditions (a Word document, the PDF generated from it, the same content on a website in HTML form, or a scan of it, for example). And then, of course, there are many media discontinuities: printouts with and without handwritten notes, redacted PDFs, documents signed in wet ink or certified, photocopies, scans, or even faxes.

How do you find the relevant needle in this haystack? We can save a lot of time and effort if we only look at the relevant documents and emails, not all variants and variations. In the next chapter we will set out how to reduce the volume through de-duplication.


There are always duplicates in the myriad of documents, emails and email threads, text messages, audio and video files that must be searched. Some duplicates are easy to detect, others are difficult or cannot (yet) be identified via an automated process.

Identical files

What are hash values and why can they also be relevant in eDiscovery? Hash values are often mentioned in connection with IT security. With the most common cryptographic functions, they consist of at least 32 hexadecimal characters that do not allow any conclusions to be drawn about the original name of a file or its content (e.g., a hash value according to the MD5 function: “f1db32c3cdd736117ed924a86d7b5f8d”). Depending on the function used, hash values are computed by more or less complicated constructions.

In practice, such (cryptographic) hash functions have the advantage that it is almost impossible for different files to produce an identical hash value by chance. Each file therefore receives what is effectively a unique fingerprint, which makes it easier to check whether files have been changed afterwards. Hash functions also make it possible to search large amounts of data for identical duplicates. Hash values are therefore particularly interesting, or even essential, for law enforcement authorities, but also in the area of eDiscovery.
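
As a minimal sketch of hash-based de-duplication, the following Python snippet groups files whose bytes are identical. The file names and contents are hypothetical in-memory stand-ins, and SHA-256 is chosen here as an assumption; a real platform may use MD5, SHA-1, or another function.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def file_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file's raw bytes."""
    return hashlib.new(algorithm, data).hexdigest()


def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[file_digest(data)].append(name)
    # Only digests shared by more than one file indicate duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data; real use would read the bytes from disk.
sample_files = {
    "contract_v1.docx": b"Master Services Agreement",
    "copy_of_contract.docx": b"Master Services Agreement",
    "contract_v2.docx": b"Master Services Agreement (amended)",
}
duplicate_groups = group_exact_duplicates(sample_files)
```

Only one file per group then needs to be reviewed. Changing a single byte changes the digest, which is why this approach finds exact copies only.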

However, it is worth noting that the courts require the hash values of the individual files to be listed in a separate appendix in a comprehensible manner.

In addition, it remains to be seen whether the huge amount of work this may involve would actually make sense.

Identical content in non-identical files

A Word document, the PDF generated from it, and a scan of it all contain exactly the same content, for example. The same document attached to an email, printed out, or stripped of its metadata is also textually identical. But the files are technically very different and cannot be matched via their hash value. You need robust tools to bring the textual content to the surface and recognize such renditions as identical.
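
One way to approach this, sketched below under simplifying assumptions, is to hash the normalized text instead of the file bytes. The snippet assumes the text has already been extracted (from Word, PDF, or OCR) and that the renditions differ only in case, whitespace, and Unicode compatibility forms; the function name `content_fingerprint` and the sample strings are hypothetical.

```python
import hashlib
import re
import unicodedata


def content_fingerprint(text: str) -> str:
    """Hash normalized text so that formatting differences do not matter."""
    # NFKC folds compatibility characters, e.g. a single-glyph ligature into plain letters.
    normalized = unicodedata.normalize("NFKC", text)
    normalized = normalized.casefold()
    # Collapse line breaks, tabs, and repeated spaces into single spaces.
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical text extracted from three renditions of the same document.
word_text = "Purchase Agreement\n\nThe parties agree as follows."
pdf_text = "Purchase  Agreement The parties agree as follows."
scan_text = "PURCHASE AGREEMENT\nThe parties agree as follows. "
other_text = "Purchase Agreement\n\nThe parties disagree."

same_content = (
    content_fingerprint(word_text)
    == content_fingerprint(pdf_text)
    == content_fingerprint(scan_text)
)
```

OCR errors, hyphenation at line ends, or headers and footers would break this exact match, which is where fuzzier comparison becomes necessary.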

It becomes even more difficult when there are versions of documents that are very similar, but not identical, in content. What is the latest, what is the official version? Often there are unsigned documents, one version signed by one side and one version signed by the other, or both. Or draft documents, notarized/stamped/sealed documents, etc. Some tools or toolkits are able to find these “duplicates” and present them together, as well as cluster similar documents, e.g., contracts based on the same template.
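
Near-duplicates of this kind can be scored by textual overlap. The following is a minimal sketch using word shingles and Jaccard similarity; it is one common technique, not necessarily what any particular tool implements, and the sample texts and the shingle size of three words are assumptions.

```python
from typing import Set, Tuple


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str) -> float:
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(a), shingles(b)
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical document texts.
draft = "This agreement is made between Alpha GmbH and Beta AG on the date below"
signed = draft + " signed by both parties"
unrelated = "Minutes of the quarterly works council meeting"

similarity_draft_signed = jaccard(draft, signed)
similarity_draft_unrelated = jaccard(draft, unrelated)
```

Documents whose score exceeds a chosen threshold can be presented together for review; picking that threshold is a judgment call that depends on the data set.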

Email Threads

An email thread is a single email conversation that begins with an original email (the start of the conversation) and contains all subsequent replies and forwards to that original email.

There may be data inconsistencies, such as timestamp differences generated by different servers in different time zones, which may lead to misinterpretation. Also, an original email may end up in different sub-threads.
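
The time zone problem can be illustrated with Python's standard library: converting every `Date` header to UTC before sorting avoids misreading local clock times. The two header values are hypothetical examples.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_utc(date_header: str) -> datetime:
    """Parse an email Date header and convert it to UTC."""
    return parsedate_to_datetime(date_header).astimezone(timezone.utc)


# Hypothetical headers: the reply shows an earlier local clock time
# than the original message because it was sent from another time zone.
sent = to_utc("Mon, 02 Mar 2020 09:15:00 +0100")
reply = to_utc("Mon, 02 Mar 2020 03:20:00 -0500")

minutes_between = (reply - sent).total_seconds() / 60
```

Compared in UTC, the reply follows the original by five minutes, even though its local timestamp reads almost six hours earlier.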

Appropriate tools determine which emails are inclusive (i.e., contain unique content), the so-called leaf emails (the leaf at the end of the branch), and should therefore be reviewed. This reduces the time and complexity of reviewing emails by consolidating all forwards, replies and reply-all messages. It also helps to identify email relationships, who was in contact with whom and when, etc.
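
A highly simplified sketch of leaf detection follows. It relies only on the standard `Message-ID` and `In-Reply-To` headers and assumes that every reply quotes the full text of its parent, so that a message nobody replied to contains the whole branch. Real threading tools also compare the message content; the `Mail` class and the sample thread are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mail:
    message_id: str
    in_reply_to: Optional[str]  # None for the original email
    subject: str


def leaf_emails(mails: List[Mail]) -> List[Mail]:
    """Return the emails that no other email in the set replies to."""
    replied_to = {m.in_reply_to for m in mails if m.in_reply_to}
    return [m for m in mails if m.message_id not in replied_to]


# Hypothetical thread: a -> b -> c, plus a second branch a -> d.
thread = [
    Mail("<a@example.com>", None, "Offer"),
    Mail("<b@example.com>", "<a@example.com>", "RE: Offer"),
    Mail("<c@example.com>", "<b@example.com>", "RE: RE: Offer"),
    Mail("<d@example.com>", "<a@example.com>", "FW: Offer"),
]
leaves = [m.message_id for m in leaf_emails(thread)]
```

In this example only two of the four emails would go to review. The original email appears in both branches, which mirrors the sub-thread issue described above.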

Translations and Transcriptions

This is the hardest and newest challenge in de-duplication. We recently had a situation where we were confronted with video, audio, and text files in multiple languages. For some pieces of evidence there were video files in different formats, several stills from those videos in multiple resolutions, audio file snippets from those videos, as well as manual minutes and machine transcripts from them. The original material was in different languages, and the minutes and transcripts had been translated both manually and by machine. Luckily, these files had descriptive filenames and were presented in a well-structured way. Had this not been the case, it would have been impossible to automatically identify such “duplicates” using any existing tool.


The following describes some underlying core technologies that are essential for conducting electronic data analysis, as well as their advantages and limitations. They create real value in a well-orchestrated interplay.

Regular Expressions (RegEx)

A Regular Expression is a string that describes a set of strings with the help of syntactic rules. The most familiar, simplified form of such patterns is the wildcard. RegExes are ideally suited to finding dates or monetary sums in documents.
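
As an illustration, the sketch below extracts dates and monetary sums with Python's `re` module. Both patterns are deliberately simple assumptions that cover only the formats shown (DD.MM.YYYY, YYYY-MM-DD, and amounts with a leading currency marker); production patterns need to handle many more variants.

```python
import re

# Dates written as 03.11.2020 or 2020-12-01.
DATE_PATTERN = re.compile(r"\b(?:\d{1,2}\.\d{1,2}\.\d{4}|\d{4}-\d{2}-\d{2})\b")

# Amounts with a leading currency marker, optional thousands groups,
# and optional two decimal places (either "." or "," as separator).
AMOUNT_PATTERN = re.compile(
    r"(?:EUR|USD|€|\$)\s?\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})?"
)

# Hypothetical document text.
text = (
    "The invoice dated 03.11.2020 over EUR 12.500,00 was paid on "
    "2020-12-01, plus a fee of $ 99.90."
)

dates = DATE_PATTERN.findall(text)
amounts = AMOUNT_PATTERN.findall(text)
```

Here `dates` contains the two dates and `amounts` the two sums. Note that the patterns match on form only: they would also accept an impossible date such as 99.99.2020.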

Machine Learning (ML)

Machine Learning refers to artificial systems that learn from examples and can generalize from them after the learning phase. The ML algorithms build a statistical model based on training data. The system does not simply memorize the examples: it recognizes patterns and regularities in the training data. This helps, for example, to classify document types by their appearance. Or to recognize cat pictures.
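
To make the idea of a statistical model built from training data concrete, here is a minimal sketch of a Naive Bayes classifier in plain Python. As an assumption, it classifies documents by their words instead of their visual appearance, and the four training texts and two labels are hypothetical; real projects would use far more data and an established ML library.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train(examples: List[Tuple[str, str]]):
    """Count words per label; these counts are the statistical model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(model, text: str) -> str:
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = "", -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
model = train([
    ("invoice number total amount due payment", "invoice"),
    ("invoice total vat amount payable", "invoice"),
    ("agreement between the parties governing law", "contract"),
    ("the parties agree termination clause agreement", "contract"),
])
first_label = classify(model, "amount due on this invoice")
second_label = classify(model, "termination of the agreement between parties")
```

Neither test sentence appears in the training data, yet both are assigned a plausible label, which is the generalization described above.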

Natural Language Processing (NLP)

NLP algorithms aim to capture the meaning of the sentences in text files. The sentences are broken down into their segments, inflected forms and capitalisation are analysed, syntax is identified (e.g. subject, object, article), and semantic meaning is assigned to the parts. This is a good method for identifying information to be anonymized or pseudonymized, or for machine translation, for example.

Optical Character Recognition (OCR)

Automated text recognition within images is now a commodity and the basis for all the technologies mentioned above. In the beginning, accents (à, ç, ë, î, ø, š, ù etc.), ligatures (æ, œ, ꝏ, ffi, ffl, ß etc.) or non-Latin characters were problematic. Today, NLP, ML and dictionaries are used to improve recognition rates beyond ASCII. Text in columns, hyphenation, or drop caps at the beginning of a magazine article are also well recognized by most implementations. Handwriting is also reliably detected and reasonably well understood.

Multilingual text, tables, headers, footers, and footnotes are still major challenges in normal documents. Paragraphs that cross page boundaries can get mixed up with headers and footers, or table columns without large spacing do not always allow the correct reading flow to be detected. Frames, underlines, strikethroughs of individual words, lines or entire paragraphs also continue to confuse OCR engines.

Many eDiscovery tools generate their own OCR via an image of the documents, even if the PDFs already contain recognized text, for example. This is to ensure that the full text content of the document is included and that any flaws from older OCR implementations do not negatively affect the quality.


An open standard used to represent text recognition results is hOCR. In addition to the text, its layout, recognition accuracy, formatting and other information can be recorded. This metadata is embedded in HTML/XHTML, mainly in the class and title attributes of ordinary elements. hOCR is reportedly used by tech giants such as Apple, Google, and Microsoft through the standard free software packages in their OCR implementations.
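
The following sketch reads word text, bounding boxes, and confidence values from a small hOCR fragment using only Python's standard library. The fragment is a hypothetical example in the style that OCR engines such as Tesseract emit (`ocrx_word` spans with `bbox` and `x_wconf` in the `title` attribute); real output contains more elements and properties.

```python
from html.parser import HTMLParser


def parse_title(title: str) -> dict:
    """Extract bbox and word confidence from an hOCR title attribute."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        if fields[0] == "bbox":
            props["bbox"] = tuple(int(v) for v in fields[1:5])
        elif fields[0] == "x_wconf":
            props["confidence"] = float(fields[1])
    return props


class HocrWordParser(HTMLParser):
    """Collect every element marked with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            self._current = {"text": "", **parse_title(attributes.get("title") or "")}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


# Hypothetical hOCR fragment.
HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 36 92 300 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 62'>wor1d</span>
 </span>
</div>
"""

parser = HocrWordParser()
parser.feed(HOCR)
low_confidence = [w["text"] for w in parser.words if w["confidence"] < 80]
```

Filtering by confidence, as in the last line, is one way to route doubtful words to manual checking.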


ALTO (Analyzed Layout and Text Object) is an open XML schema for describing the layout information of digitized objects. It was developed to describe OCR results (text and layout) at the page level of digitized materials. The aim was to describe the text and layout in such a way that a reconstruction based on the digitized material would be possible. ALTO is the de facto standard for text digitization projects in Germany.
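
As a minimal sketch, the text lines of an ALTO file can be rebuilt with Python's `xml.etree.ElementTree`. The sample document is a hypothetical, heavily shortened fragment, and the namespace assumes ALTO version 4; other versions use a different namespace URI.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: ALTO v4 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}


def alto_lines(xml_text: str) -> List[str]:
    """Rebuild each TextLine from the CONTENT of its String elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.findall(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.findall("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines


# Hypothetical, shortened ALTO fragment.
ALTO_XML = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" WIDTH="800" HEIGHT="600">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Purchase" HPOS="36" VPOS="92" WIDTH="90" HEIGHT="24" WC="0.97"/>
            <SP/>
            <String CONTENT="Agreement" HPOS="134" VPOS="92" WIDTH="110" HEIGHT="24" WC="0.93"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="between" HPOS="36" VPOS="124" WIDTH="80" HEIGHT="24" WC="0.99"/>
            <SP/>
            <String CONTENT="the" HPOS="124" VPOS="124" WIDTH="30" HEIGHT="24" WC="0.99"/>
            <SP/>
            <String CONTENT="parties" HPOS="162" VPOS="124" WIDTH="70" HEIGHT="24" WC="0.98"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

lines = alto_lines(ALTO_XML)
```

The position attributes (`HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) are ignored here; they are what makes it possible to reconstruct the page layout.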

Languages and Fonts

Many eDiscovery and Legal Tech tools use Tesseract, the most common free software component for text recognition. It is mainly used for the recognition of text characters or text lines, but Tesseract can also carry out the decomposition of a text into text blocks (layout analysis) as discussed above. Text recognition data for well over 100 languages and language variants are available. It supports not only the Latin Antiqua typefaces we use in Germany today, but also Fraktur, Arabic, (simplified or traditional) Chinese, Cyrillic, Greek, Hebrew, Indic, and other scripts.

Key Learnings

It remains challenging.

Legal and compliance issues can be resolved if handled carefully. GDPR can become a competitive advantage for EU/EEA providers if the others do not adapt quickly and properly. And legally compliant cloud usage is possible with professional providers, but still tedious.

Early Case Assessment can help to quickly get on the right track.

The atomization of information can be managed using tool-supported de-duplication.

Reliable technology exists to complete most of the tasks with high quality, on time, and cost-efficiently. The combination of the tools, processes, and people that execute them is key and determines the value generated.
