Bridging the gap between structured and unstructured data


For forensic professionals handling corporate investigations, structured data and unstructured data have always been separated by a gap – the difficulty is linking or merging them so they can easily be cross-referenced and searched together.
eDiscovery processes bring order and searchability to unstructured data. Forensic data analysis, in turn, can take raw structured data that investigators receive from a company – for example, extracts from ERP systems – and make it accessible to investigators even without access to the user interface of the original application.
However, the challenge has always been linking, correlating, and merging the huge quantities of unstructured data in an investigation one-to-one with structured transactional data. Below, I explain the steps required to make each side – the structured and the unstructured data – useful on its own, then I describe why and how to bridge the gap and correlate the two types of data.

Defining eDiscovery on unstructured data
The term eDiscovery refers to the process of identification, preservation, collection, preparation, review, presentation, and production of electronically stored information (ESI) in connection with legal proceedings, internal investigations, regulatory investigations, or disputes. Through eDiscovery, unstructured data is indexed and structured so that it can be reviewed and analyzed in an investigation. Typically, this relates to emails and loose files, including text files, spreadsheets, presentations, and other human-generated content. In the eDiscovery process, the relevant data is standardized and presented in a unified format on a review platform. This facilitates analysis and ensures that relevant content can be identified more effectively. An effective eDiscovery process should cover each of the following steps.

Identification
There are many potential data sources: PCs, laptops, servers, cloud storage, flash drives, smartphones, and hard copies are just some examples. Further complexity is often added because nearly every IT infrastructure is set up slightly differently. Most companies, and often even parts of the same company, have their own configurations for servers, operating systems, applications, and backup solutions. All these different data sources must be identified in the discovery process.

Collection and preservation
After identifying all relevant data sources and media, you need to collect the data. A prerequisite for effectively and efficiently handling this process is a precise collection plan that prioritizes data sources according to their relevance and volatility. Such a plan also minimizes the impact on the client’s IT infrastructure and daily business operations during collection. You may need to collect hundreds of gigabytes or multiple terabytes. Apart from the technical considerations of handling such volumes, the data must be collected and preserved securely in order to prevent any breach of privacy or the disclosure of business secrets.
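The prioritization described above can be sketched in a few lines of code. This is purely illustrative – the source names, the 1-to-5 scales, and the rule that volatility trumps relevance are assumptions for the example, not a prescribed methodology:

```python
# Hypothetical sketch of a collection plan: rank data sources so that
# the most volatile, most relevant ones are collected first.
# All names and weightings below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    relevance: int   # 1 (low) .. 5 (high), set by the case team
    volatility: int  # 1 (stable, e.g. backup tape) .. 5 (volatile, e.g. live mailbox)

def collection_order(sources):
    """Sort sources so volatile, relevant ones come first."""
    return sorted(sources, key=lambda s: (s.volatility, s.relevance), reverse=True)

plan = collection_order([
    DataSource("backup tapes", relevance=3, volatility=1),
    DataSource("file server share", relevance=4, volatility=3),
    DataSource("custodian laptop", relevance=5, volatility=4),
    DataSource("live mailbox", relevance=5, volatility=5),
])

for source in plan:
    print(source.name)
```

A real plan would also record custodians, chain-of-custody details, and estimated volumes per source, but the ordering principle is the same.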

Review and analysis
The key step in the eDiscovery process is the review and analysis of unstructured data. Modern tools offer technology-assisted review (TAR), which uses intelligent algorithms and probability analysis to group millions of documents conceptually by modeling the decisions of the reviewer.
For example, suppose a data set has one million documents to review. Instead of dispatching teams of junior lawyers to comb through this material using individual searches, a single senior lawyer can determine document relevancy over, say, 500 or 1,000 documents. Meanwhile, TAR can use the senior lawyer’s decisions to make increasingly accurate coding decisions throughout the entire data set. This increases the quality and speed of identifying relevant documents in massive data sets, which decreases the human labor required.
Through these steps, unstructured data can be rendered highly useful to forensic investigators.

Forensic data analytics on structured data
On the other side of the gap is structured data. This data comes from tools such as ERP systems, document management systems, accounting systems, or transaction systems. It can be impractical, if not infeasible, to replicate the full user interface of these tools. Often, these tools reside on huge server networks, require expensive licenses, or have proprietary customizations that investigators cannot re-create or utilize without huge amounts of time and resources. Moreover, these systems’ user interfaces usually do not offer the specialized search functionality required to perform forensic investigations.
Thus, investigators typically extract the raw data from these applications, which they then normalize and put into standard relational databases. There is a certain amount of logic built into the design of the database itself, but often that is not enough for investigators because there is also application logic that is not extractable from the raw data. Investigators then face the challenge of interpreting data that does not fully make sense without the application logic, requiring them to essentially reverse engineer the whole logic of the application. This process involves talking to the developers of the application, checking the customizations and reviewing documentation to understand how the application logic interacts with the data.
Then, investigators generate transformations on the raw data in ways that give them the same results outside of the application that they would get within it, tailored to the specific types of queries they want to perform. This task involves painstaking labor that requires a great deal of judgement.
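The workflow of loading a raw extract into a relational database and re-creating a piece of application logic outside the application can be sketched as follows. The table, column names, and status codes are invented for the example – in practice they would come from the reverse-engineering work described above:

```python
# Illustrative sketch: load a raw ERP-style extract into SQLite and
# replicate a fragment of application logic with a SQL transformation.
# All table names, fields, and codes are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        doc_no TEXT, vendor TEXT, amount REAL, status_code INTEGER
    )
""")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [
        ("5000001", "ACME GmbH", 12000.0, 2),
        ("5000002", "ACME GmbH", 98000.0, 9),
        ("5000003", "Beta AG",    4500.0, 2),
    ],
)

# Suppose interviews with the developers revealed that status_code 9
# means "posted after manual override" -- logic that lives only inside
# the application, not in the raw table itself.
rows = conn.execute("""
    SELECT doc_no, amount,
           CASE status_code WHEN 2 THEN 'posted'
                            WHEN 9 THEN 'posted (manual override)'
                            ELSE 'unknown' END AS status
    FROM payments
    WHERE status_code = 9
""").fetchall()
print(rows)
```

Encoding the recovered logic as a SQL transformation gives investigators the same answer outside the application that the application would have shown its users.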

Bridging the gap
When investigating fraud, money laundering, embezzlement, corruption, insider trading, or other forms of white-collar crime, it is often necessary to piece together the story of who knew what and when.
It can be difficult to know the true story if the financial transactions that a subject of the investigation made – which show up on the structured side – are not correlated to the subject’s unstructured data trail.
The key is to correlate structured data such as transactions, ledgers from ERP systems, HR data, supply chains, production management, payroll, and information from social media to unstructured data such as email, office files, chat messages, audio, video, image files and cell phone data.
An ideal eDiscovery solution can effectively interpret, search, and link structured data. It provides this functionality as part of the regular eDiscovery workflow: transaction records can be imported just like any other file format, and they are integrated with – and searchable alongside – the related unstructured data.
Thus, through linking the various underlying types of data, a true trail of the subject’s actions emerges. Without the links, the trail often disappears. Furthermore, once the links are made, it becomes possible to create detailed visualizations of the connections to tease out correlations and stories that might have otherwise evaded detection.
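A minimal sketch of such a link might pair transactions with emails that mention the same counterparty within a few days of the posting date. The field names, the sample records, and the matching rule are all illustrative assumptions – real correlation logic is case-specific:

```python
# Minimal sketch of bridging the gap: link transactions (structured)
# to emails (unstructured, already indexed via eDiscovery) when an
# email names the transaction's counterparty close in time to the
# posting date. All data and the matching window are hypothetical.

from datetime import date, timedelta

transactions = [
    {"id": "T1", "counterparty": "ACME GmbH",
     "date": date(2023, 3, 10), "amount": 98000.0},
]
emails = [
    {"id": "E7", "sent": date(2023, 3, 8),
     "text": "Please push the ACME GmbH payment through before the audit."},
    {"id": "E9", "sent": date(2023, 6, 1),
     "text": "Team lunch on Friday?"},
]

def correlate(txns, msgs, window=timedelta(days=5)):
    """Pair each transaction with emails naming its counterparty nearby in time."""
    links = []
    for t in txns:
        for m in msgs:
            if t["counterparty"].lower() in m["text"].lower() \
               and abs(m["sent"] - t["date"]) <= window:
                links.append((t["id"], m["id"]))
    return links

print(correlate(transactions, emails))
```

Each resulting pair is a candidate link for the visualizations mentioned above, connecting a financial event to the communication surrounding it.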
Due to the unique and complex relationships between structured and unstructured data, there are a limited number of automated processes available to link them together. However, such linking can be achieved through careful planning and the right tools. Once the gap is bridged, whole new avenues of investigative thoroughness and power become possible.
