The Center for Data Innovation spoke with Christopher Wells, vice president of research and development at Indico Data. Indico is a Boston-based startup that uses AI and machine learning to give structure and meaning to unstructured data. Wells spoke about some of the costs associated with unstructured data.
Gillian Diebold: What is unstructured data?
Christopher Wells: There are lots of ways to define this, ranging from a concrete list of examples (documents, images, and so on) to the very abstract ("lack of a predefined data model"). I tend to split the difference and define unstructured data as any data that is not easily or usefully represented in a spreadsheet.
You could, of course, put pictures in cells in an “image” column in Excel, but you can’t do anything with that data beyond knowing if something is in that cell or not.
Diebold: How do artificial intelligence and machine learning help give structure to unstructured data?
Wells: The TL;DR answer is this: machine learning provides a medium in which the user can memorialize their understanding of what matters in their unstructured data. That understanding of what matters is what defines the structure we want to get out of the unstructured data.
The longer answer is this: Referring back to my earlier Excel example, what you really want in that workbook is a row per image that tells you something meaningful about the image. To go from unstructured (literally bits and bytes from the computer's point of view) to structured, you need clarity on what is meaningful. This is where machine learning comes into play. We have powerful modeling techniques (I'll put them all under the very broad "deep learning" umbrella) that can take unstructured data and infer useful features from it. Sometimes those features are simple and obvious enough that a human doesn't need to get involved, i.e., the machine can learn, unsupervised, that there are distinct categories in your data. More often, however, the user wants more structure than just "this is category A, not category B," so the user needs to provide additional supervision to connect the model's internal representation of the data to what the user cares about, e.g., you show the computer a receipt and tell it, "when you see this text, give me back the line items."
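To make that distinction concrete, here is a minimal sketch, not Indico's platform, using scikit-learn on invented documents and labels. It contrasts the unsupervised case, where a model discovers that distinct categories exist, with the supervised case, where user-provided labels connect the model's representation to the structure the user actually cares about.

```python
# Illustrative sketch only (not Indico's pipeline): unsupervised category
# discovery vs. supervised labeling on hypothetical documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical unstructured documents.
docs = [
    "Invoice #1042: 3 widgets @ $19.99, total $59.97",
    "Invoice #1043: 12 gaskets @ $2.50, total $30.00",
    "Dear hiring manager, I have five years of Python experience...",
    "Dear hiring manager, I am proficient in Java and SQL...",
]

# Turn raw text into numeric features the models can work with.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Unsupervised: the model infers that distinct groups exist (invoices vs.
# cover letters) without being told what those groups mean.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Discovered clusters:", clusters)

# Supervised: the user supplies labels encoding what matters to them, so new
# documents can be mapped onto that structure.
labels = ["invoice", "invoice", "resume", "resume"]
clf = LogisticRegression().fit(X, labels)
print("Prediction:", clf.predict(vectorizer.transform(["Invoice #1044: 2 pumps @ $99"])))
```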
Diebold: What are some costs associated with unstructured data?
Wells: This is a very broad question, so my answer isn't all-encompassing, but here's a start to understanding the costs.

There are the obvious costs driven by the mere existence of unstructured data: storage and management. For pretty much every species of unstructured data there are platforms and tools, none of them free of course, for accomplishing those tasks. Then there are the costs of unstructured data in motion. Organizations have entire teams of analysts who work all day, every day, with invoices or videos or pictures, using their human intelligence to route that data to someone else, put details in a database, or trigger other elements of a process. Those teams are expensive.

This brings us to the next set of costs: automation for workflows driven by unstructured data. According to Gartner, only 20 percent of AI-enabled projects reach deployment. So while successful automation might dramatically reduce the cost of these teams and workflows, it's easier said than done.

Finally, there are risks. Existing processes pull some structure from your unstructured data, and that gets stored in a system of record. However, those processes almost never record everything you might care about, and this is especially true for the elements of unstructured data that are hardest to represent in a structured way. As a concrete example, the removal of LIBOR as an interest rate benchmark sent thousands of organizations to work manually sifting through millions of documents, representing trillions of dollars in transactions, to find any language describing what to do if a LIBOR fixing was unavailable. Simple questions like "what happens to this loan if LIBOR goes away?" are not simple to answer if the details are locked up in an unstructured format.
Diebold: How does Indico handle issues of explainability?
Wells: Our approach to explainability is a very practical one. We try very hard not to show the user information that they can't do anything about. For example, it's nice to know what your model's F1 score is for correctly predicting programming languages on resumes. However, that statistic alone doesn't tell you what to do, so we pair the high-level statistical summary with the hard facts: here's what you labeled and here's what your model predicted. Our models train with few enough examples that you can quickly go through all of the test examples and identify patterns, e.g., "we labeled this value inconsistently" or "there is a variant of this information that isn't well represented in the training data." To sum this up, statistics are often the end goal for the data scientist, but for our users they are the starting point. The number tells you there is room to improve the model, and the rest of the information we provide gives you what you need in order to do so.
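As an illustration of that principle, not the platform's actual interface, the sketch below pairs a headline F1 score with the example-level comparison of labels and predictions that a user can act on. The documents, labels, and predictions are invented.

```python
# Illustrative sketch: pair a summary statistic (F1) with the example-level
# detail a user can actually act on. All data here is hypothetical.
from sklearn.metrics import f1_score

test_docs = ["resume_01", "resume_02", "resume_03", "resume_04"]
labeled   = ["python",    "java",      "python",    "sql"]
predicted = ["python",    "java",      "java",      "sql"]

# The high-level summary: it tells you there is room to improve, but not how.
print("Macro F1:", f1_score(labeled, predicted, average="macro"))

# The actionable detail: what you labeled vs. what the model predicted, so you
# can spot inconsistent labels or under-represented variants.
for doc, y_true, y_pred in zip(test_docs, labeled, predicted):
    flag = "" if y_true == y_pred else "  <-- review this example"
    print(f"{doc}: labeled={y_true!r}, predicted={y_pred!r}{flag}")
```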
Diebold: How does the Indico platform improve business intelligence?
Wells: First off, our platform makes business intelligence possible with unstructured data. Our models and workflows allow users to represent their unstructured data in structured ways that are meaningful to them. Referring back to the LIBOR example, not only can you find the LIBOR language as a tactical matter, you can also start to answer obvious questions like, "What fraction of our contracts substitute PRIME for LIBOR?" You can also enrich the structure you've mined out of your unstructured corpus with important metadata, such as who the counterparties to a transaction are. That allows you to take the answer to the query above and filter it by counterparty, date range, or whatever else you deem appropriate.
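For instance, once fallback language and counterparty metadata have been extracted into a structured table, those questions become ordinary queries. The sketch below uses pandas with invented contract data to show the kind of filter-by-counterparty and date-range analysis described above; it is an assumption-laden illustration, not Indico's product.

```python
# Illustrative only: once contracts have been mined into a structured table,
# BI-style queries become possible. The data below is invented.
import pandas as pd

contracts = pd.DataFrame({
    "contract_id":    ["C-001", "C-002", "C-003", "C-004"],
    "counterparty":   ["Acme Bank", "Globex", "Acme Bank", "Initech"],
    "signed_date":    pd.to_datetime(["2019-03-01", "2020-07-15", "2021-01-20", "2018-11-05"]),
    "libor_fallback": ["PRIME", "SOFR", "PRIME", "none specified"],
})

# "What fraction of our contracts substitute PRIME for LIBOR?"
prime_fraction = (contracts["libor_fallback"] == "PRIME").mean()
print(f"PRIME fallback share: {prime_fraction:.0%}")

# The same question filtered by counterparty and date range.
subset = contracts[
    (contracts["counterparty"] == "Acme Bank")
    & (contracts["signed_date"] >= "2019-01-01")
]
print(subset[["contract_id", "libor_fallback"]])
```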