The Center for Data Innovation spoke with Mark Silverman, CEO of Treeminer, a data mining company based in Rockville, Maryland. Silverman discussed how Treeminer’s innovative approach to mining data addresses problems faced by companies overwhelmed by their data as well as how Treeminer is working to mine data from the Internet of Things.
This interview is lightly edited.
Joshua New: Can you introduce Treeminer and talk a little bit about your role?
Mark Silverman: I founded Treeminer to bring a novel approach to the challenges of data mining to market. I met my partner, a professor at North Dakota State University, who recognized that the vast increases in the amount of data being collected required a fundamentally different approach to mining extremely large datasets. I came at this from the perspective of someone who had successfully applied data mining methods at a global wireless services company, increasing profitability, reducing errors, and identifying major business risks through an analysis of tens of millions of call records. Treeminer was built on the notion that far more data is collected today than can be put to practical use, due to limitations in the technical approaches underlying the most popular data mining methods.
New: Could you explain the “analysis bottleneck” and the problems it causes for companies trying to make use of their data?
Silverman: More data will be collected this year than in the entirety of human existence up to this point. This data comes in a variety of formats: user generated, machine generated, etc. Within this data are hidden nuggets of knowledge that can be hugely beneficial if acted upon with the appropriate urgency. The “analysis gap,” or “bottleneck,” is the growing interval of time between when data becomes available and when action can be taken on it.
Closing this “gap” and making efficient use of the data modern organizations have available to them requires addressing a number of challenges:
- What do you need to do in order to extract benefit from the data? Data models today are very complex to build and require significant effort from data scientists to prepare and validate. Our approach dramatically simplifies the construction of models and enables new data analysis paradigms.
- How long does it take to run the data mining algorithm? Existing approaches tend to scale exponentially with large datasets, which is where our vertical approach has a fundamental advantage. Complex distance and kernel calculations in existing methods become simple Boolean operations on vertical strands of data in our approach, yielding significant performance improvements (a simple sketch of this idea follows after this list).
- What information can you get out of your data mining approach? Our approach implements “deep learning” concepts that can find patterns in the data by tackling issues such as dimensionality reduction and automatic clustering without constructing models.
- Can heterogeneous data types be handled? Because of our basic vertical algebra algorithm, we are agnostic as to data types and don’t require expensive and complex normalization of data to operate.
These four key advantages directly address the challenges of how to close this “gap” and extract knowledge from massive volumes of data.
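To illustrate the kind of translation Silverman describes in the second point, here is a minimal sketch of the classic bit-sliced technique that vertical methods build on. It assumes 8-bit unsigned attribute values, the toy port numbers are invented, and Treeminer’s P-trees (discussed later in the interview) layer compression and more sophisticated operations on top of this basic idea.

```python
# A minimal sketch of a bit-sliced ("vertical") column, assuming 8-bit
# unsigned attribute values. Each slice is a single Python int used as a
# bitmap: bit r of slice k holds bit k of row r's value.

def build_slices(values, width=8):
    """Turn a horizontal column of ints into `width` vertical bit slices."""
    slices = [0] * width
    for row, v in enumerate(values):
        for k in range(width):
            if (v >> k) & 1:
                slices[k] |= 1 << row
    return slices

def rows_at_least(slices, threshold, n_rows):
    """Bitmap of rows whose value >= threshold, using only Boolean operations.

    The loop runs once per bit slice, so its length depends on the
    attribute's width, not on the number of rows.
    """
    mask = (1 << n_rows) - 1                 # keep bitmaps n_rows wide
    gt, eq = 0, mask                         # strictly-greater / equal-so-far
    for k in reversed(range(len(slices))):   # most to least significant bit
        if (threshold >> k) & 1:
            eq &= slices[k]                  # need a 1 here to stay equal
        else:
            gt |= eq & slices[k]             # a 1 where threshold has 0
            eq &= ~slices[k] & mask
    return gt | eq

# Toy demo: which rows have a value of at least 100?
ports = [22, 199, 80, 144, 99, 100]          # invented 8-bit values
hits = rows_at_least(build_slices(ports), 100, len(ports))
print([r for r in range(len(ports)) if (hits >> r) & 1])   # -> [1, 3, 5]
```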
New: Can you elaborate on Treeminer’s unique “vertical” approach?
Silverman: We call ourselves the first “vertical data mining company.” If you think about how data scales, generally, you see many more “instances” of data than “attributes” of the data. For example, a firewall may generate tens of millions of log records, each of which may contain only a dozen attributes. When you start multiplying that by the number of security devices in a network, you are talking about massive volumes of data that may contain hidden information about an attack.
What we realized is that if you iterate over the “attributes” (or “columns”) of data rather than the “rows” of data, the scaling problem changes. In the firewall log example, you have a dozen attributes no matter how many log entries you have, so we iterate over the dozen attributes, not the millions and millions of entries. We can produce results in a fraction of the time that competitive methods can.
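As a rough sketch of that scaling shape, consider a columnar layout in Python with NumPy: the interpreted loop runs once per attribute, while each per-row comparison is a single vectorized Boolean operation. The twelve attributes and random values echo the firewall example but are otherwise invented, and the mismatch-count distance is a generic stand-in, not Treeminer’s algorithm.

```python
import numpy as np

# Vertical layout: one array per attribute, each holding one value per
# log record. A million rows, but only a dozen attributes.
rng = np.random.default_rng(0)
n_rows, n_attrs = 1_000_000, 12
cols = [rng.integers(0, 256, n_rows, dtype=np.uint8) for _ in range(n_attrs)]
query = rng.integers(0, 256, n_attrs, dtype=np.uint8)   # record to match

# Mismatch-count "distance" from the query to every row: the Python-level
# loop iterates over the 12 attributes, never over the million rows.
dist = np.zeros(n_rows, dtype=np.int32)
for k in range(n_attrs):
    dist += cols[k] != query[k]       # one Boolean comparison per column

print(int(np.argmin(dist)))           # row most similar to the query
```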
We’ve built a knowledge mining “algebra” upon which we implement the sophisticated mathematics needed to produce predictive classification results by iterating only across columns of data. In some sense, we do for data mining what vertical databases do for information retrieval—make it much more efficient and scalable.
The interesting thing we’ve found along the way is that faster results do not mean poorer results here. We’re able to produce results with equal or better accuracy than traditional approaches.
New: What sort of industries and functions benefit from this approach?
Silverman: The challenge that we have as a small startup is that these techniques are applicable across a range of industries and problem sets. Almost everyone has a ton of data and is not digging into it as quickly or effectively as they would like.
We’ve worked across a number of sectors and applications, including text document classification, image classification, machine sensor data, network traffic, etc. We work equally well in large batch processing environments and real time streaming data environments. We have customers in both the government and commercial sectors.
New: The Internet of Things has created a massive influx of data being collected; however, this data comes from a wide array of sensor types and is highly heterogeneous. How does Treeminer address this challenge to make this data usable?
Silverman: We see this as a huge opportunity. This kind of data is challenging for traditional approaches not only because of the heterogeneous nature of the data, as you mention, but also because of its volume and relatively low dimensionality.
Our vertical approach at its heart operates on data in what we call “bit strands” (or “ptrees”)—strands of vertical bit columns. Because of this approach, we basically don’t care what is inside the data and don’t really need to know—we’re looking for patterns across the bit strands. Our internally developed predictive algorithms can adapt. All we need to do is translate the data into these ptrees, which we’ve been able to do for text data, machine data, log and sensor data, image data, geographic data, etc. Once it’s organized this way, our algorithms don’t care.
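As a loose illustration of that translation step, the sketch below flattens mixed-type records into named vertical bit strands: numeric fields are bit-sliced, categorical fields are one-hot encoded. The record layout and field names are invented, and real P-trees also compress each strand into a tree structure, which is omitted here.

```python
def to_bit_strands(records):
    """Flatten heterogeneous records into uniform vertical bit strands.

    Numeric fields become one strand per bit position; categorical
    fields become one strand per observed category. Every strand is a
    list of 0/1 values, one per row, so downstream logic no longer
    needs to care about the original types.
    """
    strands = {}
    def strand(key):
        return strands.setdefault(key, [0] * len(records))
    for row, rec in enumerate(records):
        for field, value in rec.items():
            if isinstance(value, int):          # numeric -> bit slices
                for k in range(8):              # assume 8-bit values
                    strand(f"{field}.bit{k}")[row] = (value >> k) & 1
            else:                               # categorical -> one-hot
                strand(f"{field}={value}")[row] = 1
    return strands

# Invented sensor-style records mixing numeric and categorical fields.
records = [
    {"temp": 71, "status": "ok"},
    {"temp": 96, "status": "warn"},
    {"temp": 99, "status": "fail"},
]
strands = to_bit_strands(records)
# Once encoded, queries are plain Boolean combinations of strands:
not_ok = [1 - b for b in strands["status=ok"]]
hot = strands["temp.bit6"]    # bit 6 set => temp >= 64 for these values
print([h & n for h, n in zip(hot, not_ok)])    # -> [0, 1, 1]
```

Once data is in this form, the Boolean machinery sketched earlier applies uniformly, which is the sense in which the algorithms “don’t care” about the original types.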
As an example, we are currently implementing such a system using streaming machine sensor data, including vibration, temperature, and sound, to predict when a machine may require preventative maintenance or repair based solely on its current operating condition. Here we have quite heterogeneous data, and we leverage the advantages of our vertical approach to provide feedback in real time as data streams in.
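A stripped-down version of that pattern might look like the following sketch, where each incoming reading mixes several channels and a sliding baseline flags drift in any of them. The channel names, window size, and three-sigma rule are invented stand-ins for Treeminer’s actual streaming classifiers.

```python
import statistics
from collections import deque

WINDOW = 200                                  # sliding baseline length
CHANNELS = ("vibration", "temperature", "sound")

def maintenance_alerts(readings):
    """Yield (reading index, channel) whenever a channel drifts from baseline.

    Each reading is a dict with one numeric value per channel; the
    three-sigma rule below is a toy stand-in for a real predictive model.
    """
    history = {c: deque(maxlen=WINDOW) for c in CHANNELS}
    for i, reading in enumerate(readings):
        for c in CHANNELS:
            window = history[c]
            if len(window) == WINDOW:
                mean = statistics.fmean(window)
                std = statistics.pstdev(window) or 1e-9
                if abs(reading[c] - mean) > 3 * std:
                    yield i, c
            window.append(reading[c])

# Usage: for i, channel in maintenance_alerts(sensor_stream): schedule a check.
```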