The Center for Data Innovation spoke with Jeff Jonas, an IBM Fellow and Chief Scientist of the IBM Entity Analytics Group. We talked about his main project, codenamed G2, as well as the future of structuring unstructured data and why he thinks computers need to learn how to dream.
Travis Korte: For those who may be unfamiliar, could you briefly introduce some of the work you do with IBM, and specifically touch on your goals with the G2 project?
Jeff Jonas: While I wear many hats at IBM, my most important and most exciting role is that of an innovator. In this job—actually a hobby for me—I try to dream up high-value, deeply differentiating technology for IBM. My current big bet, originally code-named G2, is now five years along and is my main focus these days.
In simple terms, G2 software is designed to integrate diverse observations (data) as they arrive, in real time. G2 does this incrementally, piece by piece, much the same way you would put a puzzle together at home. And just as at home, the more puzzle pieces integrated into the puzzle, the more complete the picture. The more complete the picture, the better the ability to make sense of what has happened in the past, what is happening now, and what may come next. Users of G2 technology will be more efficient, deliver higher-quality outcomes, and ultimately be more competitive.
Some of the above words are extracted from here: G2 | Sensemaking – Two Years Old Today.
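To make the puzzle metaphor concrete, here is a minimal sketch of incremental context accumulation. This is purely illustrative and is not G2's actual code: the `Entity` and `ContextAccumulator` classes and the "share at least two feature values" matching rule are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """An accumulating 'picture': every observation matched to it so far."""
    entity_id: int
    features: dict = field(default_factory=dict)  # feature name -> set of seen values

    def absorb(self, observation: dict) -> None:
        # Each new puzzle piece enriches the picture.
        for key, value in observation.items():
            self.features.setdefault(key, set()).add(value)

class ContextAccumulator:
    """Integrates observations one at a time, as they arrive."""

    def __init__(self) -> None:
        self.entities: list[Entity] = []
        self._next_id = 0

    def _matches(self, entity: Entity, observation: dict) -> bool:
        # Toy resolution rule: the piece fits if it shares >= 2 feature values.
        shared = sum(
            1 for k, v in observation.items() if v in entity.features.get(k, set())
        )
        return shared >= 2

    def ingest(self, observation: dict) -> Entity:
        for entity in self.entities:
            if self._matches(entity, observation):
                entity.absorb(observation)   # piece fits an existing picture
                return entity
        entity = Entity(self._next_id)       # piece starts a new picture
        self._next_id += 1
        entity.absorb(observation)
        self.entities.append(entity)
        return entity
```

Note that nothing in the sketch cares whether the observations describe banking customers, voters, or vessels—the domain indifference Jonas describes below.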
TK: You’ve mentioned your hope that the G2 project will have “horizontal applications,” beyond the domains in which it is being applied or piloted today. Can you describe some of these?
JJ: Today, G2 is being used in a number of domains, ranging from improving Anti-Money Laundering (AML) systems at financial institutions to modernizing voter registration in America. What is exciting for me is that every user of G2 today could pretty much be running a single instance of G2—same G2 code, same database, same context accumulation principles. The notion of puzzle pieces to pictures works well as a general principle because, from a G2 point of view, it is indifferent to whether it is assembling pictures of banking customers, voters, vessels, cars, or asteroids, for that matter.
As G2 continues to develop and an army of talented engineers let their imaginations run wild about what is possible, I hope to see G2 being applied to such things as oncology research, energy conservation, and the protection of satellites from the growing space junk problem.
TK: More generally, as someone whose work has been applied in a broad range of fields, what domains or industries do you think have the greatest potential to benefit from large-scale data analytics?
JJ: There seems to be a lot of interest in one particular use case: using G2 as an attention-directing system. There is only so much time in the day. How many suggestions do we really want machines throwing at us these days? Ideally, just the great ideas, and not too many of them. Online ads are one example—each ad pushed at you is an attempt to get your attention. For the most part, ads are a nuisance because they are not often relevant. One use of G2 will be to more carefully select what matters to whom. One day in the future you may say to yourself, “wow, these ads are really clever and useful.” G2, or something like it, will likely be behind this kind of improvement.

Better ads are an example of using attention-directing systems for opportunity; equally, there is an opportunity to use G2 for better risk management. Organizations like banks have an obligation to keep banned parties from transacting with them. Unfortunately, the systems used today to help with this triage process produce lots of false alarms. False alarms waste time, as they misdirect the human analysts charged with investigating each alert. Using G2 technology, the quality of leads in the alert queues will change. In fact, the top item in the queue will really be the most interesting item of the day. This is another example of G2 as an attention-directing system—in this case, for risk.
Related reading: When Risk Assessment is the Risk
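As a rough illustration of the alert-triage idea, the sketch below ranks alerts so an analyst sees the strongest leads first. The feature names and scoring weights are invented for this example; a real system would derive them from accumulated context rather than a hand-written table.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Alert:
    neg_score: float  # negated so the heap pops the best lead first
    description: str = field(compare=False, default="")

def score(alert: dict) -> float:
    """Toy relevance score: corroborated evidence outranks weak, noisy matches."""
    weights = {
        "exact_name_match": 3.0,  # hypothetical feature names and weights
        "shared_address": 2.0,
        "shared_phone": 2.0,
        "fuzzy_name_only": 0.5,   # the classic false-alarm generator
    }
    return sum(weights.get(feature, 0.0) for feature in alert["evidence"])

def triage(raw_alerts: list[dict], top_k: int = 10) -> list[str]:
    """Return descriptions of the top-k alerts, most interesting first."""
    heap = [Alert(-score(a), a["description"]) for a in raw_alerts]
    heapq.heapify(heap)
    return [heapq.heappop(heap).description for _ in range(min(top_k, len(heap)))]
```

The point of the design is the ordering, not the arithmetic: whatever the scoring model, the analyst's queue is sorted so the top item really is the most interesting item of the day.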
TK: You’ve written before about your vision for automatically extracting features from unstructured data, which to me sounds a lot like the way humans operate. But the narrative that computer vision (or sentiment analysis, or content-aware search) is hard still seems to dominate among a lot of applied machine learning researchers. Are they missing something? What’s keeping us from the big advancements in feature extraction that will be needed to develop these applications?
JJ: I just wrote about this, funnily enough, as I blog so rarely: Structuring Unstructured Data
But to answer your question, I believe feature and entity extraction algorithms have not really had any great advances in recent decades. It makes me think the field is attacking the problem the wrong way. Have you ever heard the phrase “climbing trees to get to the moon”? I do have a STRONG hunch about what is missing and how to fix it. But this is a rather long story and gets a bit technical—I start talking about how biology solves this, then discuss the way the neocortex is wired to the hippocampus…yada yada.
Here is a hint—item #5 in this post: Context: A Must-Have and Thoughts on Getting Some …
TK: And, in a larger sense, can you speak about some of the things you’ve learned in creating systems that take aspects of human intuition for inspiration? What sorts of things can humans do that are still hard for machines—and what’s easier than you would have expected?
JJ: Good question. Tricky question, though. Here is the way it has worked for me—I knew nothing about the brain and human cognition—but as I have been building these systems that accumulate context, I run into weird little discoveries: the discovery that errors in the data are helpful, that remembering more data makes faster compute possible, optimization challenges, and so on. Often when I hit these, I start thinking about how human cognition works and how biology has solved this, OR NOT. Sometimes I reach out to neuroscientists in academia to share what I am seeing, in hopes they can share parallels from their field. On this journey I find I can hold my own in a serious debate about what is happening in the brain—in some small way, by building and using such a system, I run into and solve (at a primitive level, for sure) things in a manner very similar to how biology seems to have solved them. For example, over the years I learned that these systems I build must favor the false negative (being cautious rather than optimistic) when integrating new observations. Well, a professor in the field tells me one of the top layers of the hippocampus first tries to negate things…i.e., “this can’t be true, this can’t be true.” I have found it must be this way, so I am comforted by this parallel. Then there was the time I realized that someday my G2 is going to have to get some sleep and do some dreaming (off-line, natch)! More about this here: Accumulating Context: Now or Never…here is an excerpt:
Now not to be too abstract here, but while I have been harping on the importance of creating Sequence Neutral processes—no trivial feat in real-time context engines—I am coming to the conclusion that a few aspects of Sequence Neutrality cannot be handled on data streams at ingestion! While this gives me a sinking feeling about the consequences for Scalability and Sustainability (i.e., no reloading, no batch processing), I am somewhat comforted by the fact that smart biological systems at the top of the food chain themselves go off-line for batch processing (i.e., sleep). I’m theorizing that dreams are in fact a species’ effort to re-contextualize information that could not be ingested with Sequence Neutrality. Because if humans could do this while awake, from a survival and evolutionary standpoint, we would!
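To sketch what this might look like in software—a toy model built on my own assumptions, not how G2 actually works—observations that cannot be placed on arrival are parked, then replayed in an offline pass once more context has accumulated: the “dreaming” step.

```python
class SequenceNeutralEngine:
    """Toy model of the idea above: place what you can on arrival,
    park the rest, and re-contextualize it in an offline pass."""

    def __init__(self) -> None:
        self.context: dict[str, set] = {}  # the accumulated picture so far
        self.deferred: list[dict] = []     # pieces that could not be placed yet

    def ingest(self, obs: dict) -> None:
        """Real-time path: integrate if possible, otherwise park the piece
        (favoring the false negative over a wrong placement)."""
        if self._can_resolve(obs):
            self._integrate(obs)
        else:
            self.deferred.append(obs)

    def dream(self) -> None:
        """Offline 'sleep' pass: replay parked observations against the
        fuller picture, repeating until no further piece resolves."""
        progress = True
        while progress and self.deferred:
            progress, still_stuck = False, []
            for obs in self.deferred:
                if self._can_resolve(obs):
                    self._integrate(obs)
                    progress = True
                else:
                    still_stuck.append(obs)
            self.deferred = still_stuck

    def _can_resolve(self, obs: dict) -> bool:
        # An observation referring to something not yet seen cannot be placed.
        ref = obs.get("refers_to")
        return ref is None or ref in self.context

    def _integrate(self, obs: dict) -> None:
        self.context.setdefault(obs["key"], set()).update(obs.get("values", ()))
```

In this sketch, an out-of-order piece such as `{"key": "B", "refers_to": "A"}` cannot be placed in real time, but resolves during `dream()` once an observation establishing `"A"` has arrived—sequence neutrality recovered after the fact, off-line.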
Photo: Flickr user Joi Ito