The Center for Data Innovation spoke with Nick Elprin, founder of Domino Data Lab, a data science platform company based in San Francisco, California. Elprin described the three stages of the data science lifecycle, as well as how Domino allows nonprofits to use data science for good.
Joshua New: You’ve described the ideal data science platform as one that addresses all of the demands of what you call the data science lifecycle. Could you explain this lifecycle?
Nick Elprin: From our point of view, an organization investing in data science must facilitate three kinds of work.
“Ideation” is the first phase. It involves exploratory work and is typically done interactively. Data scientists explore data sets, plot quick visualizations to understand characteristics of the data, and might run basic tests to develop a sense for research avenues that are worth pursuing.
“Experimentation” is next, and is when teams test ideas during the model development process. Research becomes more formalized: teams run experiments, review the results, and adapt their approach accordingly. This is where the “science” in data science is most pronounced. It’s when teams track variations in experiments, ensure past results are reproducible, and get feedback through peer review.
“Deployment,” or “productionization,” is the last phase, and involves integrating data science output with business processes so that it actually influences decisions. This may mean integration with a human process, such as a report that gets sent to a team each day, or with an automated system, such as an API. But at the end of the day, if the business isn’t doing something differently as a result of your data science output, then it hasn’t created any value.
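As a hypothetical sketch of the automated-system case (the field names and scoring logic here are invented for illustration, not any particular platform’s interface), a deployed model is typically wrapped behind a JSON-in, JSON-out endpoint that other business systems can call:

```python
import json

def score(record: dict) -> int:
    """Stand-in for a trained model's predict() call; the formula is invented."""
    return record.get("claims", 0) * 3 + 1

def handle_request(body: str) -> str:
    """JSON-in, JSON-out wrapper: the shape a hosted model API typically exposes."""
    record = json.loads(body)
    return json.dumps({"prediction": score(record)})
```

A production platform would additionally handle the hosting, scaling, authentication, and versioning of such endpoints, which is what makes deployment a distinct phase of work.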
In practice, those phases of work are not necessarily completely discrete, and of course within each stage there can be a great deal of iteration. For example, after many rounds of experimentation, a team might conclude that the idea they were pursuing won’t work, and then begin exploring new approaches.
The point of discussing these types of work separately is not that they are cleanly separable, but rather, each type of work has different requirements and calls for different types of tools. A data science platform should enable all three.
More broadly, a data science platform should enable all three types of work to happen in the same place. Doing so allows teams and organizations to preserve and build upon work more effectively. That ability to preserve and reuse work is the fourth critical capability a data science platform should provide: across the types of work described above, teams should be able to find and reuse past work, to collaborate with each other, and to trace the lineage of projects as they evolve.
New: I’ve seen Domino Data Lab described as a “GitHub for data science.” What does this mean and why is there a need for this kind of platform?
Elprin: That description is accurate in some ways and incomplete in other ways.
Domino and GitHub are alike in that both facilitate collaboration for people doing a particular type of work: when teams want to collaborate and share work, sales teams use Salesforce, engineering teams use GitHub, and data science teams use Domino.
The specific features of Domino are quite different from those of GitHub, however. GitHub tracks source code, while Domino tracks all the artifacts associated with data science experiments, including results, parameters, data sets, compute environments, and so on. To do this, Domino provides an entire scalable compute infrastructure, so it can run data science experiments, and it also provides a place to deploy or productionize data science models. A more complete analogy: Domino’s functionality is what you’d get if you combined GitHub, a continuous integration server, and Heroku (a cloud platform-as-a-service), and then tailored all of that to data science workflows.
New: Domino also runs a program called Domino for Good that provides your platform for free or at discounted rates for education and charity-related initiatives. Could you describe some of these? How could a charity benefit from data science as a service?
Elprin: One of the most exciting aspects of our work is getting a front-row seat to see how data science is being applied across industries and problem domains. Nonprofits face interesting problems and challenges just like commercial organizations do, and they are finding that data science can help them make faster progress toward their goals.
We’re currently working with nonprofits including Data for Democracy, The National Audubon Society, MDRC, and Thorn, and educational institutions such as Galvanize, University of Virginia, and Stevens Institute of Technology. For example, Thorn is a nonprofit that is fighting child trafficking. One aspect of its work with data science is building natural language processing models to predict when escort service ads are promoting children rather than adults. The National Audubon Society, as another example, is analyzing bird population movements to better measure the impact of climate change.
New: Domino has a handful of interesting case studies on its website, ranging from helping the City of Chicago reduce the spread of foodborne illness to helping Mashable use natural language processing in their journalism. Could you describe a case study you find particularly interesting?
Elprin: Many of our customers are in the insurance industry, building models for everything from pricing insurance policies, to predicting driver safety based on sensor data, to determining the risk of flooding in a certain region. One of our customers, KatRisk, provides comprehensive and cost-effective catastrophe risk models to its clients. KatRisk has developed some of the most sophisticated models for predicting flood and wind risk, written in R. We started working with them when one of their largest customers, a reinsurance company, wanted to use these models in an automated system for pricing policies.
To support that use case, KatRisk needed a way to deploy its R models as an API, one of the exact features Domino offers. Without Domino, KatRisk would have needed to invest months of engineering time to build a production-grade system for hosting its models. With Domino, KatRisk was up and running within a few days.
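To make that concrete, a client of a hosted model API might look like the following sketch. The endpoint URL, input fields, and response field are assumptions for illustration, not KatRisk’s or Domino’s actual interface:

```python
import json
from urllib import request

# Hypothetical endpoint; a real deployment would supply its own URL.
API_URL = "https://example.com/models/flood-risk/score"

def build_request(location: str, coverage: float) -> request.Request:
    """Package policy inputs as a JSON POST body (field names are assumed)."""
    body = json.dumps({"location": location, "coverage": coverage}).encode()
    return request.Request(API_URL, data=body,
                           headers={"Content-Type": "application/json"})

def score_policy(location: str, coverage: float) -> float:
    """Call the hosted model and return its risk score (response field assumed)."""
    with request.urlopen(build_request(location, coverage)) as resp:
        return json.loads(resp.read())["risk_score"]
```

Because the model is already hosted, the pricing system only needs HTTP plumbing like this rather than months of infrastructure engineering.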
New: Gartner conducted an industry review of data science platforms and categorized Domino Data Lab as a “visionary,” instead of placing it in the other categories of “leader,” “challenger,” or “niche player.” What makes Domino a “visionary”? What do you do that other data science platforms don’t?
Elprin: Many of the products in Gartner’s review are focused on providing tools to individual data scientists. Domino is focused on enabling a more disciplined, scalable process for doing data science across teams and organizations. To use the GitHub analogy from earlier, if Domino is like GitHub, other products in Gartner’s survey would be like individual developer tools such as integrated development environments (IDEs).
Domino is not entirely unique in focusing on enabling a more mature, collaborative data science process. Beyond that, one differentiating aspect of Domino is that we have built an open platform so users can leverage the growing open source data science ecosystem. Instead of forcing data scientists to work in a proprietary programming language or GUI, Domino lets them use any language they would like, such as R, Python, Scala, and many more. Practitioners get all the benefits of working in a central place with collaboration and reproducibility, and they can use their preferred languages, packages, and tools.
We believe that the combination of these two goals—enabling a more collaborative process for teams, and providing an open platform that allows for open source flexibility—makes Domino a visionary product.