The Center for Data Innovation spoke with Jay Qi, lead data scientist at DrivenData, an organization based in Denver, Colorado, that runs data science competitions to create AI solutions for social good. Qi discussed how DrivenData has helped develop models that can identify endangered species and how privacy-enhancing techniques can help unlock sensitive data for social good.
The interview has been edited.
Hodan Omaar: Can you talk about DrivenData’s machine learning competitions?
Jay Qi: In DrivenData’s online machine learning competitions, data scientists around the world compete to build the best algorithms for impactful real-world problems. The performance of each solution is evaluated automatically and displayed on a live leaderboard, a structure that has been shown to raise both the peak performance and the level of engagement achieved on machine learning problems. Our specialty, and the main thing that differentiates us from other machine learning competition platforms, is our focus on social good applications. Over the past 8 years, we’ve run over 65 competitions and awarded a total of more than $3.3 million in prizes. Our competitions span a wide range of application areas, from sustainability to health to social media moderation and more. We require code and documentation for winning models to be open-sourced in order to serve as an openly accessible and durable resource.
Our challenge partners use the competition format to crowdsource solutions to their problem from skilled data scientists around the world. Data science modeling is a famously iterative process, and a competition is an effective way to explore the solution space in parallel. If one’s problem has a good dataset and clear evaluation metrics, a competition provides much more exploration than a single data scientist or even a small team of data scientists would be able to get done.
For our community of data scientists, DrivenData’s competitions are a way for them to engage with interesting and impactful applications to practice their skills and potentially even win a prize. Defining a well-framed problem with a good dataset is a common barrier to starting a data science project, and we’ve done the initial heavy lifting.
The machine learning competitions are just one of many things that DrivenData does. We have also created a popular data science project template, and we consult directly with mission-driven organizations, maintain many open-source software tools, and publish learning resources on our blog.
Omaar: What do you think are some of the most interesting real-world impacts you have had?
Qi: Our competitions span a pretty wide range of problems, and each is interesting in its own way, which makes them hard to compare. Here are a few to give a taste of the breadth:
- Our Hateful Memes Challenge in collaboration with Meta AI Research looked at identifying hateful content in social media posts that depended on both text and image content.
- Our VisioMel Challenge used digitized microscopic images of skin melanoma to predict the likelihood of cancer relapse.
- Our Snowcast Showdown challenge looked at estimating the fresh water contained in seasonal snowpack for water management and was evaluated live against data collected across the Western U.S. over the course of the winter in 2022.
- Our Mars Spectrometry competition in collaboration with NASA researchers involved analyzing geochemical data collected by the Curiosity rover on Mars.
One of my personal favorites is our Where’s Whale-do competition. The task was to identify individual beluga whales from photographs of an endangered population that visits Cook Inlet, near Anchorage, Alaska, every year. This is something that research biologists at NOAA Fisheries otherwise have to do painstakingly by hand.
Omaar: Part of what it seems you do is open an organization’s eyes to the potential of their own data. What stops organizations from seeing what you see in the first place?
Qi: Nowadays, I feel that most organizations have caught up to the idea that their data can have deep potential for empowering their work. Everyone has been talking about data and machine learning for years, and now everyone is excited about AI and large language models (LLMs). Figuring out what to do about that can still be hard, though!
In our data science consulting work, we work closely and collaboratively with partner organizations using principles from human-centered design to understand their needs and identify the right way to approach the problem. Ultimately, it’s important to focus on the problem being solved from the perspective of the stakeholders, rather than trying to use data for data’s sake. Something else that we’ve found especially helpful is being able to discuss similar use cases from our experiences with other organizations or that we’ve seen in industry. Having examples to ground discussions makes a big difference towards helping organizations understand what’s possible and what’s worth doing.
Additionally, making effective use of data does require investment in technology, processes, and staff. There is a fantastic article about the “Data Science Hierarchy of Needs” (a play on Maslow’s famous hierarchy of needs) that provides a helpful framework for thinking about what is fundamentally required to do successful data science. An organization needs to figure out data collection first, then data movement and storage infrastructure, then data cleaning, and so on. When we work with an organization, appropriately addressing where they are in the hierarchy of needs is critical to long-term success.
Omaar: If there was a type of data you could unlock to best serve social good, what would it be?
Qi: One of the challenges that we encounter is that impactful data can often be sensitive. It makes sense that useful data for helping people is also often about people, but data about people often carries privacy and safety implications. We’ve spoken with stakeholders and leaders from organizations ranging from municipal governments to medical programs. They know that being able to collaborate with other organizations by sharing data could have a lot of benefits, but they just can’t do it because there are too many risks involved from a privacy and compliance standpoint. How can we use sensitive data about people in analysis or in machine learning, while also protecting their privacy? This is not a fully solved problem, though we’re excited to follow continued research in the field of privacy-enhancing technologies. There are a lot of promising approaches like differential privacy, federated learning, and homomorphic encryption that are being developed to address this problem. DrivenData has even participated in advancing research in the field: we’ve run a few competitions in partnership with NIST and other agencies to support research in differential privacy and privacy-preserving federated learning. Privacy-enhancing technologies aren’t yet ready for us to depend on them for running an open and public machine learning competition on sensitive data, but we’re eager for when that day comes.
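To make one of the approaches Qi mentions more concrete, here is a minimal sketch of differential privacy using the standard Laplace mechanism on a counting query. It is an illustration only, not DrivenData's or NIST's method; the function names (`laplace_noise`, `private_count`), the toy records, and the choice of `epsilon` are all assumptions made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    A counting query has sensitivity 1: adding or removing one person
    changes the true count by at most 1. Adding Laplace noise with
    scale 1 / epsilon is the textbook way to make such a query
    epsilon-differentially private.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical toy data: one record per person, with an age field.
    people = [{"age": age} for age in range(100)]
    rng = random.Random()
    noisy = private_count(people, lambda p: p["age"] >= 65, epsilon=1.0, rng=rng)
    print(f"Noisy count of people aged 65 or older: {noisy:.1f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy, while a larger one gives a more accurate answer. Each released statistic also spends part of a finite privacy budget, which is one reason applying this to an open, public competition on sensitive data remains hard.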
Omaar: DrivenData maintains a number of popular open-source projects for data science. Can you briefly explain why this is important for innovation?
Qi: By having the winning models from our competitions be open source, anybody can use or build on the results, not just the competition sponsors, maximizing our social impact. We sometimes carry forward competition solutions towards becoming production-ready open-source software, such as Project Zamba for supporting wildlife monitoring and the CyFi cyanobacteria finder tool for water managers. Many of our competitions also release open data after their conclusion to make further research and development possible. Any open code and open data additionally end up becoming a learning resource that builds capacity, not just in the particular machine learning task but in working with data in that social good application area.
Additionally, we feel that it’s important to contribute back to the body of open-source data science tooling that makes our work possible. That’s why we release and maintain open-source tools that we think can be generally useful, like our data science project template, a Python library for accessing cloud file storage, and a diagramming tool for data models. Beyond tools, we also benefit a lot from the data science learning resources that are freely available online, and we often contribute back with helpful blog posts, from a primer on satellite data to a guide to publishing Python packages to “Getting started” tutorials for our competitions.