The Center for Data Innovation spoke with Edwin Chen, CEO of Surge AI, a data labeling platform that helps top companies and research labs around the world gather high-quality datasets for AI models. Chen discussed the importance of data annotation for building accurate AI models.
Gillian Diebold: What is data labeling, and why is it important?
Edwin Chen: Imagine you have 100,000 tweets, and you want to train a hate speech classifier. In order to train your machine learning model, you need to build a training set consisting of tweets that contain hate speech and tweets that don’t. Data labeling is the process of asking humans to annotate those tweets—or other datasets in general—with extra dimensions, like “Does this tweet contain hate speech?” or “What kind of hate speech does this tweet contain? Racism, sexism, violence, …”, or “Rewrite this tweet to make it not hateful.”
Other examples of data labeling include: categorizing the sentiment of user reviews, tagging financial transcripts with the names of companies and locations, evaluating whether the outputs of large language models are honest and safe, and annotating customer support tickets with accurate outcomes.
Data labeling is important because AI models are only as good as the data that you feed them. If you feed your models poor data, then they’ll mimic the bad data and give inaccurate predictions. This is a severe problem when so many important products and services depend on AI—whether it’s the content moderation algorithms at YouTube and Twitter, customer support systems at Uber and Amazon, or search engines at Google and Facebook. For example, we’ve investigated many hate speech and toxicity models and found that many of them are merely profanity detectors, even though there is a lot of hateful speech on the Internet that doesn’t contain profanity, and many people and many communities can even use profanity in positive ways.
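To make the "profanity detector" failure mode concrete, here is a minimal sketch of a keyword-based toxicity check. The word list, the second example sentence, and the ground-truth labels are hand-written assumptions for illustration and are not taken from any model Surge AI investigated. The sketch shows the two kinds of error Chen describes: friendly text that contains profanity is flagged, and hostile text without profanity is missed.

```python
import re

# Hypothetical, deliberately tiny word list; real lexicons are far larger.
PROFANITY = {"hell", "damn", "crap"}


def keyword_toxicity(text: str) -> bool:
    """Flag text as toxic if it contains any word from PROFANITY."""
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in PROFANITY for word in words)


# (text, assumed ground truth: is this actually toxic?)
examples = [
    ("hell yeah my brother", False),               # friendly, contains profanity
    ("people like you don't belong here", True),   # hostile, no profanity
]

# Collect every example where the keyword check disagrees with the label.
errors = [(text, truth) for text, truth in examples
          if keyword_toxicity(text) != truth]

for text, truth in errors:
    print(f"wrong on {text!r}: predicted {keyword_toxicity(text)}, expected {truth}")
```

Under these assumptions the keyword check gets both examples wrong, which is why a training set has to include profanity-free hate speech and positive uses of profanity.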
Diebold: Why do so many large datasets have annotation issues?
Chen: The problem is that companies, even massive technology companies like Google and Meta, lack the sophisticated data labeling infrastructure they need to create good datasets. For example, in order to use Google’s “data labeling tools,” you would have to write custom Python and Go code, have it reviewed by an internal team, and then wait for the next production deployment. Getting even a small data labeling project started could easily take months. And this doesn’t account for the fact that human labelers are often fallible, and some are deliberate spammers! You need state-of-the-art quality control in order to get the high-quality data that AI teams need.
Diebold: Can you explain some of the technology used by Surge AI to improve companies’ data labeling?
Chen: Technology powers our core products in four ways. First, many companies perform all their data labeling in spreadsheets. This causes errors, inefficiency, and scaling issues. You also can’t perform more sophisticated tasks like named entity recognition tagging this way. We provide rich, fully customizable data labeling templates that allow you to gather data in beautiful user interfaces.
Second, we have easy-to-use APIs for creating labeling tasks programmatically. This is helpful because we think of a lot of our work as “human computation” or “AWS for human intelligence.”
Third, as I mentioned above, quality control is often an adversarial problem, similar to email spam. We build sophisticated machine learning infrastructure to flag human errors and fix them. In general, we think of ourselves as a “human/AI company” where humans and AI work together to improve each other.
Lastly, for customers that enable it, we also have a “human/AI-in-the-loop” infrastructure that allows machine learning models to take over more and more of the labeling process as customers send more data and our algorithms become more accurate. This means that customers save time and money: instead of costs scaling linearly, labeling becomes increasingly efficient as they send more data.
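One common way to build a human/AI-in-the-loop pipeline is confidence-based routing: the model's label is accepted when its confidence clears a threshold, and everything else goes to a human labeler. The sketch below is an assumption about how such a system could work and is not a description of Surge AI's actual infrastructure. The `Prediction` class, the 0.95 threshold, and the cost of 10 cents per human label are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float  # model's probability for its top label, in [0, 1]


def route(pred: Prediction, threshold: float = 0.95) -> str:
    """Return "model" to accept the model's label, or "human" to send it to a labeler."""
    return "model" if pred.confidence >= threshold else "human"


def human_cost_cents(preds, cents_per_item: int = 10, threshold: float = 0.95) -> int:
    """Total human labeling cost when only low-confidence items go to people."""
    n_human = sum(1 for p in preds if route(p, threshold) == "human")
    return n_human * cents_per_item


# Hypothetical confidences for the same four items, before and after the
# model has been retrained on more human-labeled data.
early_model = [Prediction("joy", 0.60), Prediction("anger", 0.70),
               Prediction("joy", 0.97), Prediction("fear", 0.50)]
later_model = [Prediction("joy", 0.99), Prediction("anger", 0.96),
               Prediction("joy", 0.97), Prediction("fear", 0.80)]

print(human_cost_cents(early_model), human_cost_cents(later_model))
```

In this toy example the early model sends three of four items to humans and the later model sends only one, which is the sense in which cost stops scaling linearly with volume.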
Diebold: You found that nearly one-third of Google’s “GoEmotions” dataset is mislabeled. Can you talk about that process, your findings, and the impact of mislabeled data?
Chen: We were playing around with training a model on the dataset, which is a collection of Reddit comments labeled by emotion, and noticed that the model performed shockingly poorly. Our first thought was to investigate the dataset, so we took 1,000 random comments, asked our team to evaluate whether the emotion was a reasonable fit, and found that 30 percent of the emotions were mislabeled.
For example, here are some egregious errors we found in Google’s dataset:
- LETS FUCKING GOOOOO – mislabeled as ANGER, likely because low-quality labelers misunderstand English slang and mislabel any profanity as a negative emotion.
- *aggressively tells friend I love them* – mislabeled as ANGER
- hell yeah my brother – mislabeled as ANNOYANCE
The crazy thing is that these are the exact opposite of the correct emotion, and this came from research, by one of the top AI companies in the world, that was specifically dedicated to creating a dataset. The problem with low-quality datasets like these is that they not only cause your machine learning models to perform poorly, but also make your performance evaluation metrics meaningless.
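The audit described above is a sampling estimate, so it is worth asking how precise a figure from 1,000 comments is. The sketch below computes a 95 percent Wilson score interval for the error rate. It assumes the 1,000 comments were a simple random sample and that "30 percent" corresponds to about 300 mislabeled comments; the function name is illustrative and not part of any tool mentioned in the interview.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95 percent)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 300 mislabeled out of 1,000 audited comments.
low, high = wilson_interval(300, 1000)
print(f"estimated error rate: 30.0% (95% interval {low:.1%} to {high:.1%})")
```

Under those assumptions the interval runs from roughly 27 to 33 percent, so a sample of 1,000 is enough to show that the labeling problem is large rather than a sampling fluke.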
Diebold: What are the most interesting datasets you’ve encountered?
Chen: One of my favorite datasets to build is our toxicity dataset. This dataset is fascinating because toxicity is such a tricky problem. For example, what counts as toxic can change from day to day. Think of phrases like “Let’s go, Brandon!”, which would have been completely innocuous a year ago, or how so much hate speech centers around the latest politics and COVID news, which are constantly changing.
Because people have different views on what’s toxic or not, it’s also important to capture the range of human preferences in the dataset and make sure that it’s not biased toward any one political group or demographic. So there are a lot of interesting questions and best practices around building the dataset and making sure that AI models trained on it capture the subtleties of human behavior and language.
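One way to capture a range of views, rather than a single verdict, is to keep the full distribution of labeler votes for each item instead of collapsing it to a majority label. The sketch below is a generic illustration under that assumption and is not Surge AI's method. The label names and the 0.8 agreement cutoff are hypothetical.

```python
from collections import Counter


def label_distribution(votes):
    """Return the share of labelers who chose each label."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    return {label: count / len(votes) for label, count in counts.items()}


def is_contested(votes, min_agreement: float = 0.8) -> bool:
    """True when no single label reaches the agreement cutoff."""
    return max(label_distribution(votes).values()) < min_agreement


# Hypothetical votes from four labelers on one comment.
votes = ["toxic", "toxic", "not_toxic", "toxic"]
print(label_distribution(votes), is_contested(votes))
```

Items flagged as contested can then be reviewed or kept as soft labels, so that disagreement between labelers is preserved in the dataset instead of being averaged away.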