The Center for Data Innovation spoke with Chris Danforth, applied mathematician and co-director of the University of Vermont’s Computational Story Lab. Danforth discussed how his team can measure the happiness of the Internet and what we can learn from analyzing the sentiment of Twitter posts.
This interview has been edited.
Josh New: What is the Computational Story Lab? What kinds of issues do you work on?
Chris Danforth: The Computational Story Lab is a group of applied mathematicians working on large-scale, system-level problems in many fields including sociology, nonlinear dynamics, networks, ecology, and physics. Peter Dodds and I co-direct the group, and our work has quantified the role of social influence in how ideas spread, improved weather forecast models using chaos theory, and built Hedonometer, a tool that can measure the happiness of large populations in real time.
New: Hedonometer is fascinating. How does it work, and how do you manage all of the data that feeds into it?
Danforth: Hedonometer is a sentiment analysis program that looks at Twitter data to measure public sentiment in real time, offering something akin to an emotional weather report each day. We receive roughly 50 million tweets each day. We started the project in 2008, and have parsed a total of almost 1 trillion words from the service over the last seven years. The average happiness of these words is calculated and reported at various temporal and spatial scales, such as countries, states, and cities.
New: How do you approach assigning and analyzing subjective values like “happiness” to such a large volume of data?
Danforth: We gathered the 10,000 most frequently used words in 10 languages, and asked native speakers to rate these words on a scale from happy to sad. Each word was rated by at least 50 people, and their average score is the weight our instrument assigns to it. Since both the participants and our instrument consider words out of context, there are some limitations. However, given the volume of words we analyze, the trends we observe appear to be independent of these limitations.
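The averaging step Danforth describes can be sketched in a few lines. This is only an illustration, not the lab's actual code: the word scores below are made up for the example (the real instrument uses crowd-sourced ratings averaged over at least 50 raters per word).

```python
# Toy sketch of Hedonometer-style scoring: each word carries a
# crowd-rated happiness weight, and a text's happiness is the
# average weight of its scored words. Scores here are illustrative.
word_happiness = {
    "laughing": 8.0, "happy": 8.3, "lol": 7.4,
    "violence": 2.0, "racist": 1.8, "the": 5.0,
}

def average_happiness(text):
    """Mean happiness of the scored words in text; None if none are scored."""
    scores = [word_happiness[w] for w in text.lower().split()
              if w in word_happiness]
    return sum(scores) / len(scores) if scores else None

print(average_happiness("lol laughing happy"))  # mean of 7.4, 8.0, 8.3
```

Because each word is scored out of context, a sentence like "not happy" would still score as happy here, which is exactly the limitation Danforth notes; the approach relies on large volumes of text to wash out such cases.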
New: What are some interesting insights you’ve gained from this kind of large-scale sentiment analysis?
Danforth: We found some pretty interesting things from analyzing tweets about the protests in Ferguson, Missouri. Our instrument found that over the seven days from August 15 to 21, positive words like “lol”, “hahaha”, and “laughing” were used less frequently in Missouri than in the United States as a whole, and the negative words “racist”, “violence”, and “protest” were used more frequently.
Our analysis reveals that sentiment peaks in the morning and decreases throughout the day, as though we require a nightly reboot. Profanity increases throughout the day, peaking around midnight.
We also discovered that Kramer was the happiest character on Seinfeld by analyzing the sitcom’s 600,000 total spoken words. Jerry delivered roughly 150,000 words to Kramer’s 70,000, while George said 110,000 and Elaine said 80,000. Compared with Jerry, Kramer used the happy words “buddy”, “delicious”, and “whoa mama” more often, and the negative words “not”, “don’t”, and “stupid” less often.
New: What are some potential future applications of Hedonometer?
Danforth: Imagine, as a simple example, that large sodas were banned in New York City. Using traditional survey-based data, it might take years to measure the health outcomes in the city. If our instrument can infer changes in eating habits at the resolution of neighborhoods, and at a timescale of weeks, through what people are saying, then public health officials can use that information to target their campaigns to reduce obesity far more effectively.
Hedonometer can be used today as an unsolicited public opinion poll surrounding current events. Journalists can use the word shift graphs related to a particular event to quantify the large-scale conversation, rather than simply choosing an anecdotal tweet. Our hope is that with Hedonometer, anyone will be able to make and share geographically localized observations of crowd-sourced public opinion, and generate a defensible quantification of the collective conversation on Twitter and elsewhere.
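A word shift graph breaks the happiness difference between two texts into per-word contributions: a word matters if its score differs from the reference average and its frequency changed. The sketch below is an assumption-laden illustration of that idea, not the lab's implementation; the `word_shift` function and its toy score dictionary are invented for the example.

```python
from collections import Counter

def word_shift(reference, comparison, scores):
    """Per-word contributions to the happiness difference between two texts,
    sorted by magnitude. Each contribution is (word score - reference
    average) times the change in that word's relative frequency."""
    def freqs(text):
        # Relative frequency of each scored word (texts are assumed
        # to contain at least one scored word in this toy sketch).
        words = [w for w in text.lower().split() if w in scores]
        return {w: c / len(words) for w, c in Counter(words).items()}

    p_ref, p_cmp = freqs(reference), freqs(comparison)
    h_ref = sum(scores[w] * p for w, p in p_ref.items())  # reference average
    contributions = {
        w: (scores[w] - h_ref) * (p_cmp.get(w, 0.0) - p_ref.get(w, 0.0))
        for w in set(p_ref) | set(p_cmp)
    }
    return sorted(contributions.items(), key=lambda t: abs(t[1]), reverse=True)
```

The contributions sum exactly to the overall happiness change, which is what lets a journalist point at specific words, rather than an anecdotal tweet, to explain why the conversation shifted.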