The Center for Data Innovation spoke with Haile Owusu, chief data scientist at Mashable, a news and media website headquartered in New York City. Owusu discussed the growing importance of data science in the field of online media and why he thinks there are so many physicists in data science roles.
Joshua New: At Mashable, one of your big projects has been a tool called Velocity. What does Velocity do, and why is it so important for a company like yours?
Haile Owusu: I’ll start off with the early inception of Velocity. Before I joined Mashable, the editorial staff spent a lot of time figuring out what to write about. It was a very labor-intensive process—a writer will typically look at a huge array of content from Twitter, news feeds, preferred publishers, and even competitors, constantly scanning to see what people are writing about and are interested in. It’s such a large space to cull for potential stories, and they have to do it every single day. So we liked the idea that technology could help us eliminate a lot of this “dumb” human labor and even help to write stories. Our chief technology officer Robyn Peterson and chief executive officer and founder Pete Cashmore had the idea to monitor the usual suspects of topics for some of our writers and set up an automated system to keep track of the output for each of these publishers and RSS feeds and so on.
That was very useful, but it was primarily a passive support system for our writers. Mashable noticed that if writers jumped on a topic that was doing well in terms of popularity at that time, they got a lot more attention. Essentially, if you write about a trending topic, it’s more likely that what you write will trend. So we wanted to upgrade this idea by trying to predict what will trend in the future, rather than just learn what is currently trending. There was an early version of this that wasn’t terribly impressive, so I was brought on to develop an algorithm to dramatically increase the accuracy of these social-engagement forecasts. As part of that upgrade, we also dramatically increased the capacity of Velocity. Now we can process well over a million websites per day, classify them, and then make a predictive assessment of their expected engagement on various social networks. So that’s the idea behind Velocity, and it’s been quite successful for us. If we look at our top 10 stories, by shares or views, the majority of them were sourced by Velocity. I should also note that we’ve started to license Velocity to other companies because we’re so confident in it.
New: Other than to develop tools like Velocity, what use does a news website have for a chief data scientist? Is having one just a luxury, or do you think all news services need them if they want to be competitive?
Owusu: First I think it’s necessary to make the distinction between a chief data officer and a chief data scientist. My job as a data scientist means I’m responsible for algorithm development and deployment, statistical modeling, and underlying processes that lead to social engagement. The role of a chief data officer does have substantial overlap, but their primary job is to make sure that the organization is pulling in the right kind of data. I do some of that, but I’m more responsible for the mathematical heavy lifting.
It’s an interesting question about whether it’s a luxury for a news service to have one. I’d say yes, ostensibly. News services existed well before data scientists existed, if you believe that data science differs from the work of a statistician, but they live in a dramatically changing landscape. Today, most news services on the Internet make the lion’s share of their revenue from advertising. Advertising revenue from a particular unit of content tends to go down, so there’s this race that you have to run to accumulate as much attention on your content as possible. This poses a pretty substantial challenge, but the Internet also makes it vastly cheaper to distribute content. Before this era, it was impossible to know what happens after you publish your piece online. You don’t know if people will flock to it and you don’t know how they will hear about it. If it is popular, you would assume that it’s because the content is good, but it doesn’t always happen that way.
So, it can be a luxury to have a data scientist or team at a publication, but I think it’s increasingly becoming a necessity.
New: Before your current role, you headed research efforts at a company called SocialFlow, which uses data from sources like Twitter to help companies improve their social media presence. Could you walk me through what that process is like?
Owusu: SocialFlow has a really interesting product that essentially allows you to customize engagement for your content by making sure social media posts are timed appropriately for the most active window of your Twitter followers. You can imagine that, if your audience is based in Eastern Standard Time and you post an article at 3 AM, you’re probably not going to get the kind of engagement that you want. That’s an obvious optimization, but SocialFlow used a much more granular, localized strategy for tweet publication.
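As an illustration of the timing idea only, the sketch below picks the local hour in which a set of followers is most active. It is not SocialFlow's method; the function name `most_active_hour`, its parameters, and the sample timestamps are all hypothetical, and it assumes the activity timestamps are timezone-aware.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def most_active_hour(activity_times, utc_offset_hours=0):
    """Return the local hour (0-23) containing the most follower activity.

    activity_times: iterable of timezone-aware datetimes (assumed input).
    utc_offset_hours: fixed offset of the audience's time zone from UTC.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    counts = Counter(t.astimezone(tz).hour for t in activity_times)
    if not counts:
        raise ValueError("no activity timestamps given")
    # Highest count wins; ties go to the earliest hour so the result is deterministic.
    return min(counts, key=lambda hour: (-counts[hour], hour))


# Made-up follower activity, recorded in UTC.
sample = [
    datetime(2015, 3, 2, 14, 5, tzinfo=timezone.utc),
    datetime(2015, 3, 2, 14, 30, tzinfo=timezone.utc),
    datetime(2015, 3, 2, 15, 10, tzinfo=timezone.utc),
    datetime(2015, 3, 3, 1, 0, tzinfo=timezone.utc),
]
best_hour = most_active_hour(sample, utc_offset_hours=-5)
```

A fixed UTC offset ignores daylight saving time, and a single best hour is far coarser than the granular, localized strategy described above; the sketch only shows the basic counting step.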
There are a lot of aspects that you need to bring to bear for this kind of approach. Tweets are really compact pieces of language and people use all kinds of new slang and grammatical irregularities, so regularizing this language is always the first task for working with Twitter data. Twitter is really fascinating for just being a wild linguistic experiment.
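To make "regularizing" concrete, here is a minimal sketch of one possible normalization pass over a tweet. The rules and the tiny `SLANG` table are assumptions chosen for illustration, not a description of what SocialFlow or Mashable actually run.

```python
import re

# Hypothetical slang table; a real system would need a far larger, curated mapping.
SLANG = {"u": "you", "r": "are", "gr8": "great", "2day": "today"}


def normalize_tweet(text):
    """Return a list of normalized tokens for one tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)   # replace links with a placeholder
    text = re.sub(r"@\w+", "<user>", text)          # replace mentions with a placeholder
    text = re.sub(r"#(\w+)", r"\1", text)           # keep the hashtag word, drop the '#'
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)      # squeeze runs of 3+ repeated characters to 2
    tokens = re.findall(r"<\w+>|\w+", text)         # placeholders and word-like tokens only
    return [SLANG.get(token, token) for token in tokens]


tokens = normalize_tweet("@bob U r sooooo gr8!!! #Winning http://t.co/abc")
```

This is deliberately crude: it splits contractions at the apostrophe and discards punctuation and emoji, which a production pipeline would likely want to handle with more care.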
New: You have a background in theoretical physics. Interestingly, this seems to be not that uncommon—we’ve interviewed a few other (1, 2) data scientists with similar backgrounds. How are these skills transferrable to more commercial applications?
Owusu: There are a couple reasons this happens. In a broad sense, physics is a very general approach to problem-solving. There are some subsets that can be very narrow and focused, but at its core there are a handful of basic mathematical principles that can be useful for a lot of situations. For example, you’ll see a lot of physicists working on things like understanding city traffic flows because the training tends to develop the ability to abstract these problems to a form that those basic principles can solve.
There is a remarkable universality to it. For example, at my job now, people want to know who is going to view the content after we publish and how they are going to access it. That seems like a very different problem than, say, the generation of avalanches or the percolation of fluids in a granular medium, but actually they are the same problem at their core, just with slight variations. I think that’s why physics backgrounds are disproportionately represented in data science—because they tend to examine each problem for its universal traits to know how to solve it.
New: You’ve discussed before how you’re interested in cross-network effects—i.e. how users share and distribute news stories across multiple networks. What have you learned so far?
Owusu: The most interesting thing I’ve learned in predicting engagement for a piece of content is that even without granular personal user data—which I do not use—like that which can be gleaned from social media, we can still make some really accurate predictions. How many times a post will be shared or favorited, for example.
Another really interesting thing is that there is a real tendency for content to jump the fence. You’ll post something on Facebook, and it will make its way to Twitter, Reddit, LinkedIn, and so on. Compared to how publishers used to physically distribute content, this natural flow of content is almost magical. It’s so interesting how frequently these cross-network hops occur. For example, Twitter is so much more rapid than Facebook and both have different user bases, but content still transfers between the two and is still successful.
There are also quirks. For example, we analyze the progression of how frequently a piece of content is shared. It normally has a pretty smooth trajectory for engagement, but there are numerous cases where, for totally mysterious reasons, it changes dramatically. Normally you’d think that this was the result of promotional activity like a sponsored tweet that exposed the content to more people, but this happens on its own all the time. There are facets of the network ecology that still baffle us, and we are really interested in figuring them out.
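One simple way to flag such a sudden change in a share trajectory is to compare each interval's new shares against a trailing baseline. The sketch below is only an illustration under assumed inputs: the function name `find_share_jumps`, the window of 5 intervals, the factor of 3, and the sample counts are hypothetical and are not Mashable's method.

```python
from statistics import median


def find_share_jumps(cumulative_shares, window=5, factor=3.0):
    """Return indices into cumulative_shares where sharing jumps abruptly.

    An interval is flagged when its new shares exceed `factor` times the
    median of the preceding `window` intervals.
    """
    # Shares gained in each interval between consecutive measurements.
    increments = [b - a for a, b in zip(cumulative_shares, cumulative_shares[1:])]
    jumps = []
    for i in range(window, len(increments)):
        baseline = median(increments[i - window:i])
        if baseline > 0 and increments[i] > factor * baseline:
            jumps.append(i + 1)  # increment i ends at measurement i + 1
    return jumps


# Made-up cumulative share counts sampled at regular intervals.
shares = [0, 10, 19, 27, 34, 40, 45, 95, 100, 104]
jump_points = find_share_jumps(shares)
```

Using the median rather than the mean keeps a single earlier spike from inflating the baseline. A check like this can only locate a jump; it says nothing about why the jump happened.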