The Center for Data Innovation spoke with Gilad Lotan, Head of Data Science at BuzzFeed. Lotan discussed how companies can work to keep their algorithms accountable and the role data can play in the fight against fake news.
Joshua New: How does an organization like BuzzFeed use data differently than a more traditional media company, such as a newspaper?
Gilad Lotan: As someone who comes from the world of startups and technology, BuzzFeed to me feels like a tech- and data-first company rather than a media-first company. There’s a huge amount of investment in our technology layer and a huge amount of investment in our data infrastructure. I’d say this is very different than a lot of more traditional media companies. Part of this has to do with our approach and culture, but another part is that BuzzFeed isn’t slowed down by a legacy business model, so we can move much faster with significantly less risk when we want to try new things.
On the data infrastructure side, for example, a few years ago we decided that it doesn’t really matter where content lives because we found that Facebook, YouTube, and other platforms could be as valuable a place to create content and learn from audiences as our own sites. This was a big change from the way media organizations did things a few years ago. Many of them are still focused on bringing users to their sites. While we do want users to come to our sites and apps, it’s just as important to us to engage our audience on distributed platforms.
Everyone across the company is very attuned to data. We have 20 to 30 internal tools that we’ve built in house that give different stakeholders the ability to consume and query data. Everyone has access to our data warehouse and we train everyone who wants to know how to interpret this data. There’s also a big focus on feedback loops. Say you’re a content creator or a business analyst—there’s a focus on understanding the data behind the actions you’re taking and tweaking your actions to make them more effective. This feedback loop helps us get better at answering important questions, such as what our audience cares about and what kind of content our audience likes.
On all of these fronts, BuzzFeed looks very different compared to other media companies.
New: At SXSW in 2014, you gave a presentation titled “Algorithms, Journalism, and Democracy,” which focused in part on how the general public tends to assume algorithms are inherently neutral. So, when algorithms produce an unintended outcome, people are inclined to think of this as malicious, rather than recognize that it’s simply the result of an imperfect formula. What’s the best way to ensure that journalism organizations can prevent these kinds of mistakes from happening?
Lotan: Algorithmic systems are complex, and they’ve gotten a pretty bad rap over the past year. I’ve been writing about this topic for years now, so it’s interesting that there’s all this new attention all of a sudden. So yes, these systems are complex, but it’s rarely just one algorithm at play. You can think of an algorithm like a recipe or a formula: given a series of ingredients or inputs, you’ll get a certain kind of result. Algorithms operate all across the Internet, as ranking systems or recommendation engines, for example, and many times they’re interacting with each other. So there’s never just one recipe running at once, and there’s rarely a single person who fully understands how all the pieces interact in a system. Even if an engineer or data scientist designs a specific algorithm for a specific purpose, and even if that algorithm is successful, when you look at the system as a whole there are many other components interacting with it, and it can be really hard to untangle all of them.
I ran an analysis a few years ago about the effect of buying fake followers on social media. Fake followers are something of a taboo—nobody talks about doing it, but everyone does it. If you’re a public figure, you do it to show that you have a wide following, for example. Once you boost your status on one network, this can have a huge effect on other networks, even if you used fake followers to do so. I’m not sure if this still works since it was a few years ago, but when you buy followers on Twitter, you then get a boost in search rankings on Twitter and on major search engines.
There are all these different algorithmic systems that feed off each other and can have some strange impacts across platforms. The most important thing to do is recognize that this can be an issue, particularly for media outlets. The Facebook news feed, for example, is effectively the way a large part of the country stays informed about what’s going on. However, there’s likely not any one person at Facebook who truly understands all the attributes that influence this algorithm. Our role should be to help our users understand these algorithmic systems and the power that they can have. What we want to convey is that what people see is not necessarily the truth just because algorithms use math.
Companies should also try to act as something of a watchdog for their algorithmic systems. This isn’t just true for media outlets—take Apple, for example. In the app store, an algorithm decides which apps to display as the most popular apps of the day, and whichever developers make this list can expect a large increase in downloads. That’s a lot of power for an algorithm to have, so effectively monitoring inputs and outputs can be really important to ensure that bias isn’t negatively impacting a system.
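To give a hypothetical sense of what that kind of watchdog might look like in practice, here is a minimal sketch that tracks a daily top-apps list and flags suspicious concentration or a lack of turnover. The data shape, thresholds, and function are illustrative assumptions, not any platform’s actual system.

```python
# Hypothetical watchdog for a ranking algorithm's outputs: track the daily
# top-N list and flag suspicious concentration or a lack of turnover.
from collections import Counter


def monitor_top_list(daily_top_lists, max_share=0.4, max_overlap=0.9):
    """daily_top_lists: one list of developer names per day, most recent last."""
    today = daily_top_lists[-1]

    # Concentration check: does any single developer hold too large a share
    # of today's featured slots?
    counts = Counter(today)
    for developer, n in counts.items():
        share = n / len(today)
        if share > max_share:
            print(f"ALERT: {developer} holds {share:.0%} of today's top list")

    # Turnover check: a list that barely changes can signal a feedback loop,
    # since being featured drives downloads, which keeps an app featured.
    if len(daily_top_lists) > 1:
        yesterday = set(daily_top_lists[-2])
        overlap = len(set(today) & yesterday) / len(set(today))
        if overlap > max_overlap:
            print(f"ALERT: {overlap:.0%} overlap with yesterday's list")
```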
New: As a follow up, knowing that algorithmic systems will always carry some of this risk, what’s the best way to educate the public about how these systems work to dispel some of the hysteria that surrounds them? We hear the phrase “algorithmic transparency” a lot in this space, but just opening up datasets and algorithms to the public doesn’t seem like a useful approach here.
Lotan: I think transparency misses the point. If we published the ranking algorithm that chooses what appears at the top of people’s feeds, it wouldn’t actually solve the problem and wouldn’t help people understand what’s going on. Just looking at data or code isn’t helpful in identifying bias. What you really want to do is think about accountability, rather than transparency.
So how do we hold an algorithmic system accountable? There are a lot of ideas about how to do this. One of them is to monitor inputs and outputs. Google and other platforms actually let you look at the data about who they think you are, which influences how they rank search results or recommendations. By building multiple profiles and educating the public about how these different profiles influence the outputs of an algorithmic system, companies could help people understand how different users experience their platform. At BuzzFeed, we’re trying out a new feature on our news site called Beyond the Bubble, which aims to show how different kinds of conversation on the same topic surface in different spheres across the web.
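To make the inputs-and-outputs idea concrete, here is a minimal sketch of a profile-based audit: run the same ranking function for several synthetic profiles and compare what surfaces at the top. The profiles and the rank_for_profile function are hypothetical stand-ins, not BuzzFeed’s or any platform’s actual API.

```python
# Minimal sketch of a profile-based audit: run the same (hypothetical)
# ranking function for several synthetic profiles and compare the outputs.
from collections import Counter

# Synthetic profiles that differ only in the attributes we want to test.
profiles = {
    "profile_a": {"age": 25, "region": "urban", "interests": ["tech", "politics"]},
    "profile_b": {"age": 60, "region": "rural", "interests": ["sports", "politics"]},
}


def audit(rank_for_profile, top_n=10):
    """rank_for_profile(profile) -> ordered list of item ids (hypothetical)."""
    rankings = {name: rank_for_profile(p)[:top_n] for name, p in profiles.items()}

    # Pairwise overlap: low overlap means the system shows very different
    # "worlds" to different users for the same feed or query.
    names = list(rankings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = len(set(rankings[a]) & set(rankings[b])) / top_n
            print(f"{a} vs {b}: {shared:.0%} of the top {top_n} items shared")

    # Items that surface for only one profile are good candidates for review.
    counts = Counter(item for items in rankings.values() for item in items)
    print("profile-exclusive items:", [item for item, c in counts.items() if c == 1])
```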
New: You’ve written a lot about the increasing prevalence of fake news as well as propaganda, which you believe is a more serious problem. How can data science play a role in reducing the influence of this kind of content?
Lotan: I do think propaganda is a much more important issue. False information is an important problem, but it’s also a more solvable one. There are more robust solutions to deal with fake news than there are for propaganda, because it’s much harder to draw the line on what makes something propaganda or not. Propaganda that intentionally leaves out important pieces of information is much more prevalent and has a greater impact than fake news. I’ve written several pieces about propaganda and misinformation, and the primary goal of that kind of content is agenda-setting.
It’s been great to see a number of data science initiatives pop up in the last few months that deal with classifying fake information. This could allow us to build predictive models that figure out what content is fake based on historical examples. Think of it like spam. Spam used to be a huge issue, and everyone got a lot of spam in their inboxes. But now we do a pretty good job of identifying and filtering out spam because spam messages tend to share a lot of similar traits. There are a lot of companies that work together to ensure they have the most up-to-date models for spam detection. Given enough historical data, we could build similarly robust models for identifying fake news by collaborating across the industry.
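As a rough illustration of the spam analogy, here is a minimal sketch of a supervised text classifier trained on historically labeled examples, assuming such a labeled corpus exists. The inline data is a placeholder, and a real system would combine many more signals than headline text.

```python
# Rough sketch of the spam analogy: a supervised text classifier trained on
# historically labeled examples. The tiny inline dataset is a placeholder
# for a real labeled corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "Scientists confirm discovery of new exoplanet",
    "Doctors hate this one weird miracle cure",
    "Senate passes budget bill after lengthy debate",
    "Leaked document proves candidate faked moon landing",
]
labels = [0, 1, 0, 1]  # 0 = legitimate, 1 = fake

# TF-IDF features plus logistic regression, the same basic recipe early spam
# filters used: learn the textual traits that historical fakes tend to share.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(headlines, labels)

# Score a new, unseen headline. In practice this probability would be one
# signal among many, not an automatic verdict.
new_headline = ["Miracle cure the government does not want you to see"]
print(model.predict_proba(new_headline)[0][1])
```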
When it comes to propaganda, this is harder. We have to be wary of the actors at play. Where do you draw the line when outlets with different political leanings are focusing more on agenda-setting than informing? How do you filter out this propaganda automatically? I’m not really sure there’s a simple solution here. Fortunately there’s a lot of work being done by a lot of very smart people to experiment with different solutions.
New: You’ve also written about a related topic which you call “media hacking.” What is media hacking, and what was its role in the recent election? How can online platforms become more resilient to this tactic?
Lotan: Media hacking is the use or manipulation of social media and its associated algorithms to define a narrative or advance a political agenda. It can be done by both state and non-state actors. Something that’s fairly prevalent today is certain groups trying to force something into trending on Twitter, which can get concepts or organizations a huge amount of attention. This is hacking because you’re effectively deconstructing the algorithmic system that powers Twitter’s trending topics module and figuring out ways to exploit it. It’s not entirely different from search engine optimization—taking steps to make your website more visible in searches. This doesn’t just happen on Twitter—people media hack Facebook news feeds, Google rankings, and so on.
Over the last few months, we saw that groups of political actors had a really good sense of how to exploit these algorithms and were organized enough to take advantage of them. One interesting example I’ve written about is “Hillary’s health.” It was effectively a conspiracy theory on some forums alleging that Hillary Clinton had major health problems, but you could see this concept seep into social media. You can still see the effects of this today, as conspiracy theory videos are usually in the top results when you search for this topic because people kept clicking on them.
So how do we become more resilient to this kind of hacking? That’s a great question, and the solution is really complex. One good approach is what Google has been doing for years with search: continuously evolving their algorithmic systems and deliberately not releasing some information about how they work, to make them harder to game. There’s definitely an important manual component here too, and humans can play a valuable role in protecting these algorithmic systems.