The Center for Data Innovation spoke with Mauricio Santillana, assistant professor at Harvard Medical School and a faculty member of the Computational Health Informatics Program at Boston Children's Hospital. Santillana discussed how Google and Twitter data can help track diseases effectively, how the U.S. Centers for Disease Control and Prevention (CDC) is incorporating such data into its estimates, and how other non-traditional data sources can help researchers.
This interview has been edited.
Michael McLaughlin: You've used Google Search query and Twitter data in several of your studies. What makes these data sources so useful in predictive models? Are there any downsides?
Mauricio Santillana: I mainly work on machine-learning approaches aimed at monitoring and forecasting disease activity affecting populations around the world. Google searches and Twitter microblogs were not originally designed to track epidemic events. Instead, they reflect population-level interest in certain topics over time. With this in mind, we have developed ways to investigate whether specific disease-related Internet search patterns contain a "signal," or signature, that could be useful for monitoring disease activity. Our findings show that, historically, Internet search activity for terms such as "flu symptoms" or "fever" shows trends similar to those observed in patients seeking medical attention with flu-like symptoms. In other words, when more people search for these disease-related terms, more people with flu-like symptoms frequently show up at hospitals seeking care.
Twitter microblogs that contain information about a user experiencing symptoms such as a cough, sore throat, or fever display similar historical trends.
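To make the kind of check Santillana describes concrete, the following is a minimal sketch, not his team's actual pipeline: it uses entirely synthetic weekly data to ask whether a search-volume series for a hypothetical flu-related term tracks, or slightly leads, a series of influenza-like-illness (ILI) visits.

```python
# A minimal sketch with synthetic data (not the team's actual pipeline):
# does a weekly search-volume series track, or slightly lead, weekly
# influenza-like-illness (ILI) visits?
import numpy as np

rng = np.random.default_rng(0)
weeks = 104
season = np.sin(np.linspace(0, 4 * np.pi, weeks)) ** 2   # two synthetic flu seasons
ili_visits = 1000 * season + rng.normal(0, 50, weeks)     # clinically reported ILI visits
search_volume = 80 * season + rng.normal(0, 10, weeks)    # searches for a flu-related term

# Pearson correlation with the search series leading by 0, 1, or 2 weeks.
for lag in range(3):
    r = np.corrcoef(search_volume[: weeks - lag], ili_visits[lag:])[0, 1]
    print(f"search leads ILI by {lag} week(s): r = {r:.2f}")
```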
These two data sources are useful because the information collected by traditional healthcare-based methods to monitor influenza arrives with delays of a week or two. Traditional epidemiological surveillance methodologies do not provide a real-time picture of the situation, nor do they provide an estimate of what may come in future weeks. We have found that data sources like Google searches, electronic health records, Twitter microblogs, and weather patterns, when combined with effective machine-learning methodologies, can enhance our understanding of what is happening now in the population and what may happen in the near future.
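In the same spirit, here is a hedged sketch of what combining several noisy proxy signals into a single real-time estimate, or "nowcast," can look like. The signals, their scales, and the regularized linear model are all illustrative assumptions, not the team's published methodology.

```python
# A minimal sketch, assuming synthetic data: combine several noisy proxy
# signals (search volume, Twitter mentions, EHR visit counts) into a single
# nowcast of current flu activity with a regularized linear model.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
weeks = 156
truth = 100 * np.sin(np.linspace(0, 6 * np.pi, weeks)) ** 2  # underlying flu activity

# Each proxy observes the truth with its own scale and noise.
X = np.column_stack([
    0.8 * truth + rng.normal(0, 15, weeks),   # search volume
    0.3 * truth + rng.normal(0, 20, weeks),   # Twitter mentions
    1.1 * truth + rng.normal(0, 10, weeks),   # EHR flu visits
])

train = slice(0, 130)    # past weeks with confirmed surveillance data
test = slice(130, 156)   # recent weeks where official data are still delayed

model = Lasso(alpha=1.0).fit(X[train], truth[train])
nowcast = model.predict(X[test])
rmse = np.sqrt(np.mean((nowcast - truth[test]) ** 2))
print(f"nowcast RMSE over held-out weeks: {rmse:.1f}")
```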
There are downsides. For example, Google constantly changes the way its search engine provides information. These modifications particularly affect the way people search for medical information and thus the ways in which specific search terms provide us with meaningful information to include in our predictive models. Also, we are all susceptible to "panic searching" when news outlets alert us to unusual flu, dengue, or Ebola outbreaks. As a consequence, peaks of search activity and Twitter microblogs may only signal a population's surge of interest in a disease-related topic and may not reflect actual infections.
Michael McLaughlin: I recently read about your work using machine learning to create a model that accurately estimates influenza activity. How does this differ from the now-defunct Google Flu Trends and from the CDC's approach?
Mauricio Santillana: The now-discontinued Google Flu Trends was an important effort implemented by Google to help monitor flu and dengue activity in multiple locations. Its initial implementation in late 2008 had some important limitations that led to poor prediction performance. Revisions made by Google improved the performance, but it never met the public's expectations. My team and I proposed methodological improvements that borrowed ideas from weather forecasting systems, which continuously learn from observations as they become available. These methods are built so that the model keeps learning to perform its task better as new data arrive. We showed that these ideas could greatly improve flu estimates that used only Google search information and historical flu trends. Later, we introduced the use of multiple data sources, such as electronic health records, Twitter, crowd-sourced disease surveillance platforms, and connectivity between locations, to make predictions more robust and less susceptible to undesirable inaccuracies emerging from noise in any single signal within a given data source. In short, our methods have drastically improved upon the original ideas that Google and other research teams proposed in the early 2000s.
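The "continuously learn from observations" idea can be illustrated with a small toy: a model that is refit every week on all the surveillance data observed so far, using recent flu history plus a real-time proxy signal. The features, model, and data below are assumptions for illustration, not the team's published model.

```python
# A minimal sketch of weekly retraining as new observations arrive
# (illustrative only, not the team's published model).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
weeks = 200
flu = 100 * np.sin(np.linspace(0, 8 * np.pi, weeks)) ** 2 + rng.normal(0, 5, weeks)
search = 0.9 * flu + rng.normal(0, 10, weeks)  # proxy signal available in real time

errors = []
for t in range(60, weeks):
    # Refit each week on all history seen so far.
    # Features for week i: flu in weeks i-1 and i-2, plus week i's search signal.
    X_hist = np.column_stack([flu[1:t-1], flu[:t-2], search[2:t]])
    y_hist = flu[2:t]
    model = Ridge(alpha=1.0).fit(X_hist, y_hist)

    x_now = np.array([[flu[t-1], flu[t-2], search[t]]])
    errors.append(abs(model.predict(x_now)[0] - flu[t]))

print(f"mean absolute nowcast error: {np.mean(errors):.1f}")
```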
All of our efforts to monitor and forecast influenza activity in the United States have involved close interactions with public health officials from the CDC. In fact, during the past three years, we have provided the CDC with our own flu estimates at multiple spatial resolutions in real time. The CDC has created a collaborative environment that welcomes predictions from multiple teams and then combines them to produce "ensemble" flu estimates that incorporate the predictions from these teams.
In short, we are one of many teams that work closely with the CDC to improve the way we produce and communicate disease estimates. In fact, the work you mentioned, where we documented the use of machine learning to produce improved flu estimates at the state level in the United States, included the contribution of a CDC official.
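The "ensemble" idea Santillana mentions can be shown with made-up numbers: each team submits a weekly estimate, and a combined estimate weights teams by their recent accuracy. The inverse-error weighting below is purely illustrative; the CDC's actual combination scheme is not described in this interview.

```python
# A minimal sketch with invented numbers: combine several teams' weekly
# flu estimates, weighting each team by its recent accuracy.
import numpy as np

observed_last_week = 4.2                                # % of visits that were flu-like (hypothetical)
team_estimates_last_week = np.array([4.6, 3.9, 4.1])    # what each team had predicted
team_estimates_this_week = np.array([5.0, 4.4, 4.7])    # new submissions to combine

# Weight each team inversely to its recent absolute error (one of many
# possible schemes, chosen only for illustration).
errors = np.abs(team_estimates_last_week - observed_last_week)
weights = 1.0 / (errors + 1e-6)
weights /= weights.sum()

ensemble = float(weights @ team_estimates_this_week)
print(f"ensemble flu estimate this week: {ensemble:.2f}%")
```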
Michael McLaughlin: What kind of value does this kind of predictive model have for public health officials? How receptive are public health officials to using estimates that do not come from official government sources?
Mauricio Santillana: Public health officials have shown mixed receptivity to these big-data predictive approaches. Many show healthy skepticism, and many are willing to engage with us to find ways to incorporate our disease estimates into the day-to-day decision-making processes within their agencies. It is a work in progress, but I am optimistic. I think there is great value in what we do, and we are learning to better identify what is useful to decision makers.
Michael McLaughlin: Could this model be useful for tracking other public health issues, or is the flu uniquely suited to this approach?
Mauricio Santillana: We have extended our approaches to monitor diseases other than flu, including dengue fever, Zika, Ebola, cholera, plague, and yellow fever, among others. While some of our manuscripts documenting these efforts have been published, others are still in peer review or in preparation.
Michael McLaughlin: What other kinds of data sources could be useful for this kind of disease tracking but that you cannot access or use? What are the hurdles to accessing them?
Mauricio Santillana: It would be great to gain access to more detailed Internet search information. Giants like Google, Facebook, and Amazon have incredible data that could be used to improve the monitoring of public health on our planet. For example, cell phone data is hard to obtain, but when it is available, with the proper privacy protections for users, it gives us a better sense of where people are traveling to and from. If many people from an infected area are traveling to other locations, then we can expect epidemic outbreaks to appear later in those destination locations.
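That last point can be made concrete with invented numbers: if mobility data show how many travelers leave an infected area for each destination, the expected number of imported cases scales with the origin's infection prevalence times the travel volume. The cities, flows, and prevalence below are hypothetical.

```python
# A minimal sketch with invented numbers: expected imported cases at each
# destination scale with the origin's prevalence times the travel volume.
prevalence_origin = 0.02               # fraction of the origin population infected (hypothetical)
travelers_per_week = {                 # hypothetical mobility-derived flows out of the origin
    "City A": 50_000,
    "City B": 12_000,
    "City C": 3_000,
}

for city, flow in travelers_per_week.items():
    expected_imports = prevalence_origin * flow
    print(f"{city}: ~{expected_imports:.0f} expected imported cases per week")
```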