The Center for Data Innovation spoke to Andrew Tatem, professor of spatial demography and epidemiology at the University of Southampton in England and director of WorldPop, an applied research group he founded. Tatem discussed how his team used mobility data to assess the risk of COVID-19 spreading out of Wuhan in the early days of the pandemic and how to address biases in mobility data.
Hodan Omaar: Can you talk a little bit about the work WorldPop does?
Andrew Tatem: We are an applied research group whose core focus is to improve the timeliness and spatial detail of data on population counts. Understanding the numbers, characteristics, and locations of human populations is central to informing public policy analysis and underpinning scientific development, but data on populations can be outdated or have low granularity. We work on integrating new, innovative geospatial datasets with more traditional data sources to paint a more accurate and timely picture of population distributions, compositions, characteristics, growth, and dynamics.
Consider what we do to be like constructing a data sandwich. First, we partner with data providers to get geospatial datasets that represent different characteristics of a region, such as census data matched to boundaries, household surveys, satellite imagery, mobile network data, and more recently mobility data from tech companies. Then we use these data within statistical models to produce gridded population count datasets, which just means that the resulting maps provide population estimates for grid cells of 100 x 100 meters. This puts the data in a consistent, comparable format and gives us the flexibility to inform policy at a more granular level. Finally, we work with the end users, who are a variety of governments and agencies, to implement interventions based on our data and research.
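To make the idea of a gridded population dataset concrete, here is a minimal sketch of one simple way to spread a census count across grid cells. It is only an illustration and not WorldPop's actual statistical model: the function name `disaggregate_population` and the weights are hypothetical, and the weights are assumed to come from some covariate such as building density seen in satellite imagery.

```python
def disaggregate_population(census_total, cell_weights):
    """Split one administrative unit's census count across its grid cells
    in proportion to each cell's weight (hypothetical helper)."""
    total_weight = sum(cell_weights)
    if total_weight <= 0:
        raise ValueError("at least one grid cell needs a positive weight")
    return [census_total * w / total_weight for w in cell_weights]


# Assumed example: one administrative unit with 8,000 residents and four
# grid cells. The weights are made up for illustration, e.g. a building
# density score per cell; a weight of 0 marks an uninhabited cell.
weights = [0.0, 1.0, 3.0, 4.0]
cells = disaggregate_population(8000, weights)
print(cells)
```

The cell estimates always sum back to the census total, which is what keeps the gridded map consistent with the official count for the unit.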
We have more than 44,000 open datasets that are publicly available today across a variety of areas including births, pregnancies, age and sex structures, and internal migration. These datasets support research in areas such as accessibility modeling, resource allocation, disaster management, transport and city planning, and more recently epidemic modeling.
Omaar: You’ve been using data from mobile phones for many years. What’s so different about the smartphone data you’ve been using more recently?
Tatem: There are broadly two types of mobile-based positioning data: call detail records (CDRs) and application location history. CDRs are data that cell phone providers collect for billing purposes. Every time a person makes or receives a call or a text, their phone transmits or receives a signal from the nearest cell tower. Cell phone providers like Vodafone have logs of this information and can provide anonymized, aggregated data to help researchers explore the transmission of diseases. We have long been using CDRs to aid in the study of the transmission of many diseases such as malaria, dengue, cholera, and measles.
Application location history is data that is collected when a person has authorized location services on a smartphone application, providing a history of locations that a person has been to. For example, Facebook’s Data for Good program has provided anonymized, aggregated mobility data that is derived from this information and shows broad patterns about where people are moving from and where they are moving to. The benefit of Facebook mobility data is that it is consolidated to a user account, meaning it’s collected across all the different devices that a person is using the app on. The data is also precise because location is identified using the device’s internal GPS and connected Wi-Fi devices, which makes the data really useful for mapping travel routes across time and space. We’ve been using these data to support studies on the transmission of COVID-19 and to analyze the impact different interventions have had on its spread.
Omaar: WorldPop conducted one of the most interesting early studies illustrating the risk of COVID-19 spreading out of Wuhan. What’s the story behind that model?
Tatem: Yes, in February 2020 we published a model that assessed the risk of COVID-19 spreading in mainland China based on movement into and out of Wuhan during and after the Chinese New Year holiday. The holiday travel period, which is about 40 days long, is the largest annual human migration in the world, with hundreds of millions of people traveling across the country. To identify the regions most vulnerable to virus importation, we created a model to show the risk of the virus spreading using de-identified, aggregated mobility data from Baidu, the largest Chinese search engine; air passenger itinerary data from the International Air Transport Association (IATA); and case reports from the Chinese Center for Disease Control and Prevention (Chinese CDC). We found that a high volume of international airline travelers left Wuhan for hundreds of destination cities across the world during the two weeks before the first travel ban was implemented in the city. A range of policymakers used our outputs to prioritize the allocation of limited resources for surveillance to locations where the risks of importation were calculated to be highest.
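As a rough illustration of how outbound travel volumes can be turned into a priority list for surveillance, the sketch below ranks destinations by their share of travelers leaving an origin city. This is a deliberately simplified assumption, not the published model, which also drew on case reports. The function names and the traveler counts are hypothetical.

```python
def relative_importation_risk(outflows):
    """Return each destination's share of all outbound travelers.

    `outflows` maps a destination name to a traveler count. Using the share
    of travelers as a proxy for risk is a simplifying assumption.
    """
    total = sum(outflows.values())
    if total <= 0:
        raise ValueError("outflows must contain at least one traveler")
    return {city: count / total for city, count in outflows.items()}


def rank_destinations(outflows):
    """List destinations from highest to lowest relative risk."""
    risk = relative_importation_risk(outflows)
    return sorted(risk, key=risk.get, reverse=True)


# Made-up traveler counts for three placeholder destinations.
outflows = {"city_a": 5000, "city_b": 1500, "city_c": 3500}
risk = relative_importation_risk(outflows)
ranking = rank_destinations(outflows)
print(ranking)
```

A ranking like this only reflects where travelers went, so in practice it would need to be combined with information about infection levels at the origin.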
We saw that such early interventions were extremely important. In a second study we published in May, we looked at the effect of early non-pharmaceutical interventions in China. These interventions included travel restrictions, early detection and isolation of cases, and social distancing measures. We built a simple model simulating the intensity of these measures using Baidu-based mobility data and case data from the Chinese CDC and found that if China had not implemented any of these interventions between January and February 2020, we would likely have seen 67 times as many COVID-19 cases in the country by the end of February. Our research also showed that the most impactful interventions in China during that time were early detection and isolation measures.
Omaar: Mobility data from apps usually exclude users under the age of 18 and users who have their location services turned off, and are not representative of the very elderly. How does this affect your work?
Tatem: In general, the biases you have to contend with in a dataset depend on the purpose of your analysis. For example, the group at greatest risk from serious complications of malaria is young children, so using mobile data to track how people move is a problem when children do not typically carry mobile phones. But if you are looking at general flows of mobility or the change of flow into an area to study impacts on the economy, the fact that mobile data doesn’t cover children will be less of a problem.
What we have seen, though, is that biases don’t just happen due to age but also due to socioeconomic status, and that can really affect estimates of human mobility. In countries where we’ve conducted multi-year studies, such as Namibia, we’ve seen the adoption of mobile phones change substantially. Adoption starts with the richest people getting phones first and then, as costs come down, people who are less well-off start getting phones. So in the earlier days of adoption, phone usage is not really representative of the population. This is one of the major challenges of using mobile phone datasets across long time periods in such countries. We have to account for these biases by corroborating the data against more traditional data sources like household surveys that ask standardized questions at regular intervals.
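One common way to adjust for this kind of ownership bias is post-stratification, sketched below under stated assumptions. The interview does not say which correction method WorldPop uses, so this is only an illustration. The group names, shares, and trip counts are invented, and the population shares are assumed to come from a household survey.

```python
def poststratification_weights(population_share, sample_share):
    """Weight each group by how under- or over-represented it is among
    phone users. Assumes every group has a non-zero sample share."""
    return {g: population_share[g] / sample_share[g] for g in population_share}


def weighted_mean(group_mean, sample_share, weights):
    """Average a per-group quantity after reweighting the phone sample."""
    numerator = sum(group_mean[g] * sample_share[g] * weights[g] for g in group_mean)
    denominator = sum(sample_share[g] * weights[g] for g in group_mean)
    return numerator / denominator


# Invented figures: wealthier people are a quarter of the population
# (household survey) but half of the phone users in the mobility data.
population_share = {"higher_income": 0.25, "lower_income": 0.75}
sample_share = {"higher_income": 0.5, "lower_income": 0.5}
trips_per_week = {"higher_income": 8.0, "lower_income": 4.0}

naive = sum(trips_per_week[g] * sample_share[g] for g in trips_per_week)
weights = poststratification_weights(population_share, sample_share)
adjusted = weighted_mean(trips_per_week, sample_share, weights)
print(naive, adjusted)
```

The adjustment only works for groups that appear in the phone data at all. A group with no phone users has no observations to reweight, which is one reason some biases cannot be corrected.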
In the end, there are some biases we can correct for and others that we can’t. The key is to present the limitations of the data as openly as possible and to make sure that policymakers who are using the data recognize them.
Omaar: Why are there such differences in the availability of mobility data?
Tatem: We’ve been working with mobile providers for the last 20 years and are now increasingly working with tech companies, as well as long-term collaborators such as Flowminder, a non-profit focused on using data for good. There is a wide range of sharing models because each company is in a very different situation. In some cases, it’s a company in a low-income country where there just isn’t the infrastructure to share data or there is limited capacity to comply with data protection regulations and handle sensitive data, so companies just make the decision not to share. In other cases, companies do share their data, but make a decision to share it with a select number of groups. In Facebook’s case, they’ve made the decision to provide certain heavily processed outputs to any non-profits and academics that are interested in working with the data.
Since the focus of our work is layering many different types of data, it is important for us to have a wide range of geospatial datasets. In lower-income countries, using mobile phone data is a promising avenue for complementing more traditional national statistics and obtaining more timely and local data. These data can really inform policies geared toward providing public services, so better partnerships between governments and phone companies supported by appropriate incentives could allow for more accurate and rapid production of national migration statistics to complement census and survey-based data collection.