5 Q’s for the Data Science and Mining team at École Polytechnique about Facebook’s Data for Good Program

The Center for Data Innovation spoke with Michalis Vazirgiannis, George Panagopoulos, and Giannis Nikolentzos, who are members of the Data Science and Mining team at the Computer Science Laboratory (LIX) of École Polytechnique in France. The team discussed how they have developed a machine learning model that can forecast the number of COVID-19 cases a region will experience up to 14 days ahead of time.

Hodan Omaar: What is the goal of your work, and how are you using Facebook data to achieve this goal?

Michalis Vazirgiannis: We created a model to forecast the number of COVID-19 cases in small regions of European countries during the first wave of the pandemic. Capturing and modeling the dynamics of disease diffusion lends itself well to graphs, which in computer science are a data structure used to represent networks. We created a graph in which nodes represent the regions of a country and edges, each connecting two nodes, represent the daily movement between those regions.

George Panagopoulos: To create these graphs, we obtained data about COVID-19 cases in particular regions from open sources such as GitHub or government websites, and mobility data from Facebook’s Data for Good program. Then we built graphs to encode this data for the four European countries that were initially hit hardest by the pandemic: France, Italy, Spain, and the United Kingdom. Essentially, the graphs model how COVID-19 spreads between regions in these countries.
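As a rough illustration of the data structure described here (not the team's code), the sketch below builds a weighted, directed graph from daily movement records. The region names, counts, and the `build_mobility_graph` helper are all hypothetical, and treating movement within a region as a self-loop is an assumption made for the example.

```python
from collections import defaultdict

# Hypothetical daily movement records: (origin, destination, people moved).
# Names and counts are invented for illustration only.
movements = [
    ("Paris", "Lyon", 1200),
    ("Lyon", "Paris", 900),
    ("Paris", "Marseille", 400),
    ("Lyon", "Lyon", 5000),  # assumed: movement within a region is a self-loop
]

# Hypothetical new-case counts per region for the same day (node features).
cases = {"Paris": 310, "Lyon": 120, "Marseille": 85}


def build_mobility_graph(records):
    """Return a weighted, directed adjacency map: graph[origin][dest] = weight."""
    graph = defaultdict(dict)
    for origin, dest, count in records:
        # Sum counts if the same origin-destination pair appears more than once.
        graph[origin][dest] = graph[origin].get(dest, 0) + count
    return {origin: dict(dests) for origin, dests in graph.items()}


graph = build_mobility_graph(movements)
nodes = sorted(cases)  # one node per region that reports cases
```

Each day of data would yield one such graph, with the case counts attached to the nodes and the movement volumes attached to the edges.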

The idea is that once you have graphs representing how the disease spreads between regions in a country, you can train an AI model to learn from these graphs and predict how and where the disease is likely to spread next. This type of model is called a graph neural network (GNN), which is a kind of deep learning model. You can also apply a model trained on one country to another country and still make useful predictions, which is called transfer learning.
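The core operation in a GNN is that each node combines information from its neighbors. The sketch below shows one heavily simplified step of that kind, in which every region's signal becomes the mobility-weighted average of case counts in the regions that send people to it. It is an assumption-laden toy: a real GNN would also apply learned weights and a nonlinearity, the `aggregate_step` function is hypothetical, and nothing here is taken from the team's architecture.

```python
def aggregate_step(inflow, cases):
    """One simplified neighborhood-aggregation step.

    inflow[dest][origin] is the mobility weight from origin into dest.
    Each region's new signal is the mobility-weighted average of the case
    counts in the regions that send people to it.
    """
    signal = {}
    for dest in cases:
        sources = inflow.get(dest, {})
        total = sum(sources.values())
        if total == 0:
            # No recorded inflow: keep the region's own case count.
            signal[dest] = float(cases[dest])
        else:
            signal[dest] = sum(w * cases[o] for o, w in sources.items()) / total
    return signal


# Hypothetical two-region example with self-loops for internal movement.
inflow = {
    "A": {"A": 8, "B": 2},
    "B": {"A": 5, "B": 5},
}
cases = {"A": 100, "B": 20}
signal = aggregate_step(inflow, cases)
```

In this toy setup, region B receives half of its movement from the harder-hit region A, so its aggregated signal rises well above its own case count.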

While other researchers are using GNNs to predict the spread of COVID-19, our model is different because it uses less data to make accurate predictions. For instance, there are GNNs that have been developed to predict the spread of COVID-19 in U.S. counties, but these models need 50 to 60 days’ worth of training data. That means the disease has to spread in a region for almost two months before the model can make credible predictions about where the disease will spread next. But by then, too many lives have been lost. Our model improves on this by reducing the amount of training data needed to 14 days. We are able to do this by utilizing a technique called model-agnostic meta-learning (MAML), which uses transfer learning to capitalize on knowledge from other countries’ models.
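To give a feel for how meta-learning can cut the data a new region needs, here is a minimal sketch of the MAML idea on a toy problem. It uses a one-parameter linear model and the first-order approximation of MAML (it ignores second derivatives), and each "task" stands in for a country. The tasks, learning rates, and function names are all assumptions for illustration, not the team's setup.

```python
def loss(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad(w, data):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def adapt(w, support, inner_lr=0.05):
    """Inner loop: one gradient step on a task's small support set."""
    return w - inner_lr * grad(w, support)


def make_task(slope):
    """Toy stand-in for a country: points on the line y = slope * x."""
    support = [(x, slope * x) for x in (1.0, 2.0)]
    query = [(x, slope * x) for x in (3.0, 4.0)]
    return support, query


def meta_train(tasks, w=0.0, outer_lr=0.01, epochs=200):
    """Outer loop: find a start point from which one inner step works well."""
    for _ in range(epochs):
        meta_grad = 0.0
        for support, query in tasks:
            adapted = adapt(w, support)
            # First-order approximation: take the query gradient at the
            # adapted weight and ignore how `adapted` depends on w.
            meta_grad += grad(adapted, query)
        w -= outer_lr * meta_grad / len(tasks)
    return w


train_tasks = [make_task(s) for s in (2.0, 3.0, 4.0)]
w_meta = meta_train(train_tasks)

# A new, unseen task: adapt with a single step from each starting point.
new_support, new_query = make_task(3.5)
loss_meta = loss(adapt(w_meta, new_support), new_query)
loss_scratch = loss(adapt(0.0, new_support), new_query)
```

Comparing `loss_meta` with `loss_scratch` shows the intended effect: starting from the meta-learned weight, a single small update on the new task already fits it far better than the same update from an uninformed start.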

Omaar: Who is the end user of the insights from your project? What decisions do you anticipate they will make with your work?

Vazirgiannis: The French government funded our project with the goal of developing innovative solutions to COVID-19 related problems. We have shown that our model produces accurate predictions about the spread of COVID-19 up to 14 days ahead. We hope that policymakers use our model to make more informed decisions about interventions and resource allocations, especially at the local level.

If policymakers use our model to make better predictions at regional rather than national levels, interventions such as lockdowns can be more targeted. Targeted interventions would be easier to manage and would have a better impact on the economy and society as a whole.

Omaar: What are the limitations of the model?

Giannis Nikolentzos: One of the main limitations of our work is that currently, mobility data for any particular region is limited to a fixed number of other regions. For example, if you want mobility data about Cannes in France, you might only be able to investigate how people moved from Cannes to other regions in France. You wouldn’t be able to investigate how people moved from Cannes to, say, Zurich in Switzerland. Availability of mobility data between regions at the granular level is limited in range.

Of course, it must be said that there are challenges to increasing this range. As the number of cities you can query between increases, the amount of data grows much faster, since every new city adds a pair with each existing one. Handling this type of data requires more complex data infrastructure with bigger data pipelines to process it all.

Omaar: What types of challenges do you face in validating the data?

Panagopoulos: To gather data about COVID-19 cases in different countries, we had to traverse different government websites that were all in different languages. What’s more, we had to map this data against the mobility data from Facebook. That means we had to match geocodes between the two datasets, which is hard because there is no standard way to do it.

In some cases, data is aggregated at different levels. For example, the mobility data may be summed over a region that covers 1,000 people while the COVID-19 data for the same region is aggregated over hundreds of people. This means we have to manually aggregate the COVID-19 data over subregions to match the format of the mobility data. In other cases, a region might be missing from one dataset and available in the other, or the two datasets might use different names for the same region! It took a lot of work to map the data correctly.
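The matching step described here can be sketched as a lookup followed by a sum. The example below is an assumption-based illustration, not the team's pipeline: the lookup table, the case counts, and the `aggregate_cases` helper are hypothetical, and one subregion is deliberately left out of the lookup to show how unmatched names can be flagged for manual review.

```python
# Hypothetical lookup from case-data subregion names to mobility-data regions.
SUBREGION_TO_REGION = {
    "Cannes": "Alpes-Maritimes",
    "Nice": "Alpes-Maritimes",
    "Toulon": "Var",
}


def aggregate_cases(subregion_cases, lookup):
    """Sum subregion case counts up to the regions used by the mobility data.

    Returns (totals, unmatched), where unmatched lists subregion names that
    have no entry in the lookup and need manual review.
    """
    totals = {}
    unmatched = []
    for name, count in subregion_cases.items():
        region = lookup.get(name.strip())  # tolerate stray whitespace in names
        if region is None:
            unmatched.append(name)
            continue
        totals[region] = totals.get(region, 0) + count
    return totals, unmatched


# Invented counts; "Antibes" is intentionally absent from the lookup.
subregion_cases = {"Cannes": 12, "Nice ": 30, "Toulon": 45, "Antibes": 7}
totals, unmatched = aggregate_cases(subregion_cases, SUBREGION_TO_REGION)
```

Returning the unmatched names instead of silently dropping them keeps missing or renamed regions visible, which is where most of the manual mapping work tends to go.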

Omaar: Looking to the future, will you be doing more work in this area?

Panagopoulos: Yes. The first leg of our project focused on the dynamics of the first wave of COVID-19 cases, and now we will expand our research to study the second wave. One of the interesting observations from our earlier work was that although mobility between regions fell and stayed low over time, mobility within regions changed. After an initial period of lockdown, mobility within regions began to increase over time. We will take these observations into the next stage of our research.

Nikolentzos: We will also incorporate some additional features into future models, such as demographics regarding age and gender, as well as features related to the weather. We are also looking at including additional data from Facebook, such as the intensity of connectedness between regions measured by the friendship relationships between two regions.
