The Center for Data Innovation spoke with Francine Berman, the chair of the Research Data Alliance/U.S., an international organization that convenes working groups to break down infrastructural and cultural barriers to research data sharing. Dr. Berman, who is also a professor of computer science at Rensselaer Polytechnic Institute, discussed the dire need to develop models for funding research data management and stewardship, and why we should not forget about “data infrastructure” when talking about how to maximize data-driven innovation.
Travis Korte: First, can you introduce the Research Data Alliance and how it hopes to improve research data sharing around the globe?
Francine Berman: The Research Data Alliance (RDA) is a global organization whose mission is to develop and deploy the social, organizational, and technical infrastructure needed to share and exchange data. The organization’s focus is on implementation and impact, with an overarching goal of creating useful infrastructure (tools, policy, practice, standards, etc.) that contributes to the development and coordination of the global information infrastructure needed for data-driven research.
In practice, RDA members from across the data community come together to identify problems or areas in which specific data sharing infrastructure is needed before work can proceed. They form RDA working groups that create, adopt, and use the needed infrastructure within a 12- to 18-month timeframe so that progress can be made.
TK: You’re chair of the RDA’s United States chapter. What’s the difference between RDA and RDA/U.S.?
FB: RDA/U.S. is made up of all U.S. members of the RDA. At present, RDA has 1,850+ members from over 80 countries across the globe. Over 660 of those members are from the United States, spanning 42 states and all sectors. RDA/U.S. focuses on three areas: 1) U.S. contributions to the leadership, impact, and data agenda of the global RDA organization, 2) leverage and amplification of RDA-developed infrastructure to benefit projects and organizations within the United States, and 3) exchange programs with other regions across the globe focusing on outreach, infrastructure adoption, and development of students and early career professionals within the data community.
With the support of the National Science Foundation, we’ve been ramping up RDA/U.S. activities this year. We’re looking forward next year to deeper engagement and partnerships with a broad set of U.S. R&D agencies, organizations, institutions, and community groups.
TK: What are some of the policy barriers that make it more difficult for researchers to share data?
FB: Your mileage varies as to whether research communities are supportive of data sharing or not. For some communities and groups (e.g., in many areas of the life sciences), data provides a competitive advantage, and groups are not as inclined to share data until they have mined it sufficiently for relevant publications and grants. In other areas (e.g., many areas of computer science), there is no downside (and there are often upsides) to sharing data freely.
In communities more reluctant to share data, policy can make a tremendous difference. The National Institutes of Health (NIH) has instituted data sharing policies that require federally funded grantees doing protein research, Alzheimer’s research, autism research, etc., to make their data available in public access data collections. In many of these areas, the availability of this data has driven and often accelerated new discoveries. Policy can drive culture change, and in these instances it did. This kind of policy provides the critical “social and organizational infrastructure” needed to promote data sharing and exchange.
TK: When you spoke at Data Innovation Day 2014, you asked, “Who pays for research data?” Can you elaborate on this issue and give any updates? Has any progress been made toward answering the question?
FB: Arguably, the economics of developing, maintaining, and sustaining the infrastructure needed to host and serve up data is the Achilles’ heel of data-driven innovation. If data is “homeless” or if no one is paying the “mortgage” for data stewardship infrastructure, the data on which we increasingly depend will cease to exist. In spite of this, it remains challenging to maintain the investments needed to support stewardship and preservation infrastructure for much of the data created by the research community.
Many of us in the community are working hard to shed light on the need for continuous and stable investment in data stewardship and preservation. Without it, we cannot make research data publicly accessible, improve the reproducibility of data-driven research, support “big data” opportunities, or remain competitive in a data-driven age.
In my own efforts, this has been the focus of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access, of the op-ed that Vint Cerf and I wrote in Science in August 2013, and of a new project that I’m putting together.
TK: Some of your recent talks have focused on the “data infrastructure” necessary to promote data-driven innovation. Can you speak a little about the infrastructure metaphor, and why it’s useful in thinking about different aspects of data collection, storage, analysis, processing, etc.? What are the main bottlenecks or challenges you see to improving this data infrastructure?
FB: While the focus in the press is often on the “explorers” in the brave new world of data, we also need the equivalent of roads, bridges, water mains, etc. (i.e., infrastructure that encourages broad use and availability) to make the most of our data resources.
You can think of “data infrastructure” as composed of social and organizational components (policy, practice, and community standards), technical components (software, tools, and systems), and workforce components (data scientists and data-savvy professionals). All are needed to achieve the potential of data-driven innovation in all areas and across sectors, and all require the investment of resources. Finding sustainable funding for infrastructure is a real challenge, but without it, the data on which we depend is at tremendous risk.