Last Friday, a group of New York data scientists convened DataGotham 2013, billed as a “celebration of the NYC data community,” at the New York Academy of Medicine on the Upper East Side. The event, now in its second year, brought together data-driven thinkers from an impressively broad range of fields, carrying an almost equally wide array of titles.
The guests included other data scientists, of course, but also self-described “data engineers,” “data managers,” “data janitors” and at least one “data superhero.”
The titles attested to the growing diversity of people, working in ostensibly unrelated fields and coming from disparate backgrounds, who count themselves members of the data community. The speakers presented on topics that included storm prediction, medical information visualization, emergency response optimization, government procurement systems, civic hacking, online dating systems, healthcare operations and many others.
But for all their different approaches, the data enthusiasts spoke a common language, and common themes emerged from the proceedings. One was the importance of perseverance in the face of old, messy, pre-“data science” data. Lauren Talbot, former Chief Programmer of the New York City Mayor’s Office of Data Analytics, raised this point during her talk on a project to decrease response times to 911 calls. She described struggling to make the city’s antiquated 911 response database of PDFs and physical documents interoperable with other data, then recalled the “aha” moment of locating an inconspicuous value on each document that could serve as a unique identifier and link the documents to other data sources. As a result, Talbot and her team could conduct rigorous analysis of data that otherwise could not even have been aggregated.
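For readers curious what linking on a shared identifier looks like in practice, here is a minimal Python sketch. It is not Talbot's actual code: the talk, as reported here, did not say which value served as the identifier, so the field name `incident_id` and the sample records are hypothetical.

```python
from typing import Dict, List


def link_records(documents: List[Dict], other_source: List[Dict],
                 key: str = "incident_id") -> List[Dict]:
    """Join two lists of records on a shared identifier field.

    `key` is a hypothetical field name; any value that appears on every
    document and in the other data source would play the same role.
    """
    # Index the second source by identifier for constant-time lookups.
    index = {row[key]: row for row in other_source if key in row}
    linked = []
    for doc in documents:
        match = index.get(doc.get(key))
        if match is not None:
            # Merge the two records; fields from the document win on conflict.
            linked.append({**match, **doc})
    return linked


# Hypothetical sample records, for illustration only.
scanned_docs = [
    {"incident_id": "A17", "call_received": "08:01"},
    {"incident_id": "B22", "call_received": "08:05"},
]
dispatch_log = [
    {"incident_id": "A17", "unit_arrived": "08:09"},
]
linked = link_records(scanned_docs, dispatch_log)
```

Only records whose identifier appears in both sources are kept, which is why a reliable identifier matters: without one, the two data sets cannot be combined at all.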
A similar topic came up in data visualization designer Larry Buchanan’s talk. Buchanan, who makes visualizations for the New Yorker, told a personal story about the confusing presentation of medical information surrounding his wife’s successful battle with cancer. At the end of the talk, he showed alternative visualizations that made this information much more easily understood and suggested that straightforward data communication could make difficult life events like cancer treatment a little easier.
Another theme that emerged across fields was that good data scientists do not always need to have exactly the data they want in order to solve a problem. With a little ingenuity, the target information can be obtained from data that was originally collected for other purposes.
Huseyin Oktay, a Ph.D. candidate in the School of Computer Science at UMass Amherst, made the point in astonishing fashion even though he spoke for only five minutes. He presented some of his research on predicting Twitter users’ ages using several decades of first name frequency data: “If your name is Ashley, you are probably less than 40 years old… But if your name is Deborah, you are probably more than 50 years old.” Not only did the sophisticated method largely replicate a major survey on Twitter demographics, it also managed to compensate for a flaw in the survey.
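The basic intuition behind name-based age estimation can be sketched in a few lines of Python. This is a simplified illustration, not Oktay's method: the counts below are invented for the example rather than taken from real name frequency data, and the sketch ignores mortality and who actually uses Twitter, both of which a serious analysis would need to consider.

```python
from typing import Dict

# Hypothetical counts of babies given each name, by birth decade.
# These numbers are made up for illustration and are not real data.
NAME_COUNTS: Dict[str, Dict[int, int]] = {
    "Ashley": {1950: 10, 1960: 50, 1970: 500, 1980: 3500, 1990: 3000},
    "Deborah": {1950: 4000, 1960: 2500, 1970: 600, 1980: 150, 1990: 50},
}


def birth_decade_distribution(name: str) -> Dict[int, float]:
    """Turn raw counts into a probability distribution over birth decades."""
    counts = NAME_COUNTS[name]
    total = sum(counts.values())
    return {decade: n / total for decade, n in counts.items()}


def prob_younger_than(name: str, age: int, current_year: int = 2013) -> float:
    """Estimate the probability that a person with this name is under `age`.

    Each decade is represented by its midpoint year (decade + 5).
    """
    dist = birth_decade_distribution(name)
    return sum(p for decade, p in dist.items()
               if current_year - (decade + 5) < age)


ashley_under_40 = prob_younger_than("Ashley", 40)
deborah_under_40 = prob_younger_than("Deborah", 40)
```

With these invented counts, an Ashley is very likely to be under 40 and a Deborah is not, which mirrors the pattern Oktay described.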
Dan Chapsky, a data scientist who works on the social dating site AYI.com, stressed that these kinds of proxies need not even be the same across segments of the population being studied. In an analysis of dating behaviors among different demographics, Chapsky found that shared Facebook interests serve as a proxy for potential romantic matches among older people, while having mutual friends is a much better predictor of connection among younger people.
Regardless of where they work and what they call themselves, data scientists of all stripes have methods and technologies in common. If DataGotham 2013 succeeded, it was not just because the conference brought all these people together; it was because the talks managed to highlight the commonality and opportunities for collaboration in what can rightly be called a data community.
Thanks, Hadley Wickham, for the title suggestion.