10 Bits: The Data News Hot List
This week’s list of data news highlights covers July 20-26 and includes articles on the White House’s recognition of open government and civic hacking leaders and a novel approach to data mining in astronomy.
1. White House names Champions of Change for open government and civic hacking
The White House honored fourteen individuals who have contributed to open government and civic hacking projects across the country. Among the honorees were developers, policy experts and facilitators, who worked on a wide range of projects from fight mediation in New Orleans to a community information website in Oakland. Several honorees are affiliated with Code for America, which was a leading organizer of the National Day of Civic Hacking in June.
The Cuckoo’s Calling, a novel by one “Robert Galbraith,” was recently determined to have actually been written by Harry Potter author J.K. Rowling. Analytics played a major role in this revelation, with a spate of text mining techniques and mathematical similarity measures being used to confirm that the two books were stylistically and lexically similar enough to have likely been written by a single author.
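To illustrate what a "mathematical similarity measure" between two texts can look like, here is a minimal sketch in Python. It is not the method the analysts used in the Rowling case; it assumes a deliberately tiny, hypothetical list of function words (`FUNCTION_WORDS`) and compares their relative frequencies with cosine similarity, one common measure among many. The names `profile` and `cosine_similarity` are invented for this example.

```python
import math
import re
from collections import Counter

# Hypothetical, abbreviated feature list chosen for illustration only;
# real stylometric studies rely on much larger feature sets.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "was", "but"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Calling `cosine_similarity(profile(text_a), profile(text_b))` yields a score that approaches 1.0 as the two texts use these words in more similar proportions. A high score on its own would not establish authorship; it would be one signal to weigh alongside others.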
Data analytics has infiltrated a wide range of fields, and the adult entertainment industry is no exception. Even aside from everyday web analytics and business intelligence analysis, one researcher has used the Internet Adult Film Database’s data on 7,000 actresses to illustrate the “average” porn star and lay to rest the myth that most actresses exit the industry after a single film. Some in the industry are skeptical about the value of data in optimizing certain aspects of their product, with one executive saying that the factors that make films popular have “nothing to do with averages.”
The OECD’s educational data collection effort, known as the Program for International Student Assessment (PISA), launched a new initiative this year to involve schools in advanced analytics at a local level. “MyPISA” is currently being deployed in Virginia’s Fairfax County, where the OECD will help ten schools conduct data analyses and situate their performance among other schools around the world.
The National Institutes of Health has pledged over $96 million in funding toward its Big Data to Knowledge initiative, which aims to enable the biomedical research community to better utilize the growing volume of biomedical data. Data integration and interoperability are a focus of the initiative, as much of the world’s biomedical and health data is still siloed within individual organizations. Standards are important to the initiative as well, to enable novel uses and repurposing of data.
Travel rental company Airbnb released a technical explanation of its ranking of America’s most hospitable cities. The “hospitality index” it used accounts for parameters like cleanliness, communication, value and listing accuracy (all mined from the company’s review database). Moreover, the Airbnb team identified certain host factors that predicted better reviews; older hosts tend to be more hospitable, and female hosts get higher ratings than male hosts, for example.
Two separate research projects from Carnegie Mellon University looked at data from a drawing app to compare different people’s drawings of the same subjects. The researchers modeled patterns of inaccuracy and proposed methods for automatically correcting such inaccuracies in future drawing apps. Some commentators are not convinced that artistic ability should be treated as data and optimized, but some potential applications, such as art education software, may remain viable.
Ubiquitous sensors, GPS data and traffic data are poised to combine into a highly valuable resource for the automotive industry, particularly with the emergence of self-driving cars. Google’s prototype in this space generates nearly one gigabyte of data per second. Beyond its immediate objective of modeling a quick, safe path from point A to point B, that data will be widely useful for car companies predicting demand by region and prioritizing car features based on user satisfaction. Self-driving cars won’t be the only ones to benefit from new data sources: simply collecting granular data on driving habits could enable machine learning algorithms to model drivers’ “styles” and then detect potential car theft when these styles abruptly change. Driver-level data could even be used to deliver recommendations for improving driving performance directly to a car’s dashboard.
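As a rough illustration of how an abrupt change in driving "style" might be flagged, the sketch below swaps in a simple z-score check for the machine learning models the article alludes to. It assumes a single hypothetical per-trip measurement (for example, hard-braking events per mile) and an arbitrary threshold of three standard deviations; the function name `style_changed` is invented for this example.

```python
from statistics import mean, stdev


def style_changed(baseline, recent, threshold=3.0):
    """Return True if `recent` deviates sharply from the driver's usual values.

    baseline:  past per-trip values of one hypothetical driving metric
               (at least two values are required to compute a spread)
    recent:    the same metric for the newest trip
    threshold: assumed cutoff, in standard deviations
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # No variation in the history: any difference counts as a change.
        return recent != mu
    return abs(recent - mu) / sigma > threshold
```

With a history such as `[0.9, 1.0, 1.1, 1.0, 0.95, 1.05]`, a new trip scoring 1.02 would pass unnoticed, while a trip scoring 2.5 would be flagged. A production system would presumably combine many such signals rather than rely on one metric.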
New software may help astronomers detect faint-but-numerous stars known as red dwarfs by mining data from existing stellar images. The software, which is used to automatically populate a star catalog known as Superblink, compares images of the same stars taken at small intervals, and automatically detects subtle motion that may be indicative of a hidden star.
Bright, a company that collects resumes and uses data analytics to match companies with job applicants, released an analysis of the tech job market that found that the supposed shortage of American applicants for certain job functions has been exaggerated. From a database of 3 million resumes, Bright determined that while some positions, like computer systems analyst, faced a shortage of American applicants, positions for highly skilled computer programmers met with more than enough applicants from the United States. The study’s results will likely need to be replicated using other metrics besides the proprietary ones used by Bright, but such analyses will have an important role to play in the ongoing debates over guest worker visa policies.