The Center for Data Innovation spoke with Ben Klemens, a mathematical statistician and principal researcher at the U.S. Census Bureau. Klemens gave an in-depth look at his work producing better data and models to aid the bureau and those who use its data.
This interview has been lightly edited.
Travis Korte: What are some of the projects you’re currently working on with the Census?
Ben Klemens: I’m splitting my time between developing better models of populations and writing tools to disseminate those models. Some background: the Census Bureau runs over 200 sample surveys beyond the famous decennial census, and there will always be people who do not respond to a survey. There will also always be obvious errors, like a 90-year-old claiming to be a biological 1-year-old child. The missing data have to be imputed (substituted with estimated values), and the blatantly bad data have to be edited out.
From what I’ve seen, the most popular means of fixing missing data among naïve users is to just delete any missing observations (i.e., listwise deletion). In almost all cases, this creates biased answers. Census Bureau data sets are shipped to the public with bad data edited and imputed because we want to make it as easy as possible for users who aren’t going to spend time fixing data to get the right answer.
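To see why, here is a toy simulation in C (not Census Bureau code, and all numbers invented): when the chance of nonresponse rises with income, the mean computed only from the complete records lands well below the true mean.

```c
/* Toy illustration (not Census Bureau code): when missingness is correlated
 * with the value itself, dropping incomplete records (listwise deletion)
 * biases the estimate. Here, higher incomes are more likely to go missing. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    srand(12345);
    int n = 100000;
    double true_sum = 0, observed_sum = 0;
    int observed_n = 0;

    for (int i = 0; i < n; i++) {
        double income = 20000 + 80000 * (rand() / (double)RAND_MAX);
        true_sum += income;
        /* Probability of nonresponse rises with income (0 up to 0.5). */
        double p_missing = (income - 20000) / 160000.0;
        if (rand() / (double)RAND_MAX >= p_missing) {   /* respondent */
            observed_sum += income;
            observed_n++;
        }
    }
    printf("True mean income:          %.0f\n", true_sum / n);
    printf("Complete-case mean income: %.0f\n", observed_sum / observed_n);
    return 0;
}
```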
Also, the Census Bureau will never release individual responses to survey questions (it’s a federal offense, as per 13 USC §9 and §214), so we have finer in-house data than the public sees. That means that an in-house imputation will always be of better quality than the same imputation routine done by a user on public data.
My group has been developing a software package named “Tea” to do survey post-processing. It’s a C library with an R front-end. Much of the code is about providing an easy means of selecting new imputation methods and specifying consistency checks. Our hope is that it can replace a lot of ad hoc post-processing code both in the Census and out in the wild.
In-house, we’ve used it on a few applications, and we are trying to smooth out more kinks. We’ve been working with the folks at GovReady to put together a virtual machine image to make it easier to use the Tea package outside the Census Bureau’s computing environment.
There’s a cliché that if you’re not embarrassed by your code, you waited too long to make it public. We did not wait too long; it’s still very beta. But users are welcome to try Tea and submit improvements via pull requests and comments via email.
I’ve also been working on developing better models, specifically the Small Area Health Insurance Estimates project. Sample sizes at lower levels of geography are too small to give us thousands of respondents in every one of the roughly 480,000 county × age × sex × income level × race/ethnicity categories for which users would like us to report health insurance prevalence. So small area models “borrow strength” from neighboring areas and categories to produce a more accurate and reliable low-level estimate.
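As a rough sketch of what “borrowing strength” means, the C snippet below applies a Fay-Herriot-style shrinkage weight: a noisy county-level rate is pulled toward a more stable state-level prediction, with the pull determined by the county’s sampling variance relative to how much counties genuinely vary. The numbers are invented and this is not the SAHIE model itself.

```c
/* A simplified "borrowing strength" sketch (not the SAHIE model). */
#include <stdio.h>

/* Shrink a direct estimate toward a synthetic (broader-area) prediction.
 * The weight on the direct estimate falls as its sampling variance grows
 * relative to the between-county variance. */
double shrink(double direct, double sampling_var,
              double synthetic, double between_var) {
    double w = between_var / (between_var + sampling_var);
    return w * direct + (1 - w) * synthetic;
}

int main(void) {
    /* Hypothetical numbers: a county rate from ~40 respondents, a synthetic
     * state-level prediction, and an assumed between-county variance. */
    double county_rate  = 0.78;
    double sampling_var = 0.78 * 0.22 / 40;   /* p(1-p)/n: very noisy   */
    double state_rate   = 0.85;               /* synthetic prediction   */
    double between_var  = 0.03 * 0.03;        /* counties vary around it */

    printf("Direct county estimate: %.3f\n", county_rate);
    printf("Shrunk estimate:        %.3f\n",
           shrink(county_rate, sampling_var, state_rate, between_var));
    return 0;
}
```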
The Census Bureau is looking at ways to use information households have already provided to the government so that respondents don’t have to provide the same information to a Census Bureau enumerator, wherever possible. This would decrease the burden on respondents, significantly decrease the cost of conducting surveys, and maintain the Census Bureau’s commitment to quality—all while safeguarding the confidentiality of personal information.
My group is developing a model to improve the low-level estimates of health insurance coverage, using several available data sources. Whether you have insurance is closely related to where you are relative to the poverty line, so the health insurance model has a submodel to do small area poverty estimates.
TK: Can you list a couple of favorite things you can do with Census data that people might not already know about?
BK: The Census Bureau recently launched an API to a number of its sample surveys, so you can fold Census data into any programs you might be writing. There are already some nice examples, like this poverty map of New York built from county-level estimates.
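As a minimal sketch of folding Census data into a program, the C snippet below pulls one table from the API with libcurl. The dataset path and variable code are illustrative placeholders rather than a recommendation; check the API documentation for the exact endpoints, variables, and key requirements.

```c
/* A minimal sketch of pulling data from the Census API with libcurl.
 * The dataset path and variable code below are illustrative; consult the
 * current API documentation for exact endpoints, variables, and key rules.
 * Build with: cc fetch_census.c -lcurl */
#include <stdio.h>
#include <curl/curl.h>

int main(void) {
    /* County-level total population for New York State (illustrative). */
    const char *url = "https://api.census.gov/data/2019/acs/acs5"
                      "?get=NAME,B01003_001E&for=county:*&in=state:36";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    curl_easy_setopt(curl, CURLOPT_URL, url);
    /* libcurl's default write callback prints the JSON response to stdout. */
    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}
```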
Our API team is looking for feedback and is working to evolve the API into something that a good developer who knows nothing about sample survey design can easily get right answers from.
TK: What are some of the toughest methodological challenges facing the Census today, and what’s being done to overcome them?
BK: Maintaining relatively high response rates. People don’t answer the door or pick up the phone like they used to, which has affected all surveys of the U.S. public.
For the decennial census, a lot of money is spent sending out interviewers to drive to the door of people who didn’t mail back the questionnaire. In 2010, census enumerators had to visit—often multiple times—47 million addresses that didn’t mail back the census questionnaire. The cost associated with that exercise is prohibitive.
Existing government data can help us close this gap. Our research is examining the strengths and weaknesses of the data and how it might best be used. For example, some population groups are underrepresented in government data. Our goal with regard to imputation is to develop a model that extracts information from these records in a way that reduces bias.
If we are very confident about the imputations for an area, then our limited taxpayer dollars are better spent sending an interviewer to an area where the status is less certain. We’d like to be able to re-run these imputations every night as the nonresponse follow-up happens and adaptively design the next day’s interviewer schedules to minimize the expected error in the final count.
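A toy sketch of that adaptive step (not the production system, and the structure and numbers are hypothetical): if each area carries a nightly measure of imputation uncertainty, the next day’s limited visits go to the most uncertain areas first.

```c
/* Hypothetical sketch: allocate tomorrow's interviewer visits to the areas
 * whose imputations are least certain tonight. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int    tract_id;
    double imputation_uncertainty;  /* e.g., variance of the imputed count */
} Area;

static int by_uncertainty_desc(const void *a, const void *b) {
    double ua = ((const Area*)a)->imputation_uncertainty;
    double ub = ((const Area*)b)->imputation_uncertainty;
    return (ua < ub) - (ua > ub);   /* sort largest uncertainty first */
}

int main(void) {
    Area areas[] = {{101, 0.02}, {102, 0.31}, {103, 0.18}, {104, 0.07}};
    int n = sizeof areas / sizeof *areas;
    int visits_available = 2;

    qsort(areas, n, sizeof *areas, by_uncertainty_desc);
    printf("Tomorrow's interviewer assignments:\n");
    for (int i = 0; i < visits_available && i < n; i++)
        printf("  tract %d (uncertainty %.2f)\n",
               areas[i].tract_id, areas[i].imputation_uncertainty);
    return 0;
}
```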
TK: Tell me about Tea. What was the previous procedure for imputing missing data at the Census, and how does Tea improve upon it? What are some cases when it has been used?
BK: The incumbent method is, at its core, called “hot deck.” This is a reference to the hot deck of punched cards that came out of the computing machines of the 1960s. It effectively replaces a non-responding household with a neighbor’s record, although the methods have developed over the decades to adjust for anomalies (e.g., if two houses in a row are missing, you wouldn’t insert the same value in both).
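A stripped-down sequential hot deck might look like the C sketch below. The real procedure has far more elaborate donor-selection rules; the sketch only shows the core idea of filling a gap from a recent respondent while rotating donors so that consecutive missing records do not all receive the identical value. The data and the donor-pool size are invented.

```c
/* A toy sequential hot deck (far simpler than production rules): fill each
 * missing income with a recently observed value, rotating through a small
 * pool of recent donors so consecutive gaps get different values. */
#include <stdio.h>

#define NDONORS 3
#define MISSING -1.0

int main(void) {
    double income[] = {52000, 61000, MISSING, 48000, MISSING, MISSING, 75000};
    int n = sizeof income / sizeof *income;

    /* Pool of the most recent observed values, newest first. */
    double donors[NDONORS] = {50000, 50000, 50000};  /* fallback seeds */
    int use = 0;   /* which donor the next missing record should take */

    for (int i = 0; i < n; i++) {
        if (income[i] == MISSING) {
            income[i] = donors[use];        /* borrow a recent respondent  */
            use = (use + 1) % NDONORS;      /* next gap gets another donor */
            printf("record %d imputed as %.0f\n", i, income[i]);
        } else {
            /* Shift the pool and put the new response at the front. */
            for (int k = NDONORS - 1; k > 0; k--) donors[k] = donors[k - 1];
            donors[0] = income[i];
            use = 0;                        /* restart at the newest donor */
        }
    }
    return 0;
}
```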
We’re working on new methods to modify the hot deck flowchart for administrative records. The method we’re working on is basically an expectation-maximization (EM) algorithm that folds partial information into a multinomial distribution. It lets us use the information from records where there is IRS data but no census form, IRS data and a census form, Medicaid records and an IRS return but no census form, and all the other combinations, to build a single model of households for imputing missing data points.
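The following is a bare-bones sketch of that kind of EM step for a four-cell multinomial, not the production method. Records that report both traits count directly; records that report only one trait are spread across the compatible cells in proportion to the current probability estimates; then the probabilities are re-estimated from the completed counts. The counts are invented.

```c
/* Stripped-down EM for a multinomial with partially observed records.
 * Two binary traits define four cells; some records report both traits,
 * some report only the first, some only the second. */
#include <stdio.h>
#include <math.h>

int main(void) {
    /* Fully classified counts: both traits observed. */
    double full[2][2] = {{30, 10}, {15, 45}};
    /* Partially classified counts: only one trait observed. */
    double row_only[2] = {20, 30};
    double col_only[2] = {12, 18};

    double p[2][2] = {{0.25, 0.25}, {0.25, 0.25}};   /* starting guess */
    double total = 0;
    for (int i = 0; i < 2; i++) for (int j = 0; j < 2; j++) total += full[i][j];
    for (int i = 0; i < 2; i++) total += row_only[i] + col_only[i];

    for (int iter = 0; iter < 200; iter++) {
        double expected[2][2], change = 0;
        /* E-step: spread each partial count over its compatible cells in
         * proportion to the current probability estimates. */
        for (int i = 0; i < 2; i++) for (int j = 0; j < 2; j++) {
            double row_p = p[i][0] + p[i][1];
            double col_p = p[0][j] + p[1][j];
            expected[i][j] = full[i][j]
                           + row_only[i] * p[i][j] / row_p
                           + col_only[j] * p[i][j] / col_p;
        }
        /* M-step: cell probabilities are completed counts over the total. */
        for (int i = 0; i < 2; i++) for (int j = 0; j < 2; j++) {
            double new_p = expected[i][j] / total;
            change += fabs(new_p - p[i][j]);
            p[i][j] = new_p;
        }
        if (change < 1e-10) break;
    }
    printf("Estimated cell probabilities:\n%.4f  %.4f\n%.4f  %.4f\n",
           p[0][0], p[0][1], p[1][0], p[1][1]);
    return 0;
}
```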
We used a version of Tea for disclosure avoidance in group quarters (dorms, nursing homes, prisons, etc.) for the 2010 Census and recent releases of the American Community Survey. We check for individuals who can potentially be identified via a set of cross-tabulations and blank out part of their information. Of course, once you’ve blanked out a datum, you have to impute a new value. And once you’ve imputed a new value, you have to check that you haven’t created any married 14-year-olds. That pipeline used to be a long ordeal over several scripts run by many people, and with Tea we were able to do it all in one place.
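In miniature, and purely hypothetically, that blank-impute-recheck loop looks something like the C sketch below: flag anyone who is unique in a small cross-tabulation, re-impute the identifying value from a donor, and then re-apply an edit rule so the imputation itself does not create an impossible record. Real disclosure-avoidance rules are far more involved than this.

```c
/* A miniature, hypothetical version of the blank-impute-recheck loop.
 * Anyone unique in an age-by-sex cross-tabulation gets their age blanked
 * and re-imputed from another record of the same sex, then an edit rule is
 * re-applied so the imputation cannot create an impossible record. */
#include <stdio.h>

#define N 8
typedef struct { int age, sex, married; } Person;

int main(void) {
    Person p[N] = {
        {34,0,1},{34,0,0},{35,0,1},{35,0,0},
        {36,1,1},{36,1,0},{36,1,1},{82,1,1}   /* the 82-year-old is unique */
    };

    for (int i = 0; i < N; i++) {
        /* Cross-tabulate age x sex; a record alone in its cell is at risk. */
        int count = 0;
        for (int j = 0; j < N; j++)
            if (p[j].age == p[i].age && p[j].sex == p[i].sex) count++;
        if (count > 1) continue;

        /* Blank the age and impute one from a same-sex donor (a stand-in
         * for a real hot-deck draw). */
        for (int j = 0; j < N; j++)
            if (j != i && p[j].sex == p[i].sex) { p[i].age = p[j].age; break; }

        /* Re-check an edit rule so we have not created a married child. */
        if (p[i].married && p[i].age < 15)
            p[i].age = 18;   /* placeholder fix; real code would redraw */

        printf("record %d re-imputed to age %d\n", i, p[i].age);
    }
    return 0;
}
```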
TK: What are some specific skills or areas of statistical knowledge you consider most essential to working with official statistics at a high level?
BK: Computing is not something we can assume away, because off-the-shelf systems often don’t scale to 320 million observations. There’s a method of adjusting a table of data gathered via one means to a set of fixed totals gathered via another (like adjusting survey results to match the Census), known as iterative proportional fitting or raking. It was developed by W. Edwards Deming at the Census Bureau in the 1940s, and the implementation everybody uses is a FORTRAN algorithm from 1972. If you look under the hood at R’s ipf() function, for example, you’ll find that same code from 1972. So I spent some time developing an algorithm in C that can rake a sparse table of 2 billion cells (see the apop_rake function). It isn’t mathematically novel, but suddenly we could use a textbook method in situations that had previously been closed off to us.
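For readers who have not met raking before, here is a bare-bones dense two-dimensional version in C, just to show the mechanics: rows and columns are alternately rescaled until the table margins match the fixed targets. The sparse, high-dimensional case that apop_rake handles requires a different data structure, which this toy does not attempt.

```c
/* Bare-bones iterative proportional fitting (raking) on a small dense table:
 * alternately rescale rows and columns until the margins hit fixed targets. */
#include <stdio.h>
#include <math.h>

#define R 2
#define C 3

int main(void) {
    double table[R][C] = {{10, 20, 30}, {40, 50, 60}};  /* survey counts    */
    double row_target[R] = {120, 180};                  /* known row totals */
    double col_target[C] = {60, 110, 130};              /* known col totals */

    for (int iter = 0; iter < 100; iter++) {
        /* Scale each row to hit its target total. */
        for (int i = 0; i < R; i++) {
            double s = 0;
            for (int j = 0; j < C; j++) s += table[i][j];
            for (int j = 0; j < C; j++) table[i][j] *= row_target[i] / s;
        }
        /* Scale each column to hit its target total; track the misfit. */
        double maxerr = 0;
        for (int j = 0; j < C; j++) {
            double s = 0;
            for (int i = 0; i < R; i++) s += table[i][j];
            for (int i = 0; i < R; i++) table[i][j] *= col_target[j] / s;
            maxerr = fmax(maxerr, fabs(s - col_target[j]));
        }
        if (maxerr < 1e-9) break;
    }
    for (int i = 0; i < R; i++)
        printf("%8.3f %8.3f %8.3f\n", table[i][0], table[i][1], table[i][2]);
    return 0;
}
```

Each pass preserves the table’s internal interaction structure (its odds ratios) while forcing the margins to the known totals, which is why raking is a natural fit for adjusting survey tables to census control totals.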
I’ve mentioned the problem of fitting a sparse distribution over many dimensions a few times now. Once the distribution is fit using all our data, how can we help users visualize fifty dimensions on paper or a flat-panel monitor?
The small area problem gets bigger and bigger. For those readers who enjoy watching the Bayesian-Frequentist debate, this is a hot spot, because sample survey design has a Frequentist tradition and the Bayesian hierarchical models commonly used for small area estimation are obviously Bayesian. How you marry those without too much duct tape is a very open question.
I think I got through this entire interview without once mentioning a generalized linear model (GLM) like Ordinary Least Squares or Probit. People used to see GLMs as synonymous with statistics for social science, but the space of models is infinitely larger, and I’m always happy to see examples like the health insurance model that go beyond GLMs (even if they require novel computing methods to fit them to data).
Notice also that the use of computing power here is a little different from what you see in the typical Machine Learning (ML) textbook. ML methods are often about using computing power to bring more data to simpler models, while our data is currently capped around 320 million records and we are using computing power to fit more detailed models where each step is more computationally intensive.