Tools for statistical writing and reproducible research

This is a methods tutorial on dynamic documents for statistical writers. It will be useful if you write expository prose that makes arguments using statistical analyses of empirical data. It also ties to a bigger topic of interest to everyone who does or uses science: reproducible research, which I’ll return to at the end of the post.

Here is a standard setup for statistical writing. Your prose is in one file (say, a Word file). The statistical code that generates the statistics is in a second file (a .R file if you write in R). And the data that the statistical code processes are in a third file (perhaps a .csv file, or in a database).

Having these separate files is an excellent thing. But it creates a lot of work, because to present a statistic in your prose, you need to extract it from the output that is produced when you run your statistical code. And the way most people extract a statistic from their output is either by retyping it or by cutting and pasting the numerical result into their prose file. This creates problems.

In any serious project, the results of statistical analyses change frequently. One reason is that the source data often evolve as new observations are collected or errors are discovered and corrected. These changes in the data affect your statistics, often in ways you may not anticipate. So you need to check and possibly update all the values in your prose. This is tedious and time-consuming, and because it is hand labor, it is prone to error.

There is a great solution to this problem. You can write your prose file as a source file with embedded statistical programming code. This creates a dynamic document, meaning that it changes in an automated way when you make changes in the data or your statistical code.

R has a fantastic free programming interface called RStudio that makes dynamic documents very easy. The key is a nifty R package called knitr. You write and format your prose in R Markdown, a dialect of the markdown text formatting language. For example, here is a sentence from a paper on the cost of health care:

The average per member per month cost was $`r mean(PMPM)`.

There is embedded statistical code at the end of the sentence. I’ll explain it from the inside out. PMPM refers to a vector of data (i.e., a variable) on per member per month costs in dollars. Somewhere else in your R Markdown file, you will have embedded code that reads in your data, including the values for PMPM. Next, mean( ) is an R function that calculates an average. Finally, the `r ` delimiters identify what they enclose as a statement in R.

So having written the code for your statistics right into your text, all you need to do is press a button in RStudio and knitr will process your R Markdown file into a Word, HTML, or LaTeX file. Let’s say that the average cost of care is $179.35 per member per month. When the above sentence is evaluated, the result is a formatted file of prose with this sentence:

The average per member per month cost was $179.35.
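To make the pieces concrete, here is a minimal sketch of the R code that might sit in a code chunk earlier in the same R Markdown file. The data values, the `claims` data frame, and the `pmpm_cost` column name are all made up for illustration; in a real project you would read your own file. The `sprintf` step is my assumption about how you might fix the display at two decimal places, not something the sentence above requires.

```r
# Illustrative data only: in practice you would read your own file,
# e.g. with read.csv(). The column name pmpm_cost is hypothetical.
claims <- read.csv(text = "member_id,pmpm_cost
1,150.10
2,201.45
3,186.50")

# The vector that the inline code refers to
PMPM <- claims$pmpm_cost

# Average cost as text with exactly two decimal places, ready for prose
pmpm_mean_text <- sprintf("%.2f", mean(PMPM))
print(pmpm_mean_text)
```

With a chunk like this in place, the inline code in the sentence could refer to `pmpm_mean_text` instead of calling mean( ) directly, so the formatting is decided in one place.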

R Markdown has easy syntax for headers, italics, and most of the other things you need for word processing (using reference databases is still a bit of a challenge). Of course, if your prose only has one average in it, a dynamic document is overkill. But your paper will more likely quote many statistics. If all the statistics in your paper are done this way, then you can instantly update the entire paper any time you change a line of your code or an entry in your data file. No more copy and paste. When you get skilled at this, you will save a huge amount of labor through automation.

What if you don’t use R? To my knowledge, there is nothing like this for SAS or SPSS. There seems to be something in progress for Stata users (and see this great book by Scott Long about carrying out reproducible analyses in Stata). I hear good things about IPython notebooks; if you know more, write me.

Embedding statistical code in your prose source file will save you time and protect you from errors. But it also advances the cause of reproducible research. There is widespread concern that science is suffering a crisis in replicability. We want science to be objective, and this means that if two skilled scientists attempt the same empirical study, they ought to get the same results. Too often, published research fails to meet that criterion.

The replicability problem in statistical or econometric research is that two skilled analysts, starting from (what is purported to be) the same data set, come up with different answers (the famous Rogoff and Reinhart controversy is reported here). Actually, coming back to a project after some months’ interruption, I have found it impossible to replicate my own analyses.

It’s easy to understand why it is hard to replicate data analyses, for all the reasons described above. Data change and analyses have many steps. For years, I told myself to do a better job of documenting my code. But this was extremely challenging; I know very few people with the time or discipline to do it effectively.

But when you create a dynamic document with embedded code, your prose documents itself. That is, the embedded code creates a clear trail from a statistic quoted in the text back to the code, and from the code back to a specific data set. (You must, however, have a set of practices that preserves a copy of that data set in the state it was in when the research was published.)
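One modest way to support that last practice, offered as a sketch and not a full archiving system, is to record a checksum of the data file when the paper is finalized and to compare against it before re-running the analysis. The file below is a temporary stand-in created just for the example, and the variable names are hypothetical; `tools::md5sum` ships with R.

```r
# Stand-in data file for the example; replace with the path to your own data
data_file <- tempfile(fileext = ".csv")
writeLines(c("member_id,pmpm_cost", "1,150.10", "2,201.45"), data_file)

# At publication time: record the checksum and store it with the paper's source
published_md5 <- unname(tools::md5sum(data_file))

# Later, before re-running the analysis: confirm the file has not changed
current_md5 <- unname(tools::md5sum(data_file))
data_unchanged <- identical(current_md5, published_md5)
print(data_unchanged)
```

A matching checksum only tells you the file is byte-for-byte the same; you still need to keep the archived copy itself somewhere safe.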

In theory, this means that my work will be completely reproducible if I share my dynamic document and the underlying data file. In practice, I can only share my work publicly to a very limited degree. Almost all of my research involves data on potentially identifiable patients, which can’t be shared. But I can, at least, show you the exact code that generated each statistic.

@Bill_Gardner

For a previous TIE post on reproducible research, see here. For a journal policy requiring documentation of statistical analyses to advance reproducible research, see here.
