• Study Causes Splash, but Here’s Why You Should Stay Calm on Alcohol’s Risks

    The following originally appeared on The Upshot (copyright 2018, The New York Times Company).

    Last week a paper was published in The Lancet that claimed to be the definitive study on the benefits and dangers of drinking. The news was apparently not good for those who enjoy alcoholic beverages. It was covered in the news media with headlines like “There’s No Safe Amount of Alcohol.”

    The truth is much less newsy and much more measured.

    Limitations of Study Design

    It’s important to note that this study, like most major studies of alcohol, wasn’t a new trial. It was a meta-analysis, or a merging of data, from many observational studies. It was probably the largest meta-analysis ever done to estimate the risks from drinking for 23 different alcohol-related health problems.

    The researchers also combined almost 700 sources to estimate the most accurate levels of alcohol consumption worldwide, even trying to find drinking that might otherwise be missed (from tourism, for instance). They then combined all this data into mathematical models to predict the harm from alcohol worldwide.

    They found that, over all, harms increased with each additional drink per day, and that the overall harms were lowest at zero. That’s how you get the headlines.

    But, and this is a big but, there are limitations here that warrant consideration. Observational data can be very confounded, meaning that unmeasured factors might be the actual cause of the harm. Perhaps people who drink also smoke tobacco. Perhaps people who drink are also poorer. Perhaps there are genetic differences, health differences or other factors that might be the real cause. There are techniques to analyze observational data in a more causal fashion, but none of them could be used here, because this analysis aggregated past studies — and those studies didn’t use them.

    We don’t know if confounders are coming into play because this meta-analysis could only really control, over all, for age, sex and location. That’s not the researchers’ fault. That’s probably all they could do with the data they had, and they could still model population-level effects without them.

    But when we compile observational study on top of observational study, we become more likely to achieve statistical significance without improving clinical significance. In other words, very small differences may be statistically real, but that doesn’t mean those differences are clinically meaningful.

    Interpreting the Results

    The news warns that even one drink per day carries a risk. But how great is that risk?

    For each set of 100,000 people who have one drink a day per year, 918 can expect to experience one of the 23 alcohol-related problems in any year. Of those who drink nothing, 914 can expect to experience a problem. This means that 99,082 are unaffected, and 914 will have an issue no matter what. Only 4 in 100,000 people who consume a drink a day may have a problem caused by the drinking, according to this study.

    At two drinks per day, the number experiencing a problem increased to 977. Even at five drinks per day, which most agree is too much, the vast majority of people are unaffected.
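    That arithmetic is worth making explicit. A quick check, in Python, of the absolute and relative risks implied by the study’s headline numbers (918 vs. 914 affected per 100,000):

```python
# Risk arithmetic from the study's headline numbers: 918 of 100,000
# one-drink-a-day drinkers vs. 914 of 100,000 abstainers experience
# an alcohol-related problem in a year.
drinkers_affected = 918
abstainers_affected = 914
per = 100_000

absolute_risk_increase = (drinkers_affected - abstainers_affected) / per
relative_risk = drinkers_affected / abstainers_affected

print(f"Absolute risk increase: {absolute_risk_increase:.5f}")  # 0.00004 (4 per 100,000)
print(f"Relative risk: {relative_risk:.4f}")                    # 1.0044 (~0.4% relative bump)
print(f"Number needed to harm: {round(1 / absolute_risk_increase):,}")  # 25,000
```

    Put another way, by this study’s own estimates roughly 25,000 people would need to have a daily drink for a year for one additional person to experience a problem.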

     
  • Preventing hospitalizations from nursing homes is harder than it looks

    I applaud the publication of negative studies. From JAMA Internal Medicine, “Effects of an Intervention to Reduce Hospitalizations From Nursing Homes: A Randomized Implementation Trial of the INTERACT Program“:

    Importance: Medicare payment initiatives are spurring efforts to reduce potentially avoidable hospitalizations.

    Objective: To determine whether training and support for implementation of a nursing home (NH) quality improvement program (Interventions to Reduce Acute Care Transfers [INTERACT]) reduced hospital admissions and emergency department (ED) visits.

    Design, Setting, and Participants: This analysis compared changes in hospitalization and ED visit rates between the preintervention and postintervention periods for NHs randomly assigned to receive training and implementation support on INTERACT to changes in control NHs. The analysis focused on 85 NHs (36 717 NH residents) that reported no use of INTERACT during the preintervention period.

    Interventions: The study team provided training and support for implementing INTERACT, which included tools that help NH staff identify and evaluate acute changes in NH resident condition and document communication between physicians; care paths to avoid hospitalization when safe and feasible; and advance care planning and quality improvement tools.

    Main Outcomes and Measures: All-cause hospitalizations, hospitalizations considered potentially avoidable, 30-day hospital readmissions, and ED visits without admission. All-cause hospitalization rates were calculated for all resident-days, high-risk days (0-30 days after NH admission), and lower-risk days (≥31 days after NH admission).

    We’d like to reduce hospitalizations from people who live in nursing homes, by keeping them from getting sick or hurt. The INTERACT program was designed to do just that. It supported and trained nursing home workers in identifying and evaluating issues in residents in nursing homes, communicating with doctors, and implementing quality improvement. Nursing homes were randomized to this or usual care. The main outcome of interest was hospitalizations, avoidable hospitalizations, readmissions, and ED visits.

    Eighty-five nursing homes with 281 752 person-months were included in the analysis. There was no significant change in the number of hospitalizations in the intervention group versus the control group. There was no significant change in readmissions, ED visits, or any of the sub-analyses of hospitalizations. There was a small, but statistically significant reduction in avoidable hospitalizations, but once they applied a Bonferroni correction, it was no longer significant.

    Let’s talk about that for a second. When you run a lot of statistical tests on a lot of potential outcomes, you increase the chance that something will be “significant” by chance alone. When that’s the case, it’s good practice to apply a correction to account for it. Since they had six outcomes, the Bonferroni correction reduced the p-value threshold from 0.05 to 0.008. The result (at p=0.01) was no longer significant. Good for them. For more information on significance and p-values, I encourage you to watch our Healthcare Triage episodes on the subject here and here.
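    The correction itself is one line of arithmetic, sketched here in Python:

```python
# Bonferroni correction: divide the significance threshold by the
# number of tests performed (six outcomes in this trial).
alpha = 0.05
n_tests = 6

corrected_threshold = alpha / n_tests
print(f"Corrected threshold: {corrected_threshold:.3f}")  # 0.008

# The avoidable-hospitalizations result no longer clears the bar:
p_observed = 0.01
print("Significant after correction:", p_observed < corrected_threshold)  # False
```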

    We all want to improve the care in nursing homes and prevent hospitalizations and ED visits, if possible. This trial showed a pretty big intervention didn’t work. Let’s acknowledge that and try to do better, instead of doing the same things over and over again. It may take more investment. Sometimes good things cost money.

    @aaronecarroll

     
  • You’re probably thinking of p values all wrong (p < 0.05)

    I’m guilty of getting this wrong, and it is likely you are too:

    If the p-value is <.05, then the probability of falsely rejecting the null hypothesis is <5%, right? […]

    [According to Oakes (1986)] 86% of all professors and lecturers in the sample who were teaching statistics (!) answered [a similar] question erroneously. […] Gigerenzer, Kraus, and Vitouch replicated this result in 2000 in a German sample (here, the “statistics lecturer” category had 73% wrong). [Links not in the original, oddly. I added them. They could be wrong.]

    Felix Schönbrodt explained how to answer the question:

    [W]e are interested in a conditional probability Prob(effect is real | p-value is significant). Inspired by Colquhoun (2014) one can visualize this conditional probability [for a specific example, though easily generalized] in the form of a tree-diagram (see below).

    [Figure: tree diagram of the positive predictive value (PPV_tree)]

    Now we can compute the false discovery rate (FDR): 35 of (35+105) = 140 significant p-values actually come from a null effect. That means, 35/140 = 25% of all significant p-values do not indicate a real effect! That is much more than the alleged 5% level (see also Lakens & Evers, 2014, and Ioannidis, 2005).

    It’s obvious when you see it presented this way, and yet how often do you get this wrong? How much do you fixate on p < 0.05 as a very strong indication that rejecting the null is warranted (with at least 95% accuracy)? Be honest. It’s the only way to get smarter.

    Caveat: In health research it’s more typical, to the extent one is able, to design studies with 80% power. For the example above, 80% power translates into an FDR of 12.7%, not 25%. That’s still way above 5%, though. A lot of results taken to be “true” are actually not.
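    Both FDR figures can be reproduced from the tree’s logic. The inputs below (1,000 studies, 30% of them testing real effects, tested at alpha = 0.05) are inferred from the quoted counts of 35 false and 105 true positives; they are not stated explicitly above:

```python
def false_discovery_rate(n_studies, prevalence, alpha, power):
    """Share of significant results that actually come from null effects."""
    n_real = n_studies * prevalence
    n_null = n_studies * (1 - prevalence)
    false_positives = alpha * n_null   # nulls crossing p < alpha by chance
    true_positives = power * n_real    # real effects the test detects
    return false_positives / (false_positives + true_positives)

# 35% power reproduces the tree: 35 / (35 + 105) significant results are false.
print(round(false_discovery_rate(1000, 0.30, 0.05, 0.35), 3))  # 0.25
# At the more typical 80% power:
print(round(false_discovery_rate(1000, 0.30, 0.05, 0.80), 3))  # 0.127
```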

    More here.

    @afrakt

     
  • Fixed vs. random effects. (Or, how to have fun at a party.)

    Here’s a surefire way to have fun at a party: invite an econometrician and a biostatistician. After a few drinks have been served, casually drop, “I could never decide whether fixed or random effects are better. Thoughts?” Make sure you have plenty of popcorn.

    OK, listening to the econometrician rant about the bias of random effects (RE) and the biostatistician fret about the inefficiency of fixed effects (FE) might not be your kind of fun. I mean, it’s not “death panels” debate fun.

    Nevertheless, it’s a real debate, one which has been repeated many, many times, probably with about the same lack-of-consensus outcome driven by very similar considerations. (The within party variation in the debate is a lot higher than the between party variation. Har har.)

    A few recent papers offer some insight into how to decide which to choose. Though the papers address other considerations (e.g., how one might want to use the model for prediction, or whether one needs to assess the effects of covariates that only vary between units), I will only consider the bias-precision trade-off here. Both papers use similar simulation techniques to evaluate the bias and mean square error (MSE) of FE and RE, and both come to the conclusion that a Hausman test, sadly, is frequently not helpful. But all is not lost.

    In a May 2015 article in Political Science Research Methods [ungated here], Clark and Linzer conclude with the following rules of thumb:

    [CL1] When variation in the independent variable is primarily within units—that is, the units are relatively similar to one another on average—the choice of random versus fixed effects only matters at extremely high levels of correlation between the independent variable and the unit effects.

    This is intuitive because FE only exploits within-unit variation and RE relies on both within- and between-unit variation. When variation is largely within units, they’re both largely driven by the same thing. This takes RE’s greater efficiency off the table as a first-order concern. But one might still worry about bias when the independent variable is highly correlated with unit effects, which are effectively “unobserved” by the RE estimator.

    But,

    [CL2] [w]hen the independent variable exhibits only minimal within-unit variation, or is sluggish, there is a more nuanced set of considerations. In any particular dataset, the random-effects model will tend to produce superior estimates of β when there are few units or observations per unit, and when the correlation between the independent variable and unit effects is relatively low. Otherwise, the fixed-effects model may be preferable, as the random-effects model does not induce sufficiently high variance reduction to offset its increase in bias.

    The intuition here is that when within-unit variation is low and there are few observations, an estimator that is driven only by it (FE) will exhibit high imprecision. RE supplements within-unit variation with between-unit variation, increasing precision. Yay! That’s all fine, until endogeneity concerns start to dominate (high enough correlation between the independent variable and unit effects), at which point FE may start to look worthwhile again.
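    The bias side of this trade-off is easy to see in a toy simulation (a sketch, not the papers’ actual simulation design). Below, unit effects are deliberately correlated with the regressor; a pooled estimator, which like RE draws on between-unit variation, absorbs that correlation as bias, while the within (FE) estimator does not:

```python
import random
from collections import defaultdict

random.seed(0)

# Simulate a panel in which the unit effect u_i loads onto x itself,
# i.e., high correlation between the regressor and the unit effects.
beta = 1.0
n_units, n_obs = 50, 20

x, y, unit = [], [], []
for i in range(n_units):
    u_i = random.gauss(0, 1)             # unit effect
    for _ in range(n_obs):
        x_it = u_i + random.gauss(0, 1)  # corr(x, u) > 0: endogeneity
        y_it = beta * x_it + u_i + random.gauss(0, 1)
        x.append(x_it); y.append(y_it); unit.append(i)

def slope(xs, ys):
    """Simple-regression slope: cov(x, y) / var(x)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

pooled = slope(x, y)  # biased upward: u_i is in both x and the error

# Within (FE) estimator: demean x and y inside each unit, then regress.
gx, gy = defaultdict(list), defaultdict(list)
for i, xv, yv in zip(unit, x, y):
    gx[i].append(xv); gy[i].append(yv)
xw, yw = [], []
for i in gx:
    mx, my = sum(gx[i]) / len(gx[i]), sum(gy[i]) / len(gy[i])
    xw += [v - mx for v in gx[i]]
    yw += [v - my for v in gy[i]]
fe = slope(xw, yw)

print(f"pooled estimate: {pooled:.2f}  (true beta = 1.0)")
print(f"within/FE estimate: {fe:.2f}")
```

    With this design the pooled slope lands well above the true value of 1 while the within estimator stays close to it. Shrink the number of observations per unit and the correlation, and the precision gained from between-unit variation starts to matter instead, per CL2 and DT1.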

    Those rules of thumb aren’t very specific. What does “low” or “high” variation or correlation mean? The specific simulation results, as expressed in handy charts in the paper, help answer that question. I’ll leave it to the interested reader to take a look.

    In PLOS One last October, Dieleman and Templin published a similar paper using similar methods. In addition to FE and RE, they also included simulation analysis of a “within-between” (WB) estimator, which I will leave to you to read about. Here are their rules of thumb, as they apply to FE vs. RE:

    [DT1] Another unique scenario when the RE estimator is consistently MSE-preferred and should be considered is for small samples that have relatively small within-group variation for the variable of interest. Again, in these cases, the imprecision of the FE and WB estimators might be more caustic than the RE estimator’s bias. In simulation cases with less than 500 observations and within-group variation less than 20% of the total variation, RE estimation leads to a smaller absolute error 53% of the time.

    This is the same advice as CL2, above. Sticking to consideration of small samples, the authors write that they

    [DT2] mark the circumstances under which a practitioner might consistently choose precision over bias. […] One scenario is when the estimated model explains a very small portion of the variation in the outcome measurement. When small sample size is combined with a poorly-fit model, the imprecision of FE and WB estimation tends to mislead the researcher more than the bias of RE estimation, even at large ρ [correlation between the independent variable and unit effects]. The goodness-of-fit [] can be explored by examining the R2 statistic associated with [FE] estimation. Considering only simulations with R2<0.5 and less than 500 observations, the traditional RE estimator had a smaller absolute error than the FE estimator 57% of the time.

    This seems to offer a reason to prefer RE, even when endogeneity concerns might seem to be high (large ρ). I haven’t figured out the intuition on this one. Why would failure to explain a considerable amount of variation (R2<0.5) make bias (endogeneity) a lesser concern?

    So much for small sample sizes. What about larger ones?

    [DT3] [A]s a general rule, the larger the sample size, the more a practitioner should avoid traditional RE estimation. Applying FE estimation on all simulated samples with greater than 500 observations led to a median absolute error of 4% of the true marginal effect. RE estimation led to a median absolute error of 8% of the true marginal effect. In simulations with more than 1,000 observations, RE estimation was only MSE-preferred beyond a trivial threshold (0.005) in a very few cases where 90% of variation of y could not be explained by the model.

    This may be a safe general rule but CL1 indicates circumstances when RE estimation is just fine, even at large sample sizes. Again, turn to the paper for specifics.

    Putting all this together, the statistics one should examine to make an FE vs. RE decision include:

    • Sample size in general and number of units and number of observations within units in particular
    • Correlation coefficient between independent variable and unit effects. (Clark and Linzer are explicit that it’s the correlation between unit means of the independent variable and unit effects one should consider, but I think they’re only considering balanced panel data, for which this would be the same thing as the correlation between the independent variable and unit effects.)
    • The proportion of the variance in the independent variable that is within units as opposed to between units.
    • The FE R-squared.

    With these, one can stare at the charts provided by Clark/Linzer and Dieleman/Templin and ponder one’s choices. Or, one could throw a party, invite some economists and biostats geeks and have at it.

    (What did I get wrong in this post? Comments open for one week.)

    @afrakt

     
  • Tools for statistical writing and reproducible research

    This is a methods tutorial on dynamic documents for statistical writers. This will be useful for you if you write expository prose that makes arguments using statistical analyses of empirical data. But it also ties to a bigger topic, of interest to everyone who does or uses science, reproducible research, which I’ll get back to at the end of the post.

    Here is a standard setup for statistical writing. Your prose is in one file (say, a Word file). The statistical code that generates the statistics is in a second file (a .R file if you write in R). And the data that the statistical code processes are in a third file (perhaps a .csv file, or in a database).

    Having these separate files is an excellent thing. But it creates a lot of work, because to present a statistic in your prose, you need to extract it from the output that is produced when you run your statistical code. And the way most people do that is by typing, or cutting and pasting, the numerical result into the prose file. This creates problems.

    In any serious project, the results of statistical analyses change frequently. One reason is that the source data often evolve as new observations are collected or errors are discovered and corrected. These changes in the data affect your statistics, often in ways you may not anticipate. So you need to check and possibly update all the values in your prose. This is tedious, time-consuming, and because it is hand labor, it is prone to error.

    There is a great solution to this problem. You can write your prose file as a source file with embedded statistical programming code. This creates a dynamic document, meaning that it changes in an automated way when you make changes in the data or your statistical code.

    R has a fantastic free programming interface called RStudio that makes dynamic documents very easy. The key is a nifty R package called knitr. You write and format your prose in R Markdown, a dialect of the markdown text formatting language. For example, here is a sentence from a paper on the cost of health care:

    The average per member per month cost was $`r mean(PMPM)`.

    There is embedded statistical code at the end of the sentence. I’ll explain it from the inside out. PMPM refers to a vector of data (i.e., a variable) on per member per month costs in dollars. Somewhere else in your R Markdown file, you will have embedded code that reads in your data, including the values for PMPM. Next, mean() is an R function that calculates an average. Finally, the `r ` delimiters identify what they enclose as a statement in R.

    So having written the code for your statistics right into your text, all you need to do is press a button in RStudio and knitr will process your R Markdown file into a Word, HTML, or LaTeX file. Let’s say that the average cost of care is $179.35 per member per month. When the above sentence is evaluated, the result is a formatted file of prose with this sentence:

    The average per member per month cost was $179.35.

    R Markdown has easy syntax for headers, italics, and most of the other things you need for word processing (using reference databases is still a bit of a challenge). Of course, if your prose only has one average in it, a dynamic document is overkill. But your paper will more likely quote many statistics. If all the statistics in your paper are done this way, then you can instantly update the entire paper any time you change a line of your code or an entry in your data file. No more copying and pasting. When you get skilled at this, you will save a huge amount of labor through automation.
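    For readers outside the R ecosystem, here is a rough Python analogue of knitr’s inline evaluation, with toy data chosen to reproduce the $179.35 from the example above (knitr’s actual mechanism is more sophisticated; this just shows the idea of computing the statistic at build time rather than typing it):

```python
# The statistic is computed from the data when the document is built,
# never typed by hand. PMPM here is a toy stand-in for the real data.
PMPM = [150.20, 210.15, 177.70]  # hypothetical per-member-per-month costs

mean_pmpm = sum(PMPM) / len(PMPM)
sentence = f"The average per member per month cost was ${mean_pmpm:.2f}."
print(sentence)  # The average per member per month cost was $179.35.
```

    Change an entry in PMPM and the sentence updates itself on the next build, which is the whole point.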

    What if you don’t use R? To my knowledge, there is nothing like this for SAS or SPSS. There seems to be something in progress for Stata users (and see this great book by Scott Long about carrying out reproducible analyses in Stata). I hear good things about IPython notebooks — if you know more, write me.

    Embedding statistical code in your prose source file will save you time and protect you from errors. But it also advances the cause of reproducible research. There is widespread concern that science is suffering a crisis in replicability. We want science to be objective, and this means that if two skilled scientists attempt the same empirical study, they ought to get the same results. Too often, published research fails to meet that criterion.

    The replicability problem in statistical or econometric research is that two skilled analysts, starting from (what is purported to be) the same data set, come up with different answers (the famous Rogoff and Reinhart controversy is reported here). Actually, coming back to a project after some months’ interruption, I have found it impossible to replicate my own analyses.

    It’s easy to understand why it is hard to replicate data analyses, for all the reasons described above. Data change and analyses have many steps. For years, I told myself to do a better job of documenting my code. But this was extremely challenging; I know very few people with the time or discipline to do it effectively.

    But when you create a dynamic document with embedded code, your prose documents itself. That is, the embedded code creates a clear trail from a statistic quoted in the text back to the code, and from the code back to a specific data set. (You must, however, have a set of practices that preserves a copy of that data set in the state it was in when the research was published.)

    In theory, this means that my work will be completely reproducible if I share my dynamic document and the underlying data file. In practice, I can only share my work publicly to a very limited degree. Almost all of my research involves data on potentially identifiable patients, which can’t be shared. But I can, at least, show you the exact code that generated each statistic.

    @Bill_Gardner

    For a previous TIE post on reproducible research, see here. For a journal policy requiring documentation of statistical analyses to advance reproducible research, see here.

     
  • Methods: Propensity scores

    Forthcoming in Health Services Research (and available now via Early View), Melissa Garrido and colleagues explain propensity scores. I’ve added a bit of emphasis on a key point.

    Propensity score analysis is a useful tool to account for imbalance in covariates between treated and comparison groups. A propensity score is a single score that represents the probability of receiving a treatment, conditional on a set of observed covariates. […]

    Propensity scores are useful when estimating a treatment’s effect on an outcome using observational data and when selection bias due to nonrandom treatment assignment is likely. The classic experimental design for estimating treatment effects is a randomized controlled trial (RCT), where random assignment to treatment balances individuals’ observed and unobserved characteristics across treatment and control groups. Because only one treatment state can be observed at a time for each individual, control individuals that are similar to treated individuals in everything but treatment receipt are used as proxies for the counterfactual. In observational data, however, treatment assignment is not random. This leads to selection bias, where measured and unmeasured characteristics of individuals are associated with likelihood of receiving treatment and with the outcome. Propensity scores provide a way to balance measured covariates across treatment and comparison groups and better approximate the counterfactual for treated individuals.

    Propensity scores can be thought of as an advanced matching technique. For instance, if one were concerned that age might affect both treatment selection and outcome, one strategy would be to compare individuals of similar age in both treatment and comparison groups. As variables are added to the matching process, however, it becomes more and more difficult to find exact matches for individuals (i.e., it is unlikely to find individuals in both the treatment and comparison groups with identical gender, age, race, comorbidity level, and insurance status). Propensity scores solve this dimensionality problem by compressing the relevant factors into a single score. Individuals with similar propensity scores are then compared across treatment and comparison groups.
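    As a toy illustration of that matching step (made-up numbers; a real analysis would estimate the scores, typically with logistic regression, and check covariate balance afterward):

```python
# Each record is (propensity score, outcome). The scores stand in for
# many covariates compressed into one number.
treated  = [(0.30, 12.0), (0.52, 15.5), (0.71, 18.0)]
controls = [(0.28, 11.0), (0.50, 14.0), (0.69, 16.5), (0.90, 20.0)]

def nearest_control(score):
    """Match on the propensity score alone, not on every covariate."""
    return min(controls, key=lambda c: abs(c[0] - score))

# Average treated-minus-matched-control outcome: an effect-on-the-treated
# estimate, valid only if there is no unobserved confounding.
diffs = [y - nearest_control(p)[1] for p, y in treated]
att = sum(diffs) / len(diffs)
print(f"ATT estimate: {att:.2f}")  # 1.33
```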

    Propensity scores are a useful and common technique in the analysis of observational data. They are, unfortunately, sometimes misunderstood as a way to address more types of confounding than they are capable of addressing. In particular, they can only address confounding from observable factors (“measured” ones, in the above quote). If there’s an unobservable difference between treatment and control groups that affects the outcome (e.g., genetic variation about which researchers have no data), propensity scores cannot help.

    It is important to keep in mind that propensity scores cannot adjust for unobserved differences between groups.

    Only an RCT or, with assumptions, natural experiments and instrumental variables approaches can address confounding due to unobservable factors. I will return to this issue.

    I’m deliberately not covering implementation issues and approaches in these methods posts, just intuition, appropriate use, and issues of interpretation. If you want more information on propensity scores, read the paper from which I quoted or search the technical literature. Comments open for one week for feedback on propensity scores or pointers to other good methods papers.

    @afrakt

     
  • Methods: P values

    JAMA is running a guide to statistics and methods series. I come across methods tutorials in other journals from time to time as well. I think I’ll start excerpting and pointing readers to them.

    Let’s start with P values, as discussed recently in The BMJ. The setting is an examination of birth weight of infants whose mothers had been randomized to receipt of a certain diet (low glycemic index, but that doesn’t matter) or not.

    The P value for the statistical test of birth weight was P=0.449. The P value represents the proportion of the theoretical infinite number of samples—that is, 0.449 [44.9%]—that have a mean difference in birth weight equal to, or greater than, that observed in the trial above. This is irrespective of whether the mean birth weight was higher or lower for the intervention group than for the control group. More formally, the P value is the probability of obtaining the observed difference between treatment groups in mean birth weight (or a larger one), irrespective of the direction, if there was no difference between treatment groups in mean birth weight in the population, as specified by the null hypothesis. The P value for the statistical test of the primary outcome of birth weight was P=0.449, which was larger than the critical level of significance (0.05). Hence there was no evidence to reject the null hypothesis in favour of the alternative. The inference is that there was no evidence that the intervention and control treatments differed in mean birth weight in the population.

    Usually when I read a P value (or any statistic), I try to get my mind to interpret it according to the definition. I try not to let my mind wander into other (false) characterizations. For instance, I would read P=0.449 as, “Assuming the null hypothesis to be true (i.e., no effect of the diet), the probability of obtaining at least the observed weight difference between diet and control groups is 0.449.” (Secret: Because of the wishy-washy language in most papers, I’m often confused by the reporting of hypothesis tests and the only reliable way I’ve found to understand them is to go back to the definition.)
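    The definition can also be made concrete by simulation: build the null world, rerun the trial many times, and count how often chance alone produces a mean difference at least as large as the observed one. The group sizes, standard deviation, and observed difference below are hypothetical, chosen only so the simulated P value lands near the trial’s 0.449:

```python
import random

random.seed(1)

# Hypothetical trial: two groups of 100, within-group SD of 400 g,
# observed mean birth weight difference of 42.8 g.
observed_diff = 42.8
group_size = 100
sd = 400.0

def null_mean_diff():
    """One trial in a world where the diet truly has no effect."""
    a = [random.gauss(0, sd) for _ in range(group_size)]
    b = [random.gauss(0, sd) for _ in range(group_size)]
    return abs(sum(a) / group_size - sum(b) / group_size)

n_sims = 5000
p_value = sum(null_mean_diff() >= observed_diff for _ in range(n_sims)) / n_sims
print(f"simulated two-sided P value: {p_value:.3f}")  # roughly 0.45
```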

    I could write a lot more here, and I nearly did. But I’m going to try to keep these methods posts very short and focused. So, I’ll stop. Read The BMJ paper for more on P values, though you can find information elsewhere I’m sure. Also, I’m opening up comments to discuss P values and to solicit pointers to other good, simple methods papers. Feel free to provide additional resources. (Comments automatically close one week from the post’s time stamp.)

    @afrakt

     
  • On Piketty and spreadsheets

    The following appeared on The Upshot (copyright 2014, The New York Times Company).

    Like Carmen Reinhart and Kenneth Rogoff before him, Thomas Piketty has had questions raised about his analysis; in his case, his work on wealth inequality. Though I can’t knowledgeably comment on the questions or the analysis, I can comment on the technology that Ms. Reinhart, Mr. Rogoff and Mr. Piketty chose to do their work: the spreadsheet. This choice can increase the chances of error in complex analysis, but it also can make finding errors by nonexperts easier.

    Roughly speaking, one can think of economic analysis as taking one of two forms: It’s either descriptive or multivariate. Descriptive work is simple, which is not a criticism, because it can also be correct and powerful. It’s basically what can be easily illustrated and understood with a chart. Multivariate work can be very complex, though no less powerful to those who can understand it. As the name suggests, it’s an analysis that involves many variables simultaneously. And just because it’s complex, that doesn’t make it right. But there is a right way to do it, and it’s not with a spreadsheet.

    The process of going from original data to the conclusions of a multivariate analysis is not easily conveyed graphically. It is, instead, essentially algorithmic. That is, conclusions are reached by starting with data and then applying a sequence of steps to arrive at answers to questions of interest. These steps can be and should be written down clearly and unambiguously and, for a computer to follow them, they must be. If this sounds like computer programming, it is. Modern, applied social science relies heavily on programming. It should.

    But it can’t with a spreadsheet (like Excel), because a spreadsheet isn’t primarily designed to be used that way. Its strength is that it makes visualization and manipulation of numbers easy to do with little training. It’s sort of a glorified standard calculator — the kind you undoubtedly have at home and use to balance your checkbook. This is also its weakness, because its simplicity has a cost: spreadsheets hide the details. They don’t make the sequence of steps in any analysis as transparent as they could be. They’re there, but they’re not front-and-center. This makes discerning what they are difficult and invites error.

    Try this puzzle: With a standard calculator, I started with the number 6, did some analysis to answer a specific question, and ended up with the number 28 as my result. What sequence of steps did I take to get there? If you think you know what they are, you’re almost certainly wrong. There is an infinity of ways, and a standard calculator doesn’t reveal which one I used. To be sure, the steps exist. But they’re in my head, and you’d have to do more work (like interview me) to discover them. I’m also likely to forget them. This might seem unimportant, because I have the answer: 28. But how do you or I know it is the correct one? The best way to convince ourselves of that is to look at the sequence of steps and check that they make sense. But we can’t do that easily. They’re hidden from view.

    A spreadsheet is only slightly better than this at revealing the process of analysis. You can make it out, but barely. You have to really work at it. That not only makes it hard for others to assess what one does to data, it makes it hard for even the creator of that spreadsheet to keep track of what he or she has done and to see and fix errors.

    For complex analysis, what social scientists usually do instead is write analysis steps in a statistical programming language, of which there are many. Such a program is like a recipe, one anyone familiar with the language can read. It says precisely how you go from raw ingredients (the data) to final product (the answer). Moreover, one can annotate such programs with plain-language descriptions of steps, making them even easier to understand and to find and fix errors. Analysis written out this way makes plain what has been done and why. Errors are far easier to find and fix than they would be in a spreadsheet.
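    As a sketch of what such a "recipe" looks like, here is a minimal annotated analysis script in Python, with invented data and cutoffs; the point is that every step from raw input to answer is written down and can be read, checked, and rerun.

```python
# A minimal, annotated analysis "recipe" (hypothetical data; Python standing
# in for any statistical programming language). Every step from raw data to
# final answer is explicit and reproducible.

import statistics

# Step 1: start from the raw data (here, a made-up list of incomes).
raw_incomes = [31000, 45000, 28000, 52000, 39000, 61000, 999999]

# Step 2: clean the data -- drop implausible values above a stated cutoff.
cleaned = [x for x in raw_incomes if x < 500000]

# Step 3: compute the summary statistic of interest.
median_income = statistics.median(cleaned)

print(median_income)
```

    A reader can dispute any step (is the cutoff in step 2 justified?), but only because the step is visible; in a spreadsheet the same cleaning decision might live silently in a cell formula.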

    But Mr. Piketty’s work is not complex and multivariate. It’s fairly simple. And for that, a spreadsheet is a reasonable choice. Moreover, because advanced training is not required to examine a spreadsheet, by working in one, and sharing it, Mr. Piketty made it possible for more people to check his work. That’s praiseworthy.

    If the allegations hold up, Mr. Piketty may have made some errors in his spreadsheet. But the choice of that tool is not to blame for them. Were his work more complex, he’d likely have been better off using a statistical programming language. But it isn’t, and a spreadsheet is just fine.

    @afrakt

     
  • Comparing the Massachusetts mortality study to the Oregon Medicaid study

    Many people are contrasting the Oregon study—which didn’t find statistically significant effects of coverage on biomarkers associated with physical health—with the new Massachusetts study—which found a statistically significant mortality benefit of coverage. How could these two findings coexist in a rational world? There are various hypotheses, one of which is that the Oregon study was underpowered (too small a sample size) to find physical health effects, as we’ve documented. (There are so many posts on this. Here’s just one.)

    Just to illustrate the difference in power of the two studies, I did a thought experiment. What if we presume that the same mortality effect that was found in the Massachusetts study applied in the Oregon case? Would the Oregon study have been able to detect it with statistical significance?

    The answer is no, and it's not even remotely close. The Oregon (OR) study had a sample size about 100 times smaller than that of the Massachusetts (MA) study. That means the error bars would have been about 10 times larger. Here's what that looks like:

    [Figure: error bars for the hypothetical OR study result vs. the MA study result]

    The MA study found that mortality associated with the MA health reform was 2.9% below that of the study’s control group. The 95% confidence interval is from -4.8% to -1%. This is illustrated with the lower bar in the figure, marked “MA study.” The center of that bar is 2.9 units below 0, as indicated. The error bars do not overlap zero. This is a statistically significant effect.

    Now look at the hypothetical OR study "result." The error bars are huge, about 10 times bigger than those of the MA study, overlapping zero by a country mile. The 95% confidence interval runs from -29.9% to 16.1%. Such a result would not be statistically significant and, for good reason, no such thing was published. There was insufficient power (sample size) to find anything on mortality that we didn't already know. Such a finding would leave us scratching our heads as to whether health insurance improved mortality by nearly 30%, made people 16% more likely to die, or something in between. I think it's safe to say that most people already have no trouble believing that if health insurance does anything, its effect is somewhere between 30% more likely to save lives and 16% more likely to kill people.
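    The rough arithmetic behind the comparison: the standard error of an estimate shrinks with the square root of the sample size, so a sample about 100 times smaller yields error bars about 10 times wider. A sketch, using hypothetical sample sizes chosen only to illustrate the factor-of-100 ratio:

```python
# Error bars (standard errors) scale with 1 / sqrt(sample size), so a sample
# ~100x smaller gives error bars ~10x wider. These sample sizes are
# hypothetical stand-ins, not the actual study counts.

import math

n_ma = 100_000  # stand-in for the MA study's sample size
n_or = 1_000    # stand-in for the OR study's, ~100x smaller

# Ratio of error-bar widths: square root of the ratio of sample sizes.
width_ratio = math.sqrt(n_ma / n_or)
print(width_ratio)  # 10.0
```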

    I am not suggesting it's not worthwhile to compare the MA and OR study results. I'm just saying we should be mindful of the tremendous differences in sample size, as well as the statistical methods and regional contexts, of the two studies. Burn the above chart into your head. If that fails, consider a tattoo.

    @afrakt 

     
  • The Daily Bayesian

    The article in Nature is worth a read.

     