Health spending growth is crowding out state and local spending in other areas of need. But Medicaid expansion is not to blame. What is? Click to read the answer in my post on the AcademyHealth blog.
In an ideal health care system, you’d get the same (very good) care whether you were admitted to a hospital on a Monday, Wednesday, Friday, or Sunday. We don’t have an ideal health care system, and it turns out that day of admission matters. A new paper by Ann Bartel, Carri Chan, and Song-Hee Kim illustrates this fact, and then exploits it as an instrumental variable (IV) in an analysis of mortality and hospital readmissions.
Prior work by Varnava et al. (2002) and Wong et al. (2009) showed that hospitals would rather not keep patients over the weekend if they can discharge them on a Friday. Examining three hospitals in the UK, Varnava et al. found that discharges were most common on Fridays. Considering a hospital in Toronto, Wong et al. found that “[w]eekend discharge rate was more than 50% lower compared with reference rates whereas Friday rates were 24% higher. Holiday Monday discharge rates were 65% lower than regular Mondays, with an increase in pre-holiday discharge rates.”
Bartel, Chan, and Kim found something similar among US Medicare patients hospitalized for heart failure (HF), pneumonia (PNE), or acute myocardial infarction (AMI) in 2008-2011. The following chart from their paper plots the logarithm of length-of-stay (LOS) versus admission day-of-week for HF patients, controlling for age, gender, race, comorbidities, receipt of surgery, enrollment in Medicare Advantage, seasonality, and hospital fixed effects. (That’s why the figure’s caption calls this a “residual.”) As shown, HF patients admitted on Sunday-Tuesday have shorter lengths of stay than those admitted on a Wednesday-Saturday. A similar pattern exists for PNE and AMI patients.
Why? The hypothesis is that there is an incentive to get patients out of the hospital before the weekend, unless it’s pretty clear they’ll need to stay through the weekend. This could be due to patient demand (e.g., they want, or their family wants them, to be home on weekends). Or it could be due to provider factors (e.g., reduced weekend staffing makes it harder for the hospital to provide care or plan discharges). Also, under the diagnosis-based payment system that Medicare uses, an avoidable extra day is all cost and no additional revenue.
Whatever the reason, if admission day is random with respect to outcomes, it could be a good instrument: a way to estimate a causal relationship between length of stay and outcomes like mortality or hospital readmissions. If admission day is a good instrument, stratifying by it should balance observable factors, like comorbidities. If, for example, patients admitted earlier in the week are also sicker, then their outcomes could be worse not because they are discharged earlier (before the weekend) but because of their more severe illness, invalidating the instrument.
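The logic of such an instrument can be sketched on simulated data. Everything below is illustrative, not the authors’ actual specification: the variable names, effect sizes, and simple linear model are my own assumptions, chosen only to show why naive regression is biased by unobserved severity while the instrument recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Unobserved severity confounds both length of stay and readmission.
severity = rng.normal(size=n)

# Hypothetical instrument: 1 if admitted early in the week (Sun-Tue),
# assumed independent of severity but shifting length of stay (LOS).
early_week = rng.integers(0, 2, size=n).astype(float)

# LOS depends on the instrument and on severity (the confounder).
los = 5.0 - 1.0 * early_week + 1.5 * severity + rng.normal(size=n)

# True causal effect of one extra day on the readmission index: -0.07.
readmit = 0.5 - 0.07 * los + 0.20 * severity + rng.normal(scale=0.5, size=n)

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Naive OLS is biased because severity is omitted.
naive = ols_slope(los, readmit)

# Wald/2SLS estimate: reduced form divided by first stage.
first_stage = ols_slope(early_week, los)
reduced_form = ols_slope(early_week, readmit)
iv = reduced_form / first_stage

print(f"naive OLS: {naive:+.3f}, IV: {iv:+.3f} (truth: -0.070)")
```

The naive slope lands well away from the truth (severity raises both LOS and readmission), while the ratio of the reduced-form to the first-stage slope recovers roughly -0.07. This hinges entirely on the instrument being unrelated to severity, which is what the falsification tests below probe.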
In principle, it isn’t absolutely necessary that observable factors like comorbidities be balanced across values of the instrument, because they can be controlled for. However, if observable factors are not balanced across instrument strata, that should reduce our confidence that unobservable factors are balanced, which is the key assumption of IV. So checking balance on observables, like comorbidities, is a falsification test that every IV study should include. (If one’s theory suggests there ought not be balance on some specific observables, we might forgive that, and the analysis should control for them. But there must be some observables on which balance occurs, or else why should we believe it holds for all unobservables correlated with outcomes?)
This falsification test is a direct analog of the typical "Table 1" in a publication of RCT results. A standard Table 1 shows balance of observable factors across treatment/control arms. If you ever saw an unbalanced Table 1, you’d suspect a breakdown in the randomization. The study would be fatally flawed. Well, one can and should do this type of test with IV too.
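A minimal sketch of such a balance check, again on simulated data with hypothetical names and prevalences of my own choosing (not the paper’s data or methods): compare a comorbidity across instrument strata with a two-sample t-test, once under a valid (as-good-as-random) instrument and once when sicker patients are admitted earlier in the week.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical comorbidity indicator (e.g., ~30% prevalence).
comorbidity = rng.binomial(1, 0.30, size=n)

# Scenario A: admission day is as-good-as-random, so balance is expected.
early_week_random = rng.binomial(1, 0.40, size=n)

# Scenario B: sicker patients are admitted earlier in the week,
# which would invalidate the instrument.
p_early = np.where(comorbidity == 1, 0.50, 0.35)
early_week_biased = rng.binomial(1, p_early)

def balance_pvalue(instrument, covariate):
    """p-value from a two-sample t-test of a covariate across instrument strata."""
    a = covariate[instrument == 1]
    b = covariate[instrument == 0]
    return stats.ttest_ind(a, b).pvalue

p_random = balance_pvalue(early_week_random, comorbidity)
p_biased = balance_pvalue(early_week_biased, comorbidity)
print(f"balance p-value, random instrument: {p_random:.3f}")
print(f"balance p-value, biased instrument: {p_biased:.3g}")
```

In scenario B the test rejects balance decisively, which is exactly the signal that should make us doubt the instrument; a real study would run this across many comorbidities and, as here, stratify by patient subgroup.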
Considering HF patients, Bartel, Chan, and Kim do find balance of comorbidities when stratified by Sunday/Monday admissions versus admissions on any other day, but only for those with greater severity of disease. The reason could be that day of admission is more random for high severity patients; they may have less control over when they enter the hospital than other, less severely ill patients, the relatively sicker* of whom seem disproportionately to be admitted on Sundays and Mondays. Therefore, their instrument is probably not valid for less severe HF patients. A similar falsification test did not reject the validity of the instrument for the AMI and PNE study cohorts.
Main lesson: Do falsification tests. Adjust analysis accordingly.
The paper’s principal results are as follows:
- “For HF patients with high severity, one more hospital day decreases readmission risk by 7%. This relationship between LOS and readmissions does not exist for PNE or AMI patients, but we show that longer LOS can reduce their mortality risks by 22% and 7% respectively.”
- “Keeping all FFS [Medicare fee for service] PNE patients in the hospital for one more day would save 19,063 lives [over four years].”
- “Keeping all FFS AMI patients in the hospital for one more day saves 2,577 lives [over four years].”
These results suggest that discharges timed to avoid weekends, which shorten LOS, harm patients, as does shorter LOS in general (at the margin examined). However, we should believe these results only to the extent we believe the instrument. The falsification tests in the paper should increase our confidence in the validity of the findings.
* To help you parse this: I’m talking about the relatively sicker among the less severely ill subset. This is bloggerrifically vague, but details are in the paper.
Ultrasound is better, but CT scans are the norm for suspected kidney stones. Here’s a streamlined version of the American College of Physicians summary of a new study that backs this up.
The study included 2,759 patients who presented with suspected kidney stones to 15 geographically diverse academic hospital EDs, 4 of which were “safety net” hospitals. Patients were randomized to 1 of 3 groups: point-of-care ultrasonography performed by an emergency physician, ultrasonography performed by a radiologist, or abdominal CT. [...]
The study found a 0.4% rate (11 patients) of high-risk diagnoses with complications within 30 days, and this did not vary significantly by imaging method. [...]
The mean 6-month cumulative radiation exposure was significantly lower in the ultrasonography groups than in the CT group (10.1 mSv and 9.3 mSv vs. 17.2 mSv; P < 0.001). The radiation in the ultrasound groups resulted from some patients going on to have additional testing, some of which included CTs. Median length of stay in the ED was significantly longer in the radiology ultrasound group: 7.0 hours compared to 6.3 hours in the ED ultrasound group and 6.4 hours in the CT group (P < 0.001 for radiology versus each of the other groups). Return ED visits, hospitalizations, and diagnostic accuracy did not differ significantly among the groups. [...] There was no significant difference in results between those with and those without complete follow-up.
The authors emphasized the results do not suggest that patients undergo only ultrasound imaging, but rather that ultrasonography should be used as the initial diagnostic imaging test, with further imaging studies performed at the discretion of the physician.
The study is here and an accompanying editorial is here. The next time I go to the ED with a suspected kidney stone—if there is a next time—I intend to bring both these papers with me, or pull up this post on my phone. A stone maker shouldn’t die of cancer induced by CT studies for treatment of stones. That would certainly not be “doing no harm.”
In frustration, I storify-ed some tweets about this.
People often ask me how I “do it all.” I think they mean all the blogging, on top of my regular job as a researcher. The simple answer is, I work a lot, much of it in short intervals of time away from my office.
But I very much doubt I work more than the average person who asks, “How do you do it all?” It’s just that a substantial amount of my work product is highly visible: the blogging. I think that gives the impression that I’m doing more in less time.
For all that, I may, in fact, manage time well, as others have told me for years. People have asked me for time management tips since I was in high school. As have Tyler Cowen and some of the “most productive people on the planet”, I’ve written some down for you below, in no particular order. These are just some aspects of how I generally work and live, only some of which may enhance my productivity.
- I do not work to deadlines. I start early and revise often. I first drafted this post five days ago.
- I keep my pipeline full. I always have stuff to do, to write about, to read. I don’t wonder, “What should I write? What should I read?” I have lists.
- I have many things in process at once. For example, at the moment I have over a dozen posts for various outlets in different states of completion. Some are done. Others are lists of links or notes.
- I protect the morning for the hardest work of the day, requiring the greatest concentration. I try to schedule meetings and calls for the afternoon. I read papers in the afternoons or evenings, with one major exception (see next item).
- I don’t drive on my commute. I walk and take the train. Walking (up to 6 miles per day) replaces what would otherwise be time spent at a gym or similar. During my commute, I catch up on news and, yes, some entertainment by podcast (at 2x speed—people speak too slowly). I read and take care of email on the train.
- I use Twitter, but mindfully. When I don’t have time for it, I ignore it. When I need a short break, I look at it. This has the advantage of combining some entertainment (which is what I seek during a break) with a lot of valuable information (given whom I follow). What feels like a break ends up being more useful, without my even noticing.
- Otherwise, I don’t read a lot of “news.” I read nothing out of a sense of obligation. I skim things in my RSS reader, sometimes flip through The New York Times online or in an app.
- I stop reading, or just skim ahead in, things that are not well written, don’t speak to me, or don’t teach me anything. Sometimes I read posts and articles backwards (last paragraph, next to last, and so forth). I’m hunting for the incremental update.
- I watch little TV. I miss most movies.
- I typically reply within minutes to any email I intend to respond to at all (except when circumstances do not allow). This probably isn’t productivity enhancing. I just think my colleagues and friends appreciate the responsiveness. Providing it makes me feel nice and useful.
- Unless I have a unique take, I don’t write about things that many others are writing about.
- I seek feedback on my products, listen to it, and make changes as warranted.
- I’m nearly completely paper free. All my work products’ inputs and outputs are electronic and in the cloud. Same goes for life management tools like my calendar.
- I rarely take notes. When I do, they’re either electronic to begin with or transferred to electronic rapidly and where they need to be for future use.
- I try to remember where to find useful information, rather than trying to remember all the useful information. This is why I blog and tweet. They’re searchable memory aids. Also, I am fortunate to have access to highly reliable, external (human) memory.
- I ignore most office and institutional politics, skip every possible meeting, and don’t pay close attention at all times in most of those I attend. (These habits can be potentially dangerous. I have some protective workarounds, which rely on the skills, interests, and good will of others. Gains from trade.)
- I don’t take calls that are not pre-arranged and with people I want to talk to. I don’t listen to voicemail promptly, if at all.
- I say “yes” only to things I feel I can do well given the amount of time I think is asked of me. (Doing something well implies I want to do it.)
- When I take breaks, they are real breaks, without guilt. When needed, I have blown off weeks of evenings playing video games or reading novels. I take internet-free vacations. I trust myself that my motivation to work hard will return, but don’t force it. It always works out. (This takes practice.)
- I love to learn and write. I don’t try to do it. I feel a need for it. Then I just do it.
Two final points: Information in any form (reading, TV, podcasts/radio, the content of meetings, emails, and so forth) is almost entirely entertainment, with little lasting informational value. How much do you recall from a book you read three years ago, a movie you saw one year ago, an hour-long conference call you were on last month, an article you read last week, or a radio program you listened to three days ago? How much can you write down about it? What was the key point or message? With few exceptions, what took many minutes or hours to consume has been converted to, at most, a few sentences of information in your long-term memory. The rest of the information is not retained. From a long-term perspective, most of what you consumed was filler, momentary entertainment (if that), packaging, art, which is all fine and good, but not necessarily memorable information. For gathering information of long-term value, skipping or skimming the likely non-memorable parts and finding ways to codify in a searchable form the important, new information is more efficient, though not necessarily easy. (If one is seeking entertainment, inspiration, and the like, this is not applicable. I like art too!)
Finally, there are many other ways to be productive and types of productive people. Some of my very productive cobloggers work in very different styles, for instance. It leads me to suspect that one is not productive because of one’s methods, but one is simply productivity-oriented first and then develops personalized methods to suit.
The study is by Mehrdad Roham and colleagues:
We find that both the overall volume of services provided per capita and the average cost of these services decreased over our data period, once account is taken of changes in the age distribution of the population (the calculations relate to an age-standardized population) and in prices (all fees are expressed in constant dollar terms, using the consumer price index). However, these decreases are concentrated in services that have low HTI [Health Technology Intensity] and, to a lesser extent, medium HTI; over the same period, the average (age-standardized) number of services for high HTI increased by 55 percent and their share by 7.4 percentage points. We find also that whereas the decreases in the volume and cost of low and medium HTI services took place fairly uniformly across all age groups, the increases in high HTI were concentrated in the middle age groups and, more especially, in the old age groups.
The results suggest two main policy implications. First, technological change and its diffusion within the population are too important to ignore: decision makers (and the policy discussion) should focus on how the delivery of care is changing while, at the same time, accounting for the effects of external changes (such as population aging). Second, health technology assessment should be based on real-life ex-post studies of how health technologies are used by doctors and patients rather than on ex-ante studies of how they should be used. That would help health policy analysts and researchers to gain a better understanding of the relationships between aging populations and the relative distribution of spending on health care for different levels of health technological intensity. Taking into account the observed changes in the use of technology in relation to patient age will also help to produce better predictions of future health care expenditures. However, the important questions of whether the observed changes are warranted, in the sense of leading to better patient outcomes and being cost effective, are ones that we are not able to address. It would be of great analytical and policy interest to have records that include information about patient outcomes following procedures, and not just the procedures themselves.
The bit in bold (added) is a key point that many overlook. Many look to new technologies to cut costs and improve outcomes. That’s how they’re marketed. And, they very well may do so if their use is restricted to the subset of the population for which they’re ideally suited and designed. But what is typical is that technology diffuses more broadly than efficient use would warrant, in part because it’s good business. That ends up turning valuable technology into waste (or, more accurately, valuable for some, wasteful for others). And this is why I’m deeply skeptical of claims that any technology will actually cut costs and improve outcomes, on average, even if it does so for some.
Do you want chemo and three months of life, or six weeks of life without the nausea and vomiting that the chemo causes? Do you want high-risk open-heart surgery, with a fifteen-per-cent risk of dying during the operation, or would you rather continue as you are, with a fifty-per-cent chance you will be dead in two years? Do you want a prostatectomy, which has a five-per-cent chance of impotence and incontinence, or radiation, with a three-per-cent chance of leaving a hole in your rectum, or would you rather “watch and wait,” with the chance that your cancer will never grow at all?
That’s from Lisa Rosenbaum’s July 2013 piece in The New Yorker on shared decision making. Her most recent piece, which I also enjoyed, is this one on the relationship between extreme exercise and heart damage. It hits close to home because my wife will run her second 50k next month. Training alone includes several marathons over a few-week span. This, to me, is unfathomable.
Here’s another terrific piece by Lisa that taught me a great deal about stenting and helpful vs. unnecessary care. (This is saying a lot since I know quite a bit about this stuff already.)
It was in these gaps between data and life where I lost Sun Kim. There is no guideline that says, “This is how you manage an elderly man who asks nothing of anyone, who may or may not be taking his medications, and who has difficulty coming to see you because he vomits every time he gets on the bus.” In a world with infinite resources, we could conduct clinical trials to address every permutation of coronary disease and every circumstance. But that’s not the world we live in. And in our world, I reached a point where I could not keep Sun Kim out of the hospital.
The rest of Lisa’s pieces are here. I was not aware of her and her work until relatively recently or I’d probably have referenced it many times by now.
The paper by Thomas Cook, William Shadish, and Vivian Wong, “Three Conditions under Which Experiments and Observational Studies Produce Comparable Causal Estimates: New Findings from Within-Study Comparisons,” makes some good points. Below I quote from their paper, referencing some of my prior posts that express similar sentiments.
At least in some disciplines, randomized designs have a “privileged role,” supported by education and the research establishment.
The randomized experiment reigns supreme, institutionally supported through its privileged role in graduate training, research funding, and academic publishing. However, the debate is not closed in all areas of economics, sociology, and political science or in interdisciplinary fields that look to them for methodological advice, such as public policy. [...] Alternatives to the experiment will always be needed, and a key issue is to identify which kinds of observational studies are most likely to generate unbiased results. We use the within-study comparison literature for that purpose.
We should not expect results from observational studies with strong designs for causal inference to match those from experimental approaches in all cases.
But the procedure used in these early studies contrasts the causal estimate from a locally conducted experiment with the causal estimate from an observational study whose comparison data come from national datasets. Thus, the two counterfactual groups differ in more than whether they were formed at random or not; they also differ in where respondents lived, when and how they were tested, and even in the actual outcome measures. [...] The aspiration is to create an experiment and an observational study that are identical in everything except for how the control and comparison groups were formed. [...] We should not confound how comparison groups are formed with differences in estimators.
Past within-study comparisons from job training have been widely interpreted as indicating that observational studies fail to reproduce the results of experiments. Of the 12 recent within-study comparisons reviewed here from 10 different research projects, only two dealt with job training. Yet eight of the comparisons produced observational study results that are reasonably close to those of their yoked experiment, and two obtained a close correspondence in some analyses but not others. Only two studies claimed different findings in the experiment and observational study, each involving a particularly weak observational study. Taken as a whole, then, the strong but still imperfect correspondence in causal findings reported here contradicts the monolithic pessimism emerging from past reviews of the within-study comparison literature.
RCTs are simple to explain, but that’s just one criterion and not the most important one.
[Observational methods] do not undermine the superiority of random assignment studies where they are feasible. Th[ose] are better than any alternative considered here if the only criterion for judging studies is the clarity of causal inference. But if other criteria are invoked, the situation becomes murkier. The current paper reduces the extent to which random assignment experiments are superior to certain classes of quasi-experiments, though not necessarily to all types of quasi-experiments or nonexperiments. Thus, if a feasible quasi-experiment were superior in, say, the persons, settings, or times targeted, then this might argue for conducting a quasi-experiment over an experiment, deliberately trading off a small degree of freedom from bias against some estimated improvement in generalization.
But we should be concerned about accepting bad designs because they either (1) are simple or (2) have shown themselves to match RCTs in a different setting. We need to evaluate each design in the context of the particular questions being asked in each study.
For policymakers in research-sponsoring institutions that currently prefer random assignment, this is a concession that might open up the floodgates to low-quality causal research if the carefully circumscribed types of quasi-experiments investigated here were overgeneralized to include all quasi-experiments or nonexperiments. Researchers might then believe that “quasi-experiments are as good as experiments” and propose causal studies that are unnecessarily weak. But that is not what the current paper has demonstrated. Such a consequence is neither theoretically nor empirically true but could be a consequence of overgeneralizing this paper.
Even those of us who argue these points probably agree on this:
We suspect that few methodologically sophisticated scholars will quibble with the claim that [...] the notion that understanding, validating, and measuring the selection process will substantially reduce the bias associated with populations that are demonstrably nonequivalent at pretest.
Clearly I have not told you much about their study or findings. You’ll have to read the paper for that.