We Don't Need the "Fragility Index"

The fragility index is an easy way to criticize a medical study you don’t agree with, but I don’t like it.

This week, I’m taking a break from our regularly scheduled programming to talk about a newish concept percolating in the evidence-based medicine space: something called the “fragility index”. And no, it’s not another frailty measure for elderly patients; it’s about the stability of results in clinical studies.

A Lancet Oncology study found, for example, that of 17 recent randomized trials that resulted in a cancer drug receiving FDA approval, 9 had a fragility index of 2 or less, meaning that if just 2 “events” in the study were converted to non-events, the results would no longer be statistically significant.
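For the curious, here’s a rough sketch of how the index is typically computed for a two-arm trial with a binary outcome: convert non-events to events in the arm with fewer events, one at a time, re-running Fisher’s exact test until the p-value crosses 0.05. The trial numbers below are made up for illustration; they are not from the Lancet Oncology paper.

```python
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """How many non-event -> event switches (in the arm with fewer events)
    does it take for Fisher's exact test to lose significance?
    Returns 0 if the trial wasn't significant to begin with."""
    flips = 0
    _, p = fisher_exact([[events_a, n_a - events_a],
                         [events_b, n_b - events_b]])
    while p < alpha:
        # Convert one non-event to an event in the arm with fewer events.
        if events_a <= events_b:
            events_a += 1
        else:
            events_b += 1
        flips += 1
        _, p = fisher_exact([[events_a, n_a - events_a],
                             [events_b, n_b - events_b]])
    return flips

# Hypothetical (made-up) trial: 20/100 events on drug vs 40/100 on placebo.
print(fragility_index(20, 100, 40, 100))
```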

Typically, the fragility index comes up when people are trying to disparage statistically significant findings in the medical literature. But I want us to think a little bit deeper about this, because frankly I don’t really like this metric. It seems so beholden to our conception of a p-value of 0.05 as this magical thing that defines truth, when a p-value is just a continuous metric like any other.

Let me walk through a quick example to show you what I mean.

Imagine I find a coin on the street – a quarter – and I want to know if it is a “fair” coin.  Who knows, maybe someone has messed with it and I don’t want no adulterated currency jingling in my dungarees.

So… I do an experiment.  I flip the coin 100 times.

Adding up my results, I find the following: I got heads 60 out of the 100 flips. 

Now, if this were a fair coin I’d expect 50 heads out of 100.  So perhaps my dander is up a bit. What is going on at the Delaware mint?

But wait, you say.  Just because a fair coin would have 50 heads on AVERAGE doesn’t mean it HAS to come up with 50 heads.  There’s going to be a range there. In fact, the range looks something like this.

Flip a coin 100 times and you’ll get, on average, 50 heads, but there’s a range. 60 heads is pretty weird! But not THAT weird.
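If you’d rather compute that range than eyeball it, here’s a quick sketch, assuming a fair coin:

```python
from scipy.stats import binom

n, p_fair = 100, 0.5

# Middle 95% of outcomes when a fair coin is flipped 100 times.
low, high = binom.interval(0.95, n, p_fair)
print(f"95% of the time, heads falls between {int(low)} and {int(high)}")

# How often do we see exactly 50, 60, or 70 heads?
for k in (50, 60, 70):
    print(f"P(exactly {k} heads) = {binom.pmf(k, n, p_fair):.4f}")
```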

So how weird are the 60 heads that I saw?

Well, assuming the coin I found on the street was fair, I’d see a result at least as weird as the one I got about 4.5% of the time.

Or, in p-value terms, 0.045.

In other words, my results are statistically significant, by our conventional definition. I will be calling my local numismatist and making a complaint.

But wait, you say. These results are fragile. If just one of those 60 heads had actually come up tails, I’d have 59 heads!

And my p-value would be 0.07.

That changes everything. This is NOT statistically significant. Pitchforks down.

But what has really changed here? Proponents of the fragility index would say that we have taken a positive study, and with the most minor of changes, made it negative.

How frightening that the scientific literature should be so delicate! So ephemeral!

But we’re missing the point. The p-value is not magical. It just provides information. How weird was it that I got 60 heads? Pretty weird – a deviation at least that large happens only about 4.5% of the time. How weird was it that I got 59 heads? Pretty weird! A deviation at least that large happens only about 7% of the time.
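If you want to check those numbers yourself, here’s a sketch using the normal approximation to the binomial, which appears to be where the 4.5% and 7% figures come from (an exact binomial test gives somewhat larger p-values):

```python
import math
from scipy.stats import norm

def two_sided_p(heads, n=100, p_fair=0.5):
    """Two-sided p-value for `heads` out of `n` flips of a fair coin,
    using the normal approximation to the binomial."""
    z = (heads - n * p_fair) / math.sqrt(n * p_fair * (1 - p_fair))
    return 2 * norm.sf(abs(z))

print(f"60 heads: p = {two_sided_p(60):.4f}")  # ~0.045, the 4.5% above
print(f"59 heads: p = {two_sided_p(59):.4f}")  # ~0.07
```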

The problem isn’t that medical studies are fragile, it’s that we are WAY too beholden to the p-value. We need to be willing to reject studies that have a p-value of 0.045 if the hypothesis being tested is unlikely or the methodology is flawed. We need to be able to accept studies with a p-value of 0.07 if the hypothesis being tested is very likely.

Look at it this way: suppose you’re really worried about fragile medical studies. What are the potential solutions?

First, we could lower the p-value threshold for statistical significance – there’s an ongoing debate about that. Of course, there will be “fragile” studies at any threshold. If the p-value threshold were 0.01, people might complain that studies are fragile because small changes in outcomes would push the p-value from 0.009 to 0.011.

Or maybe you think we should just do bigger studies? But realize here that if the effect size is the same, doing a larger study will just lower the p-value. And if you do a really large study and get a barely significant p-value, you’re probably in the realm of a statistically significant finding that doesn’t have much clinical impact.
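To put numbers on that, here’s the same coin calculation at larger sample sizes: hold the observed effect at 60% heads and watch the p-value collapse as the number of flips grows (again using the normal approximation).

```python
import math
from scipy.stats import norm

def two_sided_p(heads, n, p_fair=0.5):
    # Two-sided p-value via the normal approximation to the binomial.
    z = (heads - n * p_fair) / math.sqrt(n * p_fair * (1 - p_fair))
    return 2 * norm.sf(abs(z))

# Same observed effect (60% heads), increasing numbers of flips.
for n in (100, 400, 1000):
    heads = int(0.6 * n)
    print(f"{heads}/{n} heads: p = {two_sided_p(heads, n):.2g}")
```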

We see so-called “fragile” studies because studies are designed with the p-value threshold of 0.05 in mind – because studies are expensive and expose people to potential risk. If you’re spending $50,000 per patient to enroll people in a clinical trial of a new cancer drug, and all you need is a p-value of less than 0.05 to get FDA approval, well, why would you enroll more patients than you need? Complaining afterward that the result is fragile is changing the rules after the match is over.

The real solution is to forget about 0.05 and interpret p-values in the context of the underlying hypothesis of the study.

Take my quarter. Are my 60 out of 100 heads really going to convince you that it’s a biased coin?

That someone actually shaved part of it, or weighted it in a weird way? Probably not – I just found the thing on the street after all.  The hypothesis being tested – that it is a biased coin – was very unlikely. So we should see those 60 heads as a weird fluke. NOT confirmation that the coin is weird. Use the data to update your prior probabilities.
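Here’s what that updating can look like, with numbers I’m making up purely for illustration: say there’s a 1% chance that a random street quarter has been tampered with, and that a tampered quarter comes up heads 60% of the time. Both of those are assumptions, not measurements.

```python
from scipy.stats import binom

heads, n = 60, 100

prior_biased = 0.01       # assumed: 1% of street quarters are tampered with
p_heads_if_biased = 0.6   # assumed: a tampered quarter lands heads 60% of the time

# Likelihood of the observed 60/100 heads under each hypothesis.
like_fair = binom.pmf(heads, n, 0.5)
like_biased = binom.pmf(heads, n, p_heads_if_biased)

# Bayes' rule: posterior probability that the coin is biased.
posterior = (like_biased * prior_biased) / (
    like_biased * prior_biased + like_fair * (1 - prior_biased)
)
print(f"Bayes factor (biased vs fair): {like_biased / like_fair:.1f}")
print(f"Posterior probability of a biased coin: {posterior:.2f}")
```

With those made-up inputs, the posterior probability that my quarter is biased comes out to only around 7%, even after a “statistically significant” 60 heads, because the hypothesis was so unlikely to begin with.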

So if you’re reading a study and someone says, “But wait, if just two people who survived had died, or vice versa, this positive study would be negative,” you should say, “That’s so interesting. I also agree that a p-value threshold of 0.05 is arbitrary and that a study should be interpreted in light of the strength of its underlying hypothesis.” And as your colleague slowly backs away, feel comfortable that the difference between a p-value of 0.049 and 0.051 isn’t much of a difference at all.

 This commentary first appeared on medscape.com.