NZZ Folio 01/06 - Topic: Statistics

Plain Wrong

A lot of researchers use flawed statistical methods. As a result, much of the medical research now appearing in journals is at best unconvincing and at worst just plain wrong.

By Robert Matthews

Good news for highly stressed women: a recent study by researchers in Denmark shows you are 40 per cent less likely than others to develop breast cancer. Now the bad news: according to another recent study, this time by researchers in Sweden, you face double the breast cancer risk.

So which is it? Two studies, both by respected researchers and published in leading medical journals – and each contradicting the other. Such contradictory findings are perplexing, and far from unusual. Hardly a week goes by without some new study whose findings seem to contradict earlier research: overhead power lines and leukaemias, salt intake and hypertension, exercise and reduced heart disease – the evidence just seems to yo-yo this way and that, showing no signs of ever reaching a conclusion. Over the course of just a few weeks in 2002, two top medical journals made headlines with the results of two studies of a supposed link between smoking and breast cancer. The first claimed to have demonstrated a link; the second flatly refuted it.

It’s a similar story with new therapies: one team of researchers hails a breakthrough, then another comes along and knocks it down. In the early 1990s, hormone replacement therapy (HRT) was believed to halve the risk of coronary heart disease among women. By 2002, a major study concluded it had no protective effect at all.

What is going on? Why do so many studies produce contradictory results? It is a question increasingly being asked by scientists as well as perplexed members of the public. And some of the answers now emerging raise grave concern about the reliability of medical research published even in leading journals.

There is growing concern that researchers under pressure to publish results to keep academic positions and funding are deliberately suppressing “unhelpful” findings, focusing instead on findings that boost the chances of publication. But it is also becoming clear that many more are failing to use basic techniques for cutting the risk of reaching misleading conclusions from medical trials.

Most worrying of all, the majority of researchers are continuing to put their trust in statistical methods shown to be fundamentally misleading over 40 years ago. As a result, much of the medical research now appearing in journals is at best unconvincing and at worst just plain wrong.

In common with researchers across academia, medical scientists have come under increasing pressure to “publish or perish” in recent years. Only recently, however, have attempts been made to gauge its impact on the reliability of medical research. And the results to date are disquieting.

All scientists know their chances of publication in leading journals are much higher if they have made some surprising discovery. The concern is that this tempts researchers into trawling through their data until they find something positive, while burying the bad news – thus giving a misleading picture of reality.

To see if such fears were justified, a team led by Dr An-Wen Chan of the Centre for Statistics in Medicine, Oxford, decided to track down the original paperwork for over 100 published papers reporting the outcome of medical trials. The team was looking for signs of “unhelpful” negative findings being omitted from published papers, in order to boost the chances of publication. In over half of the trials examined, the team found major discrepancies between the original aims of the study and those finally reported – suggesting that the researchers had simply trawled through their data looking for anything worth publishing, a practice all too likely to throw up fluke results. But the team also found that harmful effects discovered during trials were often not fully reported, while results on key issues such as pain intensity and survival rates were either watered down or omitted altogether.
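
Why trawling is so likely to throw up flukes can be seen with a little arithmetic. The sketch below assumes the conventional 5 per cent significance threshold and, for simplicity, that each outcome examined is an independent test of an effect that does not really exist; neither assumption comes from the study described above, and the function name is purely illustrative.

```python
def chance_of_fluke(num_outcomes: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_outcomes` independent tests
    comes out "significant" at level `alpha` when no real effect exists.

    Assumption: the tests are independent, which real trial outcomes
    rarely are, so treat the result as a rough guide only.
    """
    # Chance that every single test correctly stays non-significant...
    all_quiet = (1.0 - alpha) ** num_outcomes
    # ...so the chance of at least one false alarm is the complement.
    return 1.0 - all_quiet


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(f"{k:2d} outcomes examined: {chance_of_fluke(k):.0%} chance of a fluke")
```

Under these assumptions, a researcher who examines 20 outcomes has a better than even chance of finding at least one "significant" result by luck alone.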

Reporting their findings in JAMA in 2003, Gross and his colleagues pointed out that such discrepancies are especially worrying, given the rapid growth in industry-backed research, which now accounts for around two-thirds of biomedical R&D in the US - double the level in 1980.

In an attempt to tackle the problem of selective reporting, last September (2005) JAMA and several other leading medical journals introduced a requirement that researchers seeking to publish with these journals must register the clinical trial at the planning stage. The aim is to prevent trials that reach the "wrong" answers from being quietly buried by researchers. Since 2003, researchers reaching such answers can also submit their papers to the Journal of Negative Results.

Despite the pressures on them, most medical researchers remain committed to uncovering the truth about new therapies, whatever it may be. Yet while no-one can question their motives, there is mounting alarm over their methods. Put bluntly, many researchers do not seem to understand the methods they use to make sense of new findings.

My own study of all the papers in a recent volume of the leading journal Nature Medicine found that 20 per cent showed clear signs that the paper’s authors do not understand the statistical methods they are using. A similar conclusion has been reached by researchers at the University of Girona, Spain, who uncovered a host of statistical blunders in papers published in two leading research journals: Nature and the British Medical Journal.

While most of the errors were trivial, a few per cent were judged to be serious enough to undermine the conclusions drawn. Even so, the findings prompted outrage in some quarters, with The Economist condemning them as “sloppy stats which shame science”.

Yet among statisticians, the biggest shock was that anyone should have been surprised by such findings. They have been warning for decades about the dismal level of statistical analysis used even by researchers in leading journals.

Statistical methods play a crucial role in medical research. They are used to gauge the size of clinical trial needed to stand a good chance of showing if a new therapy works, and also to see if the results are convincing. Or at least, they should be. In reality, most researchers simply try to recruit as many people for their study as they can afford, and hope the resulting trial is big enough to detect a real effect. And when the findings are in, they feed the raw data into statistics software packages, and hope at least one result turns out to be “statistically significant”, and thus publishable in a leading journal.

It all sounds straightforward enough, but it has led to a host of entirely spurious claims finding their way into the medical literature. A quick scan of medical journals reveals that clinical trials rarely involve more than several hundred patients. While this seems pretty big, statistical theory shows that such trials can still be far too small to detect useful effects. Yet by simply hoping they have recruited enough patients, medical researchers who fail to get a positive result can fall into the trap of dismissing the therapy as useless – when the real explanation is that the trial lacked the statistical power to detect an effect.
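
To get a feel for how easily a trial can be too small, here is a rough power calculation for a trial comparing the proportion of patients who improve in two equally sized groups. It uses a standard normal approximation; the improvement rates of 30 and 40 per cent are hypothetical figures chosen for illustration, not taken from any trial mentioned here.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control: float, p_treated: float,
                          n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two proportions.

    Assumptions: equal group sizes, normal approximation, and the tiny
    chance of a significant result in the wrong direction is ignored.
    """
    std_normal = NormalDist()
    # Critical value for a two-sided test (about 1.96 when alpha = 0.05).
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two observed proportions.
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treated * (1 - p_treated) / n_per_arm)
    effect = abs(p_treated - p_control)
    # Probability that the observed difference clears the critical value.
    return std_normal.cdf(effect / se - z_crit)


if __name__ == "__main__":
    # Hypothetical therapy lifting the improvement rate from 30% to 40%.
    for n in (50, 100, 200, 400):
        print(f"{n:3d} patients per group: power = "
              f"{power_two_proportions(0.30, 0.40, n):.0%}")
```

With these illustrative figures, a trial of 100 patients per group would detect the benefit only about one time in three, and it takes around 400 per group to push the chance above 80 per cent.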

The risk of such “false negatives” is far from negligible: a study published last August by Prof John Ioannidis of the University of Ioannina, Greece, concluded that around three-quarters of small studies produce misleading results. It is a conclusion with particular relevance for complementary therapies like acupuncture. Researchers studying such therapies often lack the resources of mainstream medicine, so trials often involve fewer than 100 patients – and thus run an especially high risk of failing to detect real benefits.

The idea that small studies are less reliable than large ones should hardly come as a surprise. But even large trials can fall foul of another statistical effect, one which lies at the root of many of those contradictory findings, such as the link between breast cancer and stress. The existence of this effect has been known about for decades, and leading statisticians have repeatedly warned of its impact on the reliability of research – so far with little effect.

Put simply, the statistical methods routinely used by researchers fail to incorporate the key factor affecting the credibility of any finding: its plausibility.

When analysing the results of the clinical trial of, say, a new drug, researchers use computer software to find out if the proportion of patients who improved after receiving the drug is convincingly higher than for those given the alternative. A small difference could be due to mere chance. But if the difference is sufficiently big, it becomes harder to dismiss as just a fluke, and is deemed “statistically significant”.
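
The calculation such software performs can be sketched in a few lines. The example below is one common textbook version, a two-sided z-test for the difference between two proportions; real packages offer many variants, and the patient counts used are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(improved_drug: int, n_drug: int,
                           improved_control: int, n_control: int) -> float:
    """Two-sided p-value for the difference between two improvement rates,
    using the pooled z-test (normal approximation)."""
    rate_drug = improved_drug / n_drug
    rate_control = improved_control / n_control
    # Pooled rate: the best estimate if the drug made no difference at all.
    pooled = (improved_drug + improved_control) / (n_drug + n_control)
    se = sqrt(pooled * (1 - pooled) * (1 / n_drug + 1 / n_control))
    z = (rate_drug - rate_control) / se
    # Chance of a difference at least this large if only luck were at work.
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical trials with 100 patients in each group.
    print(f"55 vs 45 improved: p = {two_proportion_p_value(55, 100, 45, 100):.3f}")
    print(f"60 vs 45 improved: p = {two_proportion_p_value(60, 100, 45, 100):.3f}")
```

In this invented example, a gap of 10 patients falls short of the conventional 0.05 threshold, while a gap of 15 clears it. Nothing in the calculation asks whether the claimed effect was believable in the first place.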

Such results have a much higher chance of being published in leading research journals. Yet statistical significance takes no account of the plausibility of the claim being made. As such, it violates a simple principle of science: extraordinary claims demand extraordinary evidence.

There are ways of taking plausibility into account, and when applied to the findings of clinical trials, the results are often dramatic, with a host of “statistically significant” findings being revealed to be meaningless flukes.
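
One simple way to see the effect of plausibility is Bayes's theorem applied to a "significant" result. This is a minimal sketch and not necessarily the method used in the analyses described in this article; the prior plausibility, the 80 per cent power and the 5 per cent threshold are all assumed inputs, and the function name is hypothetical.

```python
def prob_finding_is_real(prior: float, power: float = 0.8,
                         alpha: float = 0.05) -> float:
    """Probability that a statistically significant finding reflects a
    real effect, given how plausible the effect was beforehand.

    prior: assumed probability, before the trial, that the effect is real
    power: assumed chance the trial detects the effect if it is real
    alpha: chance of a "significant" result when there is no effect
    """
    true_positives = power * prior           # real effects that get detected
    false_positives = alpha * (1 - prior)    # flukes that look significant
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for prior in (0.5, 0.1, 0.01):
        print(f"prior plausibility {prior:>4.0%}: "
              f"{prob_finding_is_real(prior):.0%} chance the finding is real")
```

On these assumptions, a significant result for a claim that started out as a coin toss is very probably real, but the same result for a claim given only a 1 per cent chance beforehand is still far more likely to be a fluke than a discovery.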

Evidence of the dangers of putting complete trust in statistical significance alone has been available for years, in the form of bizarre claims published in serious medical journals. A classic example appeared in the British Medical Journal in 2001, which carried a paper with apparently compelling evidence for the effectiveness of prayer. The results revealed statistically significantly higher rates of recovery among patients who were prayed for – even if the prayers were said years after the patients had left hospital. The findings suggested that prayers can travel backwards in time, and prompted calls for the notions of space and time to be revised. In reality, they simply showed the dangers of failing to include plausibility when assessing new findings.

In most cases, however, these dangers are far less clear. For example, in recent years there have been many reports of the discovery of genes apparently associated with illnesses such as cancer. These headline-grabbing claims are based on “statistically significant” links between the presence of these genes and the risk of contracting the illness. Yet all too often, the links simply vanish when other researchers try to confirm their existence.

In March 2004, a team from the US National Cancer Institute, Bethesda, pinned the blame on over-reliance on statistical significance in assessing the reality of such links. That over-reliance may also explain another shocking finding, published in July 2005, that around one-third of widely cited trial results aren’t confirmed by later research.

Once plausibility is taken into account, however, many otherwise perplexing findings start to make sense. Take those two contradictory studies of a link between breast cancer and stress. The Danish study found statistically significant evidence that women regularly exposed to stress face a lower risk of breast cancer. In contrast, the Swedish study found statistically significant evidence for a substantially higher cancer risk.

Analysis of the findings reveals, however, that the Danish study is substantially less compelling than the Swedish study. As for plausibility, in 2000 a team from Harvard Medical School published the findings from a huge study of almost 27,000 women. They showed no evidence for any link between stress and breast cancer risk. So large a study carries huge evidential weight – and when combined with the two new studies, it reveals them both to be unconvincing.

So the answer to which of the two studies we should believe is: neither of them. They may be statistically significant, but they are simply not credible.

There are signs that some leading medical journals are starting to recognise the dangers posed by the flawed statistical methods used by researchers. Yet the failure of the journals to take action so far is nothing short of a major scientific scandal. Until they do, the best advice to anyone wondering whether to take seriously some new but implausible finding is – don’t.


Robert Matthews is Visiting Reader in Science at Aston University, Birmingham.


