The bumpy road to evidence: why many research findings are lost in translation

Thomas F. Lüscher
DOI: http://dx.doi.org/10.1093/eurheartj/eht396, pp. 3329-3335. First published online: 3 October 2013

We have come a long way

When, more than 2500 years ago, Thales of Miletus (624–547 BC) claimed that nature was ruled by laws and not by gods, he changed the world.1 Such a concept allowed for the discovery and mathematical proof of impersonal causation of what, until then, had been a mystery ruled by unpredictable gods. Indeed, ever since then the understanding of nature (today we would call it the natural sciences) has become a major activity of mankind. With this strategy, we have come a long way. Initially, it allowed us to use the position of stars and planetary motion for navigation and the discovery of new continents; next, it set the basis for the development of engines and technologies; and finally, it led to the discovery of the human body. Eventually, this allowed for the rise of modern medicine, among many other achievements.

As a consequence, theology and philosophy, the dominant disciplines of ancient times, were increasingly challenged and, to a great extent, replaced by scientifically based knowledge about the world, about nature, mankind, and disease. Such knowledge produced practically useful consequences, such as pumps, steam, and later, petrol-driven engines, trains, cars, aeroplanes, and rockets, allowing us to fly to New York or to the moon. In life sciences, it brought about hygiene, anaesthesia, and in turn, aseptic surgery, antibiotics, vaccination, resuscitation, cardiac surgery, and interventional cardiology;2,3 an impressive and unanticipated achievement indeed.

What is evidence?

What is evidence? At the beginning of any discovery stands an individual with curiosity; an initial observation is the first step in the process. When Columbus sat at the beach, as the famous movie by Ridley Scott leads us to believe, watching ships leaving the harbour with his son Diego, he said to him, ‘Look!’

‘Half of the ship has gone’, replied Diego.

‘And now?’

‘It's gone.’

‘What does it tell you?’

Diego was not sure.

‘It is round’, replied his father, ‘like this’, presenting an orange he was about to eat.

Thus, the interpretation of a finding is as important as the observation itself. But that is not enough; it needs proof, i.e. the journey to the West and the persistence to pursue it. Columbus did it all and discovered a new continent, now known as the Americas.

What is a scientific fact? Above all, it should be provable, i.e. confirmed or falsified by one's own data and those of other scientists. And indeed, most scientists spend their day providing data to confirm their conclusion. Karl Popper (1902–94) taught us that this is not it.4,5 In fact, observations have to survive the proof of time; the scientific process develops along conjectures and refutations. The statement that all swans are white was falsified with the discovery of black swans in Australia.6,7 Another important aspect of discovery is the fact that it always evolves within a paradigm,8 i.e. in medicine, a basic concept of what the major causes and mechanisms of a disease are. Importantly for what is discussed below, by sheer probability new claims are more likely to be falsified; thus, the road to evidence is, by design, a bumpy one.

And indeed, innovations are difficult to predict. Simon Newcomb (1835–1909), a physicist of the 19th century, said: ‘Flying with machines that are heavier than air is without practical importance and senseless, if not completely impossible.’ Yet today, about 3 billion passengers travel by air each year. Around the same time, the famous surgeon Theodor Billroth said: ‘That surgeon who ever would attempt to sew a wound of the heart can be sure of losing any respect of his colleagues for ever.’ Yet today, more than a million cardiac operations are performed every year, not to mention catheter interventions.

Particularly in an applied science such as medicine, the practical consequences of a theory are as important as the fact that it has stood the test of time. Indeed, Anitchkow's seminal observation in rabbits fed a high-fat diet led to the concept that fat or its components will lead to atherosclerosis and its complications, such as myocardial infarction and stroke.9 The Framingham study prospectively studied this relationship and found an association also in humans.10 However, only the discovery of 3-hydroxy-3-methyl-glutaryl-coenzyme A inhibitors, the statins, enabled studies that eventually proved a causal relationship.11 Ever since then, the difference between association and causality has been stressed.12

In medicine, different levels of evidence have been distinguished (Figure 1), as follows: clinical intuition, unsystematic clinical experience, case reports and patient cohorts, pathophysiological concepts (the basis of most paradigms in medicine),8 small trials and meta-analyses or systematic reviews thereof, and finally, large randomized trials.

Figure 1

Levels of evidence in clinical medicine.

James Lind and scurvy

Scurvy is a disease which leads to open sores and loss of movement, a condition which, until the 19th century, was particularly prevalent among sailors and soldiers. The ship's surgeon of the British Royal Navy, James Lind, was the first to find a cure for the disease. While at sea in May 1747, Lind treated some of his sailors who were suffering from scurvy with oranges and lemons, while others received cider, vinegar, sulfuric acid or seawater, along with their usual food. Historically, this has to be considered the first randomized (although not blinded), controlled trial. In spite of the very small number of patients involved, the results conclusively showed that citrus fruits prevented the disease. Lind published his observation in 1753 in his ‘Treatise on the Scurvy’.13

Lind's approach was not adopted by medical researchers until the 20th century, when Austin Bradford Hill (1897–1991), an English epidemiologist and statistician, set out to test the effects of the recently discovered streptomycin in patients with tuberculosis. At this point, it had been recognized that biases of both the patient and the treating physician may influence the perception of the effectiveness of medical interventions. Tuberculosis was an endemic disease commonly treated by bed rest on the ‘Magic Mountain’14 and in other institutions, mostly at high altitude. Hill wanted to prove the advantages of streptomycin compared with that standard treatment and developed the principle of randomization to exclude as many biases as possible. The results were a breakthrough, both for his scientific approach and for the treatment of tuberculosis. The trial lasted 6 months and involved 52 controls treated by bed rest and 55 patients receiving 2 g of streptomycin four times daily. As he wrote in his seminal article published in the British Medical Journal in 1948:15 ‘The difference between the two series is statistically significant; the probability of it occurring by chance is less than one in a hundred.’ Ever since then, randomization (today also with blinding of doctors and patients, if possible) and statistical analysis have become the cornerstones of clinical research.
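To make the notion of ‘the probability of it occurring by chance’ concrete, the following is a minimal sketch of how a difference between two small trial arms can be tested with Fisher's exact test. The arm sizes (55 and 52) are those stated above; the death counts used here (4 and 14) are not given in this article and are assumed purely for illustration, as is the function name. It is not a reconstruction of Hill's own analysis.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are trial arms, columns are outcome (event, no event).
    """
    row1, row2 = a + b, c + d
    col1 = a + c  # total number of events across both arms
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in arm 1, all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more probable than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# ASSUMED counts for illustration only (not taken from this article):
# 4 deaths among 55 streptomycin patients, 14 deaths among 52 controls.
p_value = fisher_exact_two_sided(4, 51, 14, 38)
print(f"two-sided p = {p_value:.4f}")
```

With such small groups an exact test is a reasonable choice, because it does not rely on large-sample approximations.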

From bench to bedside

It is obvious that any clinical trial rests on the results of often decades of basic research. Indeed, there are several levels of science involving genes, proteins, organelles and cells, tissues, organs, and finally, patients and populations (Figure 2). Evidence has to evolve over many steps in order to graduate from bench to bedside. This process is not unidirectional; indeed, a seminal observation may, for instance, start in a tissue, such as a blood vessel of an animal, where a novel phenomenon, e.g. endothelium-dependent relaxation, is observed,16 then may progress to organs and organisms,17 only to move down to the molecular level, where the responsible protein, i.e. endothelial nitric oxide synthase,18 is discovered. Clinical trials,19,20 Mendelian randomization studies and others, may follow later.

Figure 2

Levels of research from bench to bedside.

In drug discovery, the process is more unidirectional, starting with basic research in cells, tissues, and animals and then moving to phase I studies focusing on safety, pharmacokinetics, and haemodynamics. The dose–effect relationship is then investigated in phase II, while clinical endpoints are the focus of phase III studies (Figure 3). Dose is a difficult issue, particularly in the absence of reliable surrogate endpoints. Indeed, dosages used in vitro or in animal models are often several orders of magnitude higher than those effective and tolerated in humans.

Figure 3

Levels of drug-development programmes.

Lost in translation

When moving through all these levels of research and development, concepts, drugs, and devices may be lost in translation. Why are things lost in translation? First, the hypothesis may be wrong, then the animal models used may not reflect human disease, the data may not be solid or may even be fraudulent,21 endpoints may not have been well chosen, unrecognized off-target effects may suddenly appear that outweigh the benefits, and finally, there are miscellaneous reasons.

Let us look at a few examples. In the 1980s and 1990s, restenosis after angioplasty or stenting was a real issue. At the same time, the vascular effects of angiotensin II were discovered, and researchers at Roche published an article in Science22 demonstrating that the angiotensin-converting enzyme (ACE) inhibitor cilazapril prevented intimal hyperplasia induced by vascular injury in the carotid artery of the rat. Swiftly, several clinical trials, MERCATOR23 and MARCATOR,24 were set up to prove these findings in patients undergoing angioplasty. As it turned out, cilazapril at either a high or a low dose did not prevent restenosis or improve outcomes. Obviously, the rat carotid artery was an inappropriate model, because most interventions that worked in it later proved ineffective at the clinical level. Once a new molecule, rapamycin, proved effective in the pig model of stent restenosis, the results could be confirmed at the clinical level.

Things can get worse; for example, TGN1412 (or CD28-SuperMAB) was the working name of a supposedly immunomodulatory drug originally intended for the treatment of B cell chronic lymphocytic leukaemia or rheumatoid arthritis.25 In March 2006, six volunteers entered a phase I trial at Northwick Park Hospital as the first humans to receive the drug. Unexpectedly, the drug caused catastrophic systemic organ failure due to a massive cytokine storm in the volunteers enrolled, despite it being administered at a dose 500 times lower than that found safe in animals. Obviously, the problems resulted from biological actions in humans not foreseen from the experiments in rats and mice.

Although mice share about 80% of their working DNA with humans,26 they are obviously at best a model; they are not humans. They may differ substantially in some respects, while they may be similar in others. Crossing the species border is always a risk in translational research.

Off-target effects are another problem. Often, novel therapeutic targets are identified in very distinct experimental settings. For instance, cyclo-oxygenase-2 inhibitors were developed to reduce bleeding and increases in blood pressure associated with the use of non-steroidal anti-inflammatory drugs. Although these aims were partly achieved, rofecoxib was associated with an increased incidence of myocardial infarction, possibly due to prothrombotic effects,27 while celecoxib is still being tested in the large PRECISION trial.28 With an ever older population being treated in cardiovascular medicine, along with its co-morbidities and the drugs used to treat them, cardiovascular safety becomes an increasing issue that needs to be considered in any drug-development programme.

Another example is provided by drugs that raise high-density lipoprotein cholesterol (HDL-C) (Lüscher et al. in press). Epidemiologically, it appeared obvious that raising HDL-C would reduce cardiovascular events in patients at risk. The cholesteryl ester transfer protein inhibitors therefore raised big hopes;29 and indeed, the first in its class, torcetrapib, more than doubled HDL-C levels, but unexpectedly increased mortality.30 Later, basic research showed that torcetrapib increased aldosterone release from the adrenal glands and endothelin release from the vasculature, while suppressing endothelial nitric oxide synthase expression and endothelial function.31 These effects were considered off-target, because other molecules of the same class, such as dalcetrapib, did not share these properties. In phase II studies, such as Dal-Vessel, dalcetrapib increased HDL-C by 30% in patients with hyperlipidaemia and low HDL-C, while leaving blood pressure unchanged.32 However, dalcetrapib did not improve endothelial dysfunction or suppress markers of inflammation. In line with this, dalcetrapib was ineffective in patients after acute coronary syndromes in the large Dal-Outcomes trial.33 Thus, it is likely that HDL-C dysfunction in patients with acute coronary syndromes may explain the neutral results, in spite of a marked rise in the lipoprotein.34 Indeed, another HDL-C-raising drug, i.e. niacin, proved equally ineffective in two large outcome trials.35,36 Thus, HDL-C may be a marker rather than a therapeutic target, unless a drug also improves HDL-C dysfunction and protein composition37 in patients with acute coronary syndromes or coronary artery disease, or the HDL-C paradigm is wrong altogether.

Therefore, surrogate endpoints are an important need (Table 1).38,39,40,41,42 While blood pressure43 and low-density lipoprotein cholesterol are accepted surrogate endpoints predicting risk, and when lowered pharmacologically lead to a reduced risk, the record of other surrogates is less convincing.

Table 1

Surrogate endpoints and their validity to predict major cardiovascular outcomes in different areas of cardiology

Disease                   Surrogate endpoint (changes in)                                                     Validity
Hypertension              Blood pressure43                                                                    ++++
                          Carotid intima–media thickness43                                                    ++
                          Microalbuminuria (a)                                                                ++/?
                          Flow-mediated dilatation38 (b)                                                      ++
                          Left ventricular hypertrophy (ECG, echocardiography, magnetic resonance imaging)43  ++
Lipids                    Low-density lipoprotein cholesterol42                                               +++
                          High-density lipoprotein cholesterol
                          Carotid magnetic resonance imaging (b)                                              ++
                          Intravascular ultrasound40                                                          ++
                          Coronary computed tomography39                                                      ?
                          Optical coherence tomography41                                                      ?
                          Haemoglobin A1c                                                                     ++
Coronary artery disease   Quantitative coronary angiography                                                   ++
                          Intravascular ultrasound40                                                          ++
                          Coronary computed tomography39                                                      ?
                          Optical coherence tomography41                                                      ?
Acute coronary syndromes  Troponins                                                                           ++
                          Brain natriuretic peptide                                                           ++
                          Infarct size (late enhancement in magnetic resonance imaging)                       ?
Heart failure             Exercise capacity46,47
                          Haemodynamics (cardiac output etc.)47
                          Ejection fraction
                          Remodelling (left ventricular end-systolic volume)49                                ++
                          Brain natriuretic peptide47,48                                                      ++
Sudden death              Premature ventricular beats44
                          Late potentials
                          Non-sustained ventricular tachycardia on Holter

  • The symbols ‘+’ to ‘++++’ indicate the degree of predictability of a change in each parameter for a change in major cardiovascular events; the symbol ‘?’ indicates currently unknown.

  • (a) Although considered predictive by many,42 in the ROADMAP trial, microalbuminuria changed favourably in spite of a neutral to negative effect on mortality (N Engl J Med 2011;364:907–917).

  • (b) Flow-mediated dilatation was predictive in many situations except with estrogens and calcium antagonists, but it recently predicted the failure of darusentan.31 Its reproducibility depends on the experience of the core laboratory.38

  • (c) In the Dal-Plaque study (Lancet 2011;378:1547–1559), carotid magnetic resonance imaging changed slightly, but favourably, in response to dalcetrapib, while the large outcome trial was neutral.33

  • (d) Coronary computed tomography is highly predictive of future cardiovascular events, but its use in therapeutic trials has not yet been studied properly.39

For instance, in the 1970s, ventricular ectopic beats were considered ideal surrogates for patients at risk of sudden cardiac death, until the CAST trial44 and the SWORD trial,45 using antiarrhythmic drugs such as encainide, flecainide, and moricizine, or d-sotalol, respectively, found increased rather than decreased mortality, in spite of effective suppression of ventricular ectopic beats.

Likewise, in heart failure, exercise performance, haemodynamic improvements, and left ventricular ejection fraction have been used with disappointing results. Indeed, many drugs, such as inotropes or phosphodiesterase inhibitors, improved haemodynamics and exercise performance, but increased mortality.46 Thus, the paradigm of stimulating the heart was falsified, while the concept of unloading it proved effective.47 It appears that only a reduction of left ventricular remodelling and of brain natriuretic peptide48 have some predictive value for clinical outcome in this patient population. Indeed, the lack of remodelling with endothelin antagonists49 predicted negative outcome studies.50

Why many research findings prove to be false

Thus, as predicted by Karl Popper,4 most research findings are eventually falsified at different levels of research. While some hypotheses and paradigms survive the entire process, others have to be dismissed at initial or later stages. The reasons are multiple, but may include the following:51 (i) inappropriate experimental models (i.e. cellular systems, animals); (ii) irreproducible findings (i.e. overstated, selected or fraudulent data, large number of tested relationships, ‘hot’ scientific field); (iii) study design (i.e. comparator groups, small sample size, wrong study population, extended flexibility in definitions of outcomes); (iv) small effect size; and (v) overwhelming intellectual or financial interests.52,53
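Several of the reasons listed above (a large number of tested relationships, small sample size, small effect size) can be made concrete with a simple positive-predictive-value calculation. The sketch below is an illustration under stated assumptions rather than a method taken from this article: `prior_odds` (the ratio of true to false relationships among those tested in a field), `power`, and `alpha` are hypothetical inputs, and bias is ignored entirely.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Probability that a 'statistically significant' finding is actually true.

    prior_odds: assumed ratio of true to false hypotheses among those tested
    power:      probability of detecting a true effect (1 - type II error)
    alpha:      type I error rate (false-positive rate for a null effect)
    """
    true_positives = power * prior_odds  # per false hypothesis tested
    false_positives = alpha              # per false hypothesis tested
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios, chosen only to show the direction of the effect.
scenarios = {
    "well-powered trial, plausible hypothesis": (1.0, 0.80),
    "small study, plausible hypothesis": (1.0, 0.20),
    "exploratory screen, many relationships tested": (0.01, 0.20),
}
for label, (odds, power) in scenarios.items():
    ppv = positive_predictive_value(odds, power)
    print(f"{label}: PPV = {ppv:.2f}")
```

Under these assumptions, lowering either the prior odds (many relationships tested, ‘hot’ field) or the power (small samples, small effects) lowers the chance that a positive result reflects a true relationship.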

As outlined above, many findings in cellular systems and animal models cannot be reproduced at the clinical level, because they are limited to specific experimental conditions, due to species differences or inappropriate modelling of human disease. Currently, for convenience, costs, and legal as well as regulatory constraints, mainly rodents are used, although pigs and primates are closer to humans in many respects. Biological systems in animal models should be more carefully evaluated regarding their similarity to human biology. Possibly, humanized mice may be helpful.54

Furthermore, not all research findings are reproducible, because they may require very specific experimental settings, are seen only in certain, but not other cell lines or animal strains, or have been overstated due to the enthusiasm of the investigators. Not uncommonly, parts of the results are not presented in the published manuscript in order to pass the rigid peer review process, or certain experimental data are even suppressed.55 Of note, an increasing number of manuscripts have had to be retracted after publication, particularly in high-impact journals.56

In response to that unfortunate trend, C. Glenn Begley recently published the six red flags to test scientific findings.57 First, were experiments performed blinded? Second, were basic experiments appropriately repeated? Third, were all results presented? Fourth, were there positive and negative controls? Fifth, were reagents validated? And sixth, were statistical tests appropriate? It is obvious that bench experiments cannot be or are rarely performed in a blinded manner. Amazingly, suppression of data is common practice, as acknowledged by almost a third of the participating scientists in an anonymous survey.55 Obviously, there may be good scientific reasons to do so, but then it should be clearly stated in the Methods section. That reagents have to be validated has recently been stressed by a study showing that dimethyl sulfoxide, a commonly used solvent, has profound biological effects.58 Statistics of all seriously considered papers are currently checked by specialized editors in all high-impact journals, including the European Heart Journal.59,60 Finally, in ‘hot’ scientific fields there is a clear danger of publishing too quickly and too enthusiastically, as again stressed by fraud scandals affecting stem cell research.61

At the clinical level, the study design is particularly important. In general, non-randomized and smaller randomized studies are more commonly refuted by later research.51 For instance, an initial case–control study involving 1334 patients and controls suggested that an ACE polymorphism was associated with an increased incidence of myocardial infarction,62 an association that became weaker and eventually disappeared in larger subsequent studies involving more than 10 000 individuals.63 Registry data, even when analysed using modern statistics, such as propensity analysis,64 are less reliable, although they reflect current practice. For instance, the Nurses' Health Study suggested that hormone replacement therapy was protective in post-menopausal women,65 a finding not confirmed in large randomized outcome trials.66 Most probably, hormone use was a reflection of the health consciousness of the participants and not a causal factor.12 Furthermore, post-marketing registries of novel compounds are prone to over-reporting, thereby providing a distorted estimate compared with established treatments.67

Even randomized studies may be refuted over time, particularly the smaller ones. For instance, the QUIET trial,68 involving 1750 patients with cardiovascular disease, found no benefit of ACE inhibition, while the HOPE trial,69 enrolling 9297 patients, was positive. In clinical papers, appropriate power calculation is particularly crucial and increasingly difficult as event rates have dropped continuously in the last decades due to the increasing use of evidence-based therapies. Thus, larger and larger patient populations are required, particularly with non-inferiority designs.70 Therefore, for ethical and financial reasons, reliable surrogate endpoints with a high predictive value for major cardiovascular events would be crucial.
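Why falling event rates demand ever larger trials can be illustrated with a standard sample-size approximation for comparing two proportions. The sketch below is a simplified superiority calculation under assumed inputs: the control event rates, the 20% relative risk reduction, the two-sided alpha of 0.05, and the 80% power are all hypothetical values chosen for illustration, and drop-outs, interim analyses, and non-inferiority margins are ignored.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(control_rate, relative_risk_reduction, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to compare two event proportions.

    Uses the normal approximation for a two-sided test of two independent
    proportions with equal group sizes.
    """
    p1 = control_rate
    p2 = control_rate * (1 - relative_risk_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical control-arm event rates, all with the same assumed
# 20% relative risk reduction.
for rate in (0.20, 0.10, 0.05):
    n = patients_per_arm(rate, 0.20)
    print(f"control event rate {rate:.0%}: about {n} patients per arm")
```

For a fixed relative treatment effect, the required number of patients rises steeply as the control event rate falls, which is the difficulty described above.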

A major drawback of the results of clinical trials, when eventually translated into clinical practice, is the fact that only a minority of qualifying patients are enrolled and that those who are enrolled differ from non-participants. Study patients have different baseline characteristics with regard to age, co-morbidities, and drug treatments, among other factors, and accordingly, have a lower mortality and lower event rate than non-participants.71–73 Finally, depending on inclusion and exclusion criteria, as well as the outcomes definitions used, the results of different trials may be difficult to compare. Thus, attempts have been made to harmonize major outcomes, as well as definitions of bleeding.74,75 Such attempts are crucial for comparison and for the further evaluation of trial results in all-comers registries.

Finally, it has been suggested that conflicts of interest affect results. Conflicts may be intellectual (mainly in basic and pathophysiological research), professional (mainly in device- and equipment-based research), and/or financial in nature. Particularly in clinical trials, the design may already be influenced by financial considerations (i.e. comparator, dose of comparator, patient population etc.).

The role of scientific journals

It is the aim of the peer review process common to all respected scientific journals to assess research findings critically with regard to their validity, importance, and novelty.76,77 Great care has to be taken in evaluating the design, methodology, and data analysis of a given manuscript in order to assure that the data are valid and eventually reproducible. Although less than perfect, the peer review process, particularly when involving three reviewers and knowledgeable editors, may pick up flaws and provide advice on how to improve the manuscript. Nevertheless, the process cannot completely avoid the possibility that some of the published papers are irreproducible or even have to be retracted.21,56 Increasingly, journal editorial offices receive allegations from other authors or whistleblowers on the validity or reproducibility of findings. To address this issue, the ESC Journal Family has initiated an independent Ethics Board, where such allegations will be handled.78

There are conflicts of interest for editors also that may endanger a proper peer review and selection of manuscripts. In particular, papers reporting novel data of ‘hot’ areas may be accepted with lesser stringency. For instance, as with gene therapy research in the 1990s, stem cell research currently attracts a lot of attention; hence, even studies with minimal patient numbers are accepted.79 Furthermore, pressures from industry, either open or under cover, may affect editorial decisions.80 Thus, editors must be aware of these potential biases.

Thus, in summary, the road to evidence is long and winding indeed. As the evidence base of clinical medicine has grown, the process has become even longer and bumpier; as event rates have dropped and the most obvious facts have been discovered, it has become increasingly difficult to demonstrate incremental novelty and/or benefit beyond what has already been achieved. This may explain the increasing number of neutral trials in today's cardiovascular research. However, the lessons learned from the past may be helpful for discovery programmes of the future.


  • The opinions expressed in this article are not necessarily those of the Editors of the European Heart Journal or of the European Society of Cardiology.