- Received September 24, 2015
- Revision received October 25, 2015
- Accepted October 25, 2015
- Published online December 29, 2015.
- ∗Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, United Kingdom
- †Columbia University Medical Center, New York-Presbyterian Hospital, and the Cardiovascular Research Foundation, New York, New York
- ∗Reprint requests and correspondence: Prof. Stuart J. Pocock, Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, London, WC1E 7HT, United Kingdom.
As a sequel to last week’s paper on the fundamentals of clinical trial design, this paper tackles related controversial issues: noninferiority trials, the value of factorial designs, the importance and challenges of strategy trials, Data Monitoring Committees (including when to stop a trial early), and the role of adaptive designs. All topics are illustrated by relevant examples from cardiology trials.
- adaptive designs
- Data Monitoring Committees
- factorial designs
- noninferiority trials
- randomized controlled trials as topic
- statistical stopping guidelines
- strategy trials
Randomized controlled trials are the cornerstone of clinical guidelines informing best therapeutic practices; however, their design and interpretation may be complex and nuanced. This review explores challenging issues that may arise and builds on the fundamentals of trial design covered in last week’s paper.
Specifically, we offer guidance on how to design and interpret noninferiority trials where the goal is to demonstrate that the efficacy of a new treatment is as good as that achieved with a standard treatment.
Factorial trials, where 2 (or more) therapeutic issues are simultaneously evaluated in the same study, present an interesting opportunity that should be considered more often in cardiology research.
Trials that compare substantially different alternative treatment strategies can be of great value in enhancing good patient management, and we present guidance on the topic to stimulate greater interest in overcoming the difficulties in undertaking such pragmatic studies.
All major cardiology trials have both ethical and practical needs for data monitoring of the accumulating evidence over time. We provide insights into how Data Monitoring Committees (DMCs) should function, offering statistical guidelines and practical decision-making considerations as to when to stop a trial early.
Finally, there is a growing interest in adaptive designs, but few instances of their implementation in cardiology trials. We focus on adaptive sample size re-estimation and enrichment strategies, with guidance on when and how they may be used.
All of these issues are illustrated by experiences from actual cardiology trials, demonstrating the real-world implications of trial design decisions.
Increasingly, major trials are conducted to see if the efficacy of a new treatment is as good as a standard treatment (1–3). The new treatment usually has some other advantage (e.g., fewer side effects, ease of administration, lower cost), making it worthwhile to demonstrate noninferiority in respect to efficacy.
The standard approach to designing a noninferiority trial is to pre-define a noninferiority margin, commonly called delta, for the primary endpoint. This is the smallest treatment difference, which, if true, would mean that the new treatment is declared inferior. This is on the basis of the belief that any difference smaller than this would constitute clinically accepted grounds of “therapeutic interchangeability” (4). The trial’s conclusions then depend on where the 95% confidence interval (CI) for the treatment difference ends up in relation to this margin. If the upper bound of the 2-sided 95% CI is less than delta, one can claim evidence that the new treatment is noninferior.
For instance, the ACUITY (Acute Catheterization and Urgent Intervention Triage Strategy) trial compared bivalirudin with the standard treatment of heparin plus a glycoprotein IIb/IIIa inhibitor in patients with acute coronary syndrome (ACS) for 30-day composite ischemia (death, myocardial infarction [MI], or revascularization) (5). The noninferiority margin was set at a relative risk of 1.25. The trial’s findings revealed composite ischemia rates of 7.8% and 7.3% in the bivalirudin and control groups, respectively, with relative risk: 1.08; 95% CI: 0.93 to 1.24. Because the upper bound of the CI of 1.24 was less than the pre-declared delta of 1.25, one can conclude that there is evidence of noninferiority. The reason this matters is that bivalirudin also had a markedly lower risk of major bleeding, an important consideration when choosing between antithrombin therapies.
A common misunderstanding is that lack of a statistically significant difference between 2 therapies implies that they are equivalent. For instance, the INSIGHT (Intervention as a Goal in Hypertension Treatment) trial compared nifedipine with co-amilozide in hypertension. The authors concluded that the treatments were “equally effective in preventing cardiovascular complications,” on the basis of a p value of 0.35 for the primary composite endpoint of cardiovascular (CV) death, MI, heart failure, or stroke (6). However, the observed relative risk of 1.10 had a 95% CI of 0.91 to 1.34. This interval is compatible with up to a 34% excess risk on nifedipine, making it unwise to conclude that nifedipine is as good as (i.e., noninferior to) co-amilozide.
Figure 1 shows a conceptual plot of how to interpret the results of noninferiority trials. Scenario C (noninferior) indicates what happened in the ACUITY trial. If we suppose that the INSIGHT trial had the same delta, 1.25, then it would have fallen under scenario F (inconclusive). Had more patients been enrolled, the 95% CI would have narrowed, and noninferiority might then have been declared.
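The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than a full reproduction of Figure 1: the function name and the four labels are assumptions chosen here, and the function deliberately collapses some finer distinctions (e.g., a result that is noninferior yet statistically worse than control).

```python
def classify_noninferiority(ci_lower, ci_upper, margin, null_value=1.0):
    """Classify a result from the 2-sided 95% CI of the treatment effect.

    Larger values mean the new treatment does worse. Use null_value=1.0 for
    ratio measures (relative risk, hazard ratio) and 0.0 for differences.
    The labels are illustrative, not the scenario letters of Figure 1.
    """
    if ci_upper < null_value:
        return "superior"
    if ci_upper < margin:
        return "noninferior"
    if ci_lower > margin:
        return "inferior"
    return "inconclusive"


# ACUITY: relative risk 1.08, 95% CI 0.93 to 1.24, margin 1.25
print(classify_noninferiority(0.93, 1.24, 1.25))

# INSIGHT, supposing the same margin of 1.25 had been declared
print(classify_noninferiority(0.91, 1.34, 1.25))
```

With the confidence limits quoted in the text, the first call returns "noninferior" and the second "inconclusive".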
Sometimes, the treatment effect (and its delta) is expressed as a difference in percentages, rather than as a relative risk or hazard ratio (the argument being that absolute differences are more clinically relevant than relative risks). For instance, the OPTIMIZE (Optimized Duration of Clopidogrel Therapy Following Treatment With the Zotarolimus-Eluting Stent in Real-World Clinical Practice) trial compared a 3-month versus a 12-month duration of dual antiplatelet therapy after implantation of a zotarolimus-eluting stent (7). For the composite primary endpoint of net adverse clinical events (death, MI, stroke, or major bleed) at 1 year, a 2.7% difference was set as the noninferiority margin. The observed difference was +0.2%, with a 95% CI of −1.5% to +1.9%. Because this excludes the margin of +2.7%, noninferiority of the 3-month duration of treatment was claimed.
This example raises a few issues. When the noninferiority margin is a difference in percentages, it becomes easier (perhaps too easy) to achieve noninferiority if the overall event rate is lower than expected. The OPTIMIZE trial had an anticipated 9% event rate in the control arm, but the observed event rate was 6%. This made the 2.7% margin equivalent to a relative risk margin of 1.45, which is undesirably large. Conversely, if the overall event rate is greater than expected, it may become unreasonably difficult to achieve noninferiority. The opposite considerations of anticipated versus observed event rates apply if a relative risk is chosen for the margin.
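The arithmetic behind the 1.45 figure is simple enough to check directly. The sketch below converts an absolute margin into the relative risk margin it implies at a given control event rate; the function name is an assumption made for illustration.

```python
def implied_relative_margin(control_rate, absolute_margin):
    """Relative risk margin implied by an absolute margin at a given control event rate."""
    return (control_rate + absolute_margin) / control_rate


# OPTIMIZE: 2.7% absolute margin at the anticipated and the observed control rates
anticipated = implied_relative_margin(0.09, 0.027)
observed = implied_relative_margin(0.06, 0.027)
print(f"anticipated 9% rate: relative margin {anticipated:.2f}")
print(f"observed 6% rate:    relative margin {observed:.2f}")
```

At the anticipated 9% event rate the same 2.7% margin would have corresponded to a relative risk of 1.30, which shows how much the lower observed rate loosened the criterion.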
Also, the endpoint chosen in the OPTIMIZE trial was not of optimal relevance. The true issue in considering a shorter period of dual antiplatelet treatment concerns the balance between the increased risks of stent thrombosis and MI against the reduced risk of major bleeding. To force these diverse endpoints into a single composite would bias results toward the null. A preferable approach is to pre-specify and study separately-powered efficacy and safety endpoints, typically 1 for superiority and 1 for noninferiority. However, a very large sample size may be required to adequately power both the efficacy and safety endpoints.
A composite net adverse clinical events endpoint, consisting of combined safety and efficacy endpoints, has been used in some trials, reflecting the recognition that both types of endpoints (e.g., major bleeding and stent thrombosis) are deleterious and strongly associated with subsequent mortality. However, interpretation of such a combined safety and efficacy endpoint may be challenging, especially if the different components do not have similar effects on patients’ well-being or survival. Moreover, because safety and efficacy endpoints often move in different directions (e.g., in response to more potent antithrombotic therapies), their combination in a composite endpoint may mask differences between therapies, making careful examination of each component measure essential.
A key question is the choice of noninferiority margin, which has implications for the required trial size. Power calculations for noninferiority trials (not presented here) indicate that trial size is inversely proportional to the square of the margin delta. For instance, had ACUITY chosen a 10% increase, rather than a 25% increase (i.e., relative risk 1.1, rather than 1.25), more than 6× as many patients would have been required for the same power (i.e., >50,000 in total). Thus, the choice of margin requires a realistic balancing of scientific goals with an achievable sample size.
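The inverse-square relationship can be made concrete with a short calculation. The text does not state on which scale delta is measured, so the sketch below shows two plausible conventions as assumptions: the excess relative risk (margin minus 1) and the log relative risk. The function name is hypothetical.

```python
import math


def size_ratio(current_margin, tighter_margin, log_scale=False):
    """Factor by which trial size grows when the margin is tightened,
    assuming size is proportional to 1 / delta**2."""
    if log_scale:
        d_current, d_tighter = math.log(current_margin), math.log(tighter_margin)
    else:
        d_current, d_tighter = current_margin - 1.0, tighter_margin - 1.0
    return (d_current / d_tighter) ** 2


# ACUITY: relative risk margin 1.25 versus a hypothetical tighter margin of 1.10
print(f"excess-risk scale: {size_ratio(1.25, 1.10):.2f}x")
print(f"log scale:         {size_ratio(1.25, 1.10, log_scale=True):.2f}x")
```

On the excess-risk scale the ratio is (0.25/0.10)² = 6.25, in line with the "more than 6×" quoted above; on the log scale it is somewhat smaller, at roughly 5.5.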
The choice of margin is sometimes related to prior knowledge of the efficacy of the active control compared with placebo. A sensible goal is that the new treatment should preserve at least 50% of the effect demonstrated in prior trials of the control treatment against placebo (the so-called “putative placebo” approach). For instance, in the CONVINCE (Controlled Onset Verapamil Investigation of Cardiovascular End Points) trial of verapamil versus standard antihypertensive treatment with a diuretic agent or beta-blocker, the noninferiority margin for the composite of stroke, MI, or CV death was set at a hazard ratio of 1.16 (8). This was because of the need for evidence that verapamil was at least one-half as effective as the standard treatment, relative to placebo. Regulatory agencies accept this method to establish a noninferiority margin and provide guidance for its determination (1).
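As a rough sketch of the "preserve at least 50% of the effect" idea, the calculation below works on the log hazard ratio scale, which is one common convention and is an assumption here; the text does not say how the CONVINCE investigators derived 1.16, and in practice a conservative bound on the historical effect, rather than its point estimate, is often used. Both function names are hypothetical.

```python
import math


def margin_preserving_fraction(hr_placebo_vs_control, fraction_preserved=0.5):
    """Noninferiority margin (hazard ratio, new vs. control) that retains at least
    fraction_preserved of the control's effect over placebo, on the log scale."""
    return math.exp((1.0 - fraction_preserved) * math.log(hr_placebo_vs_control))


def implied_placebo_hr(margin, fraction_preserved=0.5):
    """Back-calculate the placebo vs. control hazard ratio that a margin implies."""
    return math.exp(math.log(margin) / (1.0 - fraction_preserved))


# Under this convention, a margin of 1.16 corresponds to an assumed
# placebo vs. standard treatment hazard ratio of 1.16 squared
print(f"{implied_placebo_hr(1.16):.3f}")
print(f"{margin_preserving_fraction(1.3456):.3f}")
```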
In addition to the assumed event rates, margin, and desired power, the sample size of a noninferiority trial depends on whether the delta will be tested against the upper bound of a 1- or 2-sided 95% CI (the latter being equivalent to a 1-sided 97.5% confidence limit). The latter, more conservative approach is the standard for regulatory approval of new pharmaceuticals (and many devices). However, some devices have been approved on the basis of a noninferiority design with a 1-sided alpha of 5%. One example is the FilterWire EX system (Boston Scientific, Marlborough, Massachusetts) to prevent distal embolization during percutaneous coronary intervention (PCI) of diseased saphenous vein grafts, which was examined in the FIRE (FilterWire EX Randomized Evaluation) trial (9). Utilizing a 1-sided alpha of 5%, rather than 2.5%, reduces the sample size by approximately 20%, although this is generally frowned upon. A greater alpha error may be acceptable, however, when the experimental device provides additional benefits not evident in the primary endpoint.
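The approximate 20% saving follows from the usual normal-approximation sample size formula, in which the required size is proportional to the square of the sum of the alpha and power quantiles. The sketch below assumes that formula and that everything else (margin, event rates) is held fixed; the function name is illustrative.

```python
from statistics import NormalDist


def relative_size(alpha_new, alpha_ref, power):
    """Sample size at 1-sided alpha_new relative to 1-sided alpha_ref,
    assuming size is proportional to (z_alpha + z_power) squared."""
    z = NormalDist().inv_cdf
    z_power = z(power)
    return ((z(1 - alpha_new) + z_power) / (z(1 - alpha_ref) + z_power)) ** 2


for power in (0.80, 0.90):
    saving = 1 - relative_size(0.05, 0.025, power)
    print(f"power {power:.0%}: sample size reduced by {saving:.1%}")
```

The saving works out at roughly 21% for 80% power and 18% to 19% for 90% power, consistent with the "approximately 20%" quoted above.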
A noninferiority design may also be applied to exclude a safety concern in a treatment with known efficacy. Such safety trials can include comparison of the experimental agent to an active comparator (e.g., as in the ENTRACTE [A Clinical Outcomes Study to Evaluate the Effects of IL-6 Receptor Blockade With Tocilizumab in Comparison With Etanercept on the Rate of Cardiovascular Events in Patients With Moderate to Severe Rheumatoid Arthritis] trial performed to exclude excess CV risk for tocilizumab compared with etanercept in patients with rheumatoid arthritis) (10). In type 2 diabetes, however, U.S. Food and Drug Administration guidance requires assessment of the CV risk of any new drug relative to placebo (11). Many such placebo-controlled trials in high-risk patients who are already on appropriate antiglycemic therapy are either currently in progress or recently completed. The primary safety endpoint is typically the composite of CV death, MI, and stroke, and the noninferiority margin is set at a hazard ratio of 1.3. This requires a trial of many thousands of patients, because approximately 700 primary events are needed to provide convincing evidence of noninferiority. For a new, effective antidiabetic drug, the U.S. Food and Drug Administration also requires preliminary evidence of CV safety for initial approval, using a hazard ratio noninferiority margin of 1.8. The larger safety trial to confirm noninferiority on the basis of the tougher margin of 1.3 then ensues.
It is sometimes argued that noninferiority trials should emphasize a per-protocol (or as-treated) analysis, rather than analysis by intention-to-treat, thereby excluding any follow-up after a patient withdraws from randomized treatment (or after a short period following withdrawal to capture rebound events). The logic is that including off-treatment follow-up (possibly with crossovers) may dilute any real treatment differences, thereby artificially enhancing any claim of noninferiority. However, per-protocol and as-treated analyses introduce other biases. We suggest that both types of analyses be presented in noninferiority trials, hopefully demonstrating a consistency of findings.
When undertaking a noninferiority trial, one can also propose a superiority hypothesis with no statistical penalty. That is, once the trial results confirm noninferiority, one can go on to test for superiority (see scenario A in Figure 1). For instance, some CV safety trials of antidiabetic drugs have been made larger to accommodate this superiority hypothesis. One such trial (EMPA-REG OUTCOME [A Randomized, Placebo-controlled Cardiovascular Outcome trial of Empagliflozin]) of empagliflozin versus placebo recently demonstrated some evidence of a reduction in the primary endpoint of CV death, MI, or stroke, with a hazard ratio of 0.86 (95% CI: 0.74 to 0.99; p = 0.04), while also showing a significant reduction in all-cause death with a hazard ratio of 0.68 (95% CI: 0.57 to 0.82; p < 0.001) (12).
Sometimes, one can pursue 2 separate treatment comparisons within the same major trial by randomizing each patient twice: once to treatment A versus its control, and at the same time, to treatment B versus its control. This is known as a 2-way factorial design (13,14). Factorial designs have numerous practical benefits, such as adding in a second randomization within the framework of a trial funded for a different purpose, affording the opportunity to investigate an inexpensive treatment that would otherwise be difficult to fund and test in its own trial. For instance, the HOPE (Heart Outcomes Prevention Evaluation) factorial trial studied ramipril versus placebo and then also vitamin E versus its placebo in high-risk patients (15,16). Ramipril significantly reduced CV events, whereas vitamin E did not.
In planning a factorial design, one presumes that the treatment effect in 1 randomized comparison is not likely to depend on the other randomized treatment: that is, there is no expectation of an interaction between the 2 randomized treatments. Thus, the trial is powered to examine the main effects of the 2 randomized comparisons separately. By doing so, one neatly gets “2 trials for the price of 1”; that is, in principle adding in the second randomization does not increase the trial size. In practice, it may be wise to somewhat inflate trial size when a factorial design is contemplated because: 1) if both treatments are effective, the overall event rate will be lower; and 2) one may wish to guard against a modest quantitative interaction being present.
The CURRENT-OASIS 7 (Clopidogrel and Aspirin Optimal Dose Usage to Reduce Recurrent Events−Seventh Organization to Assess Strategies in Ischemic Syndromes) trial randomized 25,086 ACS patients referred for an invasive strategy to both: 1) double-dose versus standard-dose clopidogrel; and 2) higher-dose versus lower-dose aspirin (17). The primary outcome was CV death, MI, or stroke within 30 days, and the findings are shown in Table 1. The 2 main effect analyses showed that neither the clopidogrel dose nor the aspirin dose appeared to have any effect on the primary endpoint (p = 0.30 and p = 0.61, respectively). Exploring the potential interaction between the 2 drug doses, however, revealed a curious finding: the observed event rate was lower on double-dose than standard-dose clopidogrel (3.8% vs. 4.6%) when given with higher-dose aspirin, but this was reversed (4.5% vs. 4.2%) when given with lower-dose aspirin. This apparent qualitative interaction did reach conventional statistical significance: interaction p = 0.04. The authors believed that this unexpected finding lacked a known biological mechanism and may have been due to the play of chance, which is a reasonable supposition. Conversely, if a plausible biological explanation for an interaction can be posited, the validity of the conclusions drawn from both randomized comparisons may be jeopardized, an inherent risk of factorial designs. Factorial designs should therefore only be contemplated when the expectation of a real interaction between the 2 therapies is low. In principle, one can still undertake a factorial trial when a plausible interaction between the 2 treatment factors is contemplated, but this would require a major increase in trial size to be adequately powered to detect such an interaction.
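To illustrate how such an interaction can be examined, the sketch below compares the clopidogrel odds ratio between the 2 aspirin strata on the log scale. The cell counts are hypothetical: they assume 4 equal cells of 6,270 patients (roughly 25,086 divided by 4) and events back-calculated from the rounded percentages quoted above, so they are not the trial's actual data, and the trial's own interaction test will have used a different analysis.

```python
import math
from statistics import NormalDist


def log_odds_ratio(events_a, n_a, events_b, n_b):
    """Log odds ratio (group a vs. group b) and its large-sample variance."""
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def interaction_test(stratum_1, stratum_2):
    """Two-sided test that the odds ratio is the same in both strata."""
    lor1, var1 = log_odds_ratio(*stratum_1)
    lor2, var2 = log_odds_ratio(*stratum_2)
    z = (lor1 - lor2) / math.sqrt(var1 + var2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return math.exp(lor1), math.exp(lor2), p_value


# HYPOTHETICAL counts: (events double-dose, n, events standard-dose, n)
higher_dose_aspirin = (238, 6270, 288, 6270)   # about 3.8% vs. 4.6%
lower_dose_aspirin = (282, 6270, 263, 6270)    # about 4.5% vs. 4.2%

or_high, or_low, p_interaction = interaction_test(higher_dose_aspirin, lower_dose_aspirin)
print(f"OR with higher-dose aspirin: {or_high:.2f}")
print(f"OR with lower-dose aspirin:  {or_low:.2f}")
print(f"interaction p value:         {p_interaction:.3f}")
```

With these assumed counts the odds ratios fall on opposite sides of 1 and the interaction p value lands in the same neighborhood as the reported p = 0.04, although it should not be expected to match it exactly.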
Another useful option is a partial (or nested) factorial design, where all recruited patients get 1 random treatment allocation, but only some patients are eligible for the second randomized treatment. For instance, the HORIZONS-AMI (Harmonizing Outcomes with Revascularization and Stents in Acute Myocardial Infarction) trial randomized 3,602 ST-segment elevation MI patients to bivalirudin versus heparin plus a glycoprotein IIb/IIIa inhibitor (in a 1:1 ratio) (18,19). Among these patients, 3,006 met additional anatomic inclusion criteria and underwent a second randomization to PCI with paclitaxel-eluting versus bare-metal stents (in a 3:1 ratio).
Occasionally the factorial design can take on more than 2 treatment factors. For instance, the ISIS-4 (Fourth International Study of Infarct Survival) randomized 58,050 patients with MI to: 1) oral captopril versus placebo; 2) oral mononitrate versus placebo; and 3) intravenous magnesium sulfate versus open control in a 2 × 2 × 2 factorial design (20). Finally, the MATRIX (Minimizing Adverse Haemorrhagic Events by Transradial Access Site and Systemic Implementation of AngioX) trial is an example of a 3-level randomization with a nested factorial approach. In MATRIX, 8,404 patients with ACS undergoing cardiac catheterization were randomized to radial versus femoral vascular access. Among this group, 7,213 patients in whom PCI was selected for treatment were randomized again to procedural anticoagulation with heparin versus bivalirudin. Finally, the 3,610 bivalirudin-assigned patients were randomized a third time to either a post-procedural prolonged bivalirudin infusion or to no infusion (21,22).
When circumstances are right, the factorial design is a useful means of investigating 2 (or more) different treatment innovations within 1 trial. Overall, trialists need to give more attention to the imaginative use of factorial designs.
Trials of Alternative Treatment Strategies
Trials of fundamentally different treatment strategies, for example, surgery versus PCI or medical therapy, or invasive versus conservative approaches in patients with ACS, are an exciting challenge and can have a substantial effect on guidelines and clinical practice (23,24). Such “strategy” trials are, however, more difficult to undertake than studies comparing different drugs or different devices to each other.
When the randomized strategies differ substantially in their perception by both investigators and patients, particular challenges arise. Investigators (often across specialties [e.g., cardiac surgeons and interventional cardiologists]) need to accept that the patient may truly receive either strategy without being disadvantaged (i.e., a state of equipoise is indeed present). Even if solid evidence is lacking, physicians (and patients) may express strongly held beliefs in the superiority of one treatment compared to another, on the basis of anecdotal experiences or reports, nondefinitive evidence (e.g., uncontrolled observational comparisons or small randomized trials), or prior positive trials using surrogate endpoints. These preconceived beliefs can make enrollment more difficult and may result in a biased cohort being recruited. Obtaining informed patient consent is also less routine in strategy trials than in standard randomized drug or device studies. Strategy trials also typically require multidisciplinary cooperation, greater resources, and a longer period for full recruitment, and are thus more expensive. Strategy trials often lack a single funding source from industry, and therefore often require pure governmental and/or institutional support, collaboration between multiple companies, or a private-public partnership. Thus, major challenges in strategy trials include randomizing a high enough proportion of eligible patients in a reasonable timeframe, and raising appropriate funds.
For instance, ISCHEMIA (International Study of Comparative Health Effectiveness With Medical and Invasive Approaches) is a major multinational trial of routine invasive versus conservative strategies in patients with stable coronary disease and at least moderate ischemia (25). A strong evidence-based case can be made for either approach in such patients (26). A prior survey of interested cardiologists asked if they would enroll their eligible patients in a randomized trial with a 50% chance of being conservatively managed without cardiac catheterization; 80% responded positively (27). The ISCHEMIA trial initially planned to recruit 8,000 patients, but after more than 2 years, only ∼2,000 patients have been randomized, which may require a protocol amendment to reduce the sample size. Such lower-than-desired recruitment is a common problem with strategy trials.
Strategy trials are particularly important when evaluating a new therapeutic approach. For instance, transcatheter aortic valve replacement has emerged as an alternative to surgical aortic valve replacement in patients at high and prohibitive operative risk (28,29). Ongoing trials are now being performed in patients at lower surgical risk. Key aspects here are to decide when in the learning curve of such a new technology one should undertake such a trial; to define the risk profile of patients that should initially be recruited; and to create the right collaborative atmosphere for general cardiologists, interventionalists, and surgeons to participate.
The results of strategy trials require careful interpretation, especially when crossovers occur. For instance, the COURAGE (Clinical Outcomes Utilizing Revascularization and Aggressive Drug Evaluation) trial studied optimal medical therapy (OMT) with and without initial PCI in 2,287 patients with stable coronary disease (30). The primary endpoint, the composite rate of death or nonfatal MI, showed no significant difference between the PCI and medical therapy groups after a median 4.6 years of follow-up. A naive interpretation is that PCI is no better than medical therapy (and thus PCI should never be performed), but this ignores the strategic concept of the trial. In the COURAGE trial, 32.5% of patients assigned to OMT went on to receive revascularization (mostly PCI) during follow-up, primarily for progressive or unstable symptoms. Thus, the trial really compared “PCI (plus OMT) now” with “OMT now, with the option of later PCI (or coronary artery bypass graft), as needed.” The pure question “does PCI improve prognosis?” is not directly answerable because the investigators could not continue with medical therapy alone.
An additional concern of particular relevance to strategy trials is that, given their inherently protracted nature (slow recruitment with long follow-up), the standard of care frequently evolves before they are completed. For instance, in the SYNTAX (Synergy Between Percutaneous Coronary Intervention with Taxus and Cardiac Surgery) trial, coronary artery bypass graft was shown to be superior to PCI using a first-generation paclitaxel-eluting stent (31). However, by the time the SYNTAX trial was completed, second-generation drug-eluting stents had been developed, which have been associated with reduced rates of death, MI, and repeat revascularization compared with paclitaxel-eluting stents (32). Studies have suggested that this advance alone might have eliminated the difference between the 2 strategies (33). Confirming such a hypothesis requires performance of another time-consuming and costly randomized trial, which, in turn, risks further advances in technology before its completion.
Despite the practical difficulties in undertaking randomized trials of alternative strategies, they are of key importance in evaluating radically different approaches to patient care. Otherwise, we are forced to rely on nonrandomized comparisons on the basis of patient registries. They, too, provide a wealth of interesting data, but always with the caveat that substantial selection bias is typically present, resulting in unmeasured confounders that cannot be accounted for in statistical analysis (34,35).
One exciting development is the growth of pragmatic trials that are embedded within routine care delivery (i.e., trials with patient registries, such as the TASTE [Thrombus Aspiration in ST-Elevation Myocardial Infarction in Scandinavia] trial of thrombus aspiration for MI). Such trials greatly enhance patient representativeness, recruitment, and follow-up, with associated reduced trial costs. However, they are best suited to assess endpoints reliably tracked in administrative databases, such as all-cause mortality.
Data Monitoring for Efficacy, Safety, and Futility
Most major randomized trials require interim analyses of the accumulating outcome data by treatment group. Such unblinded interim analyses are produced by an independent statistician and are evaluated by an independent DMC, comprising several clinicians plus a statistician, all of whom have no other involvement in the trial and operate under strict confidentiality (37,38).
The main DMC responsibility is to protect patient safety, that is, to identify and react to any evidence of harm occurring to patients, especially on the new treatment. Adverse events may relate to pre-defined safety issues (e.g., bleeding on antiplatelet drugs), unexpected event types, or inferiority with regard to primary or secondary event outcomes. The DMC should meet regularly so that any ethical concerns regarding potential harm can be dealt with in a timely fashion. If safety issues become evident, the DMC may request more data analyses and schedule follow-up meetings more frequently. The DMC can recommend to the study leadership that the trial be stopped or altered. However, given the likelihood of chance variations in repeated looks at accumulating data, major alterations should only be recommended if truly convincing evidence of harm is present, with a lower threshold to modify or stop the trial for concerns relating to increased mortality, as opposed to other endpoints.
A second DMC responsibility may be to evaluate whether there is overwhelming evidence for superiority of the new treatment, which is sufficiently convincing to merit stopping the trial early. However, trials that are stopped early tend to overestimate true treatment effects. Thus, early trial stoppage should only be recommended for situations in which continuing would truly place the control group patients at harm (e.g., increased mortality, resulting in an ethical imperative to unblind and expedite approval of the experimental treatment).
Sometimes there is a third issue for the DMC to consider: futility. That is, does the accumulating evidence indicate that the new treatment lacks efficacy? If there is little chance of the trial achieving a clinically relevant positive outcome, the trial may be stopped early for futility. Such a decision needs careful consideration because, even if the primary endpoint shows no benefit, secondary endpoints with real clinical value may emerge as positive (even if only hypothesis-generating).
A further DMC responsibility is to look at trial quality issues. For instance, if problems with noncompliance, missing visits/data, or slowness in event adjudication are evident, the DMC should provide feedback to the study leadership to facilitate improvements.
After every interim report and meeting, the DMC needs to promptly communicate its recommendations to the trial’s principal investigator (e.g., the chair of the Executive Committee) or, sometimes, directly to the trial sponsor in writing (or sooner by phone, if major issues of patient safety are apparent).
All DMC-related activities should be documented in a DMC Charter (39). This should include any statistical stopping guidelines (40), recognizing that these are not formal rules; the recommendation to stop rests on the wise judgment of the DMC on the basis of the totality of evidence at their disposal, both within the trial and externally. Note that the DMC only makes recommendations: any decisions on stopping or modifying the trial are the responsibility of the trial Executive Committee or sponsor. So, what makes for sensible statistical stopping guidelines?
First, stopping for superiority of a new treatment requires proof beyond a reasonable doubt. For example, a p value <0.001 is often used, or even a p value <0.0001 at a relatively early interim analysis. Furthermore, it is wise not to look too early or too often for superiority: 2 or 3 interim looks should suffice. For instance, the PARADIGM-HF (Prospective Comparison of ARNI [Angiotensin Receptor–Neprilysin Inhibitor] with ACEI [Angiotensin-Converting–Enzyme Inhibitor] to Determine Impact on Global Mortality and Morbidity in Heart Failure) trial of LCZ696 versus enalapril in chronic heart failure required a p value <0.001 for both the composite primary endpoint (CV death or hospitalization for heart failure) and CV death alone at its second interim analysis, when two-thirds of primary events had occurred (41). Both boundaries were crossed, and the DMC duly recommended stopping.
Of note, achieving a statistical guideline does not automatically mean the trial is stopped. For instance, in the SHIFT (Systolic Heart Failure Treatment with the If Inhibitor Ivabradine Trial) trial of ivabradine versus placebo, superiority was present at the second planned interim analysis for both the composite primary endpoint (CV death or hospitalization for heart failure) and all-cause death: p <0.0001 and p = 0.0014, respectively (42). The pre-defined stopping boundary was a p value <0.001 for the primary endpoint. However, the DMC recommended continuation: there were only a few months to go to complete enrollment, important subgroup issues needed resolving, event adjudication was incomplete, and a previous related trial (BEAUTIFUL [Morbidity-Mortality Evaluation of the If Inhibitor Ivabradine in Patients with Coronary Disease and Left-Ventricular Dysfunction]) had been neutral (43). Upon trial completion, the primary endpoint finding was confirmed, but all-cause mortality was no longer significant (p = 0.09). Such “regression to the truth” may often arise. That is, interim findings that cross a stopping boundary may be “on a random high,” so that subsequent results (if the trial continues) may end up less impressive (44).
Second, stopping for futility has 2 types of statistical guidelines (40,45). One approach is to stop the trial early if the 95% CI for the primary endpoint effect estimate excludes a pre-declared minimum benefit. For instance, in the PERFORM (Prevention of Cerebrovascular and Cardiovascular Events of Ischaemic Origin with Terutroban in Patients with a History of Ischaemic Stroke or Transient Ischaemic Attack) trial, the primary endpoint was the composite of CV death, MI, or ischemic stroke (46). At the 20th safety report, the hazard ratio was 1.04 (95% CI: 0.95 to 1.14). This excluded the pre-defined 7% benefit (i.e., a hazard ratio of 0.93), and so the DMC recommended that the trial be stopped for futility.
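The confidence interval rule amounts to a one-line check, sketched here with the PERFORM figures quoted above. The function name is hypothetical, and the sketch assumes that hazard ratios below 1 favor the new treatment.

```python
def excludes_minimum_benefit(ci, min_benefit_hr):
    """Return True if the whole CI is less favorable than the minimum benefit.

    ci             : (lower, upper) bounds of the 95% CI for the hazard ratio.
    min_benefit_hr : pre-declared hazard ratio representing the minimum benefit.
    """
    lower, _upper = ci
    # With HR < 1 favoring treatment, the most optimistic value in the CI is
    # its lower bound; futility is suggested when even that misses the target.
    return lower > min_benefit_hr

perform_futile = excludes_minimum_benefit((0.95, 1.14), min_benefit_hr=0.93)
print(perform_futile)
```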
An alternative approach uses conditional power: that is, if the interim data indicate only a slim chance of achieving statistical significance upon trial completion, then stopping early for futility may be reasonable. This method was applied in the RED-HF (Reduction of Events by Darbepoetin Alfa in Heart Failure) trial of darbepoetin alfa versus placebo in heart failure patients with anemia (47). Futility was considered at each interim analysis: if the conditional power under the protocol-specified hazard ratio of 0.8 for the composite primary endpoint (death or heart failure hospitalization) was <30%, then the DMC could recommend the trial be stopped. This boundary was eventually crossed, but the DMC decided to allow the trial to continue: there were no safety concerns and there were significant quality of life improvements (a secondary endpoint).
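Conditional power under a protocol-specified hazard ratio can be sketched as follows. This is an illustration under assumptions we introduce, not the RED-HF calculation: the log hazard ratio is taken to have standard error of approximately 2 divided by the square root of the number of events (1:1 allocation), the final test is 2-sided at 5%, and the event counts in the usage lines are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

_norm = NormalDist()

def conditional_power(observed_hr, events_so_far, total_events,
                      assumed_hr, alpha_two_sided=0.05):
    """Chance of a significant final result, given the interim data.

    The future treatment effect is assumed to equal `assumed_hr`
    (e.g., the hazard ratio specified in the protocol).
    """
    t = events_so_far / total_events              # information fraction
    z_crit = _norm.inv_cdf(1 - alpha_two_sided / 2)
    # Interim z-statistic, signed so that positive values indicate benefit.
    z_interim = -log(observed_hr) * sqrt(events_so_far) / 2
    b_interim = z_interim * sqrt(t)               # B-value at the interim look
    # Expected final z-statistic if the assumed hazard ratio were true.
    drift = -log(assumed_hr) * sqrt(total_events) / 2
    expected_final = b_interim + drift * (1 - t)
    return 1 - _norm.cdf((z_crit - expected_final) / sqrt(1 - t))

# Hypothetical interim look: 500 of 1,000 planned events, design HR of 0.8.
cp_flat = conditional_power(1.00, 500, 1000, assumed_hr=0.8)
cp_adverse = conditional_power(1.05, 500, 1000, assumed_hr=0.8)
print(f"observed HR 1.00: {cp_flat:.2f}; observed HR 1.05: {cp_adverse:.2f}")
```

In this hypothetical setting, an observed hazard ratio of 1.05 at the halfway point would fall below a 30% conditional power threshold, whereas an observed hazard ratio of 1.00 would not.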
Third, stopping for safety usually requires more frequent looks at interim data, because there is an ethical obligation to stop promptly if a new treatment is causing harm (48). Also, the stopping boundary needs to be less stringent; for example, a p value <0.01 going the wrong way for the primary endpoint or all-cause mortality is a useful simple guideline. For instance, in the ILLUMINATE (Investigation of Lipid Level Management to Understand its Impact in Atherosclerotic Events) trial of torcetrapib versus placebo in high-risk patients, the DMC observed 82 deaths in the treatment arm versus 51 deaths with control (p = 0.007), which was the prime reason for stopping the trial for harm (49). As a consequence, the sponsor withdrew the drug immediately from any further investigation worldwide.
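The ILLUMINATE p value can be approximated from the death counts alone. The sketch below assumes 1:1 randomization with similar follow-up in each arm, so that each death is equally likely to fall in either arm if there is no treatment effect; it uses a normal approximation and is not the trial's actual analysis. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def event_split_p_value(events_a, events_b):
    """Two-sided p value for an imbalance in event counts between 2 arms.

    Under the null hypothesis and equal allocation (an assumption), events
    in arm A follow a Binomial(total, 0.5) distribution, which gives
    z = (a - b) / sqrt(a + b).
    """
    z = (events_a - events_b) / sqrt(events_a + events_b)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z_illuminate, p_illuminate = event_split_p_value(82, 51)
print(f"z = {z_illuminate:.2f}, p = {p_illuminate:.4f}")
```

This gives a z-statistic of about 2.69 and a p value of about 0.007, in line with the figure reported above.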
Similarly, the PALLAS (Permanent Atrial Fibrillation Outcome Study Using Dronedarone on Top of Standard Therapy) trial of dronedarone versus placebo in permanent atrial fibrillation was stopped early when both coprimary endpoints of: 1) stroke, MI, systemic embolism, or CV death; and 2) unplanned hospitalization for a CV cause or death, demonstrated an excess on dronedarone (both p < 0.01) (50). This was particularly surprising, given that the earlier ATHENA (A Placebo-Controlled, Double-Blind, Parallel Arm Trial to Assess the Efficacy of Dronedarone 400 mg bid for the Prevention of Cardiovascular Hospitalization or Death from Any Cause in Patients with Atrial Fibrillation/Atrial Flutter) trial of dronedarone in nonpermanent/paroxysmal atrial fibrillation had shown a highly significant benefit (51). This illustrates the importance of the safety role of a DMC, no matter how promising the prior evidence from other sources.
Stopping early for harm may relate not to the efficacy endpoints, but to specific safety problems instead. For instance, at an early interim report, the APPRAISE 2 (Apixaban for Prevention of Acute Ischemic Events 2) trial of apixaban versus placebo in ACS patients showed significant increases in major bleeding events for those on apixaban (52). Numbers of events were small, but given that the primary efficacy endpoint of CV death, MI, or ischemic stroke had thus far showed no benefit, this safety signal was deemed sufficient to halt the trial. In such scenarios of potential harm, it is difficult to have a statistical stopping guideline that adequately captures the ethical concern, which needs balancing against potential benefit regarding efficacy endpoints. Such matters depend on an experienced DMC acting wisely, being fully aware of the ethical and practical consequences of its actions.
Let us conclude this section with potential stopping guidelines for a planned placebo-controlled trial of a new drug for patients at high CV risk. The trial is to recruit 13,000 patients, and completion is planned when 1,600 primary major adverse CV events have occurred, anticipated to take >5 years duration in total. This gives 90% power to detect a 15% risk reduction (i.e., hazard ratio: 0.85). The trial plans to have 2 interim analyses, after 50% and 75% of primary events have occurred, and the proposed stopping boundaries for superiority and for futility are shown in Table 2.
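The stated power can be checked with a standard approximation for time-to-event trials. This is a minimal sketch assuming 1:1 allocation, a 2-sided 5% significance level, and a standard error for the log hazard ratio of approximately 2 divided by the square root of the number of events; the function name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

_norm = NormalDist()

def power_for_events(n_events, hazard_ratio, alpha_two_sided=0.05):
    """Approximate power of a log-rank-type test with `n_events` events."""
    z_crit = _norm.inv_cdf(1 - alpha_two_sided / 2)
    # Expected z-statistic if the true hazard ratio equals `hazard_ratio`.
    drift = abs(log(hazard_ratio)) * sqrt(n_events) / 2
    return _norm.cdf(drift - z_crit)

planned_power = power_for_events(1600, 0.85)
print(f"power with 1,600 events for HR 0.85: {planned_power:.3f}")
```

Under these assumptions, 1,600 events give a power of approximately 90% to detect a hazard ratio of 0.85.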
First, the timing of these boundaries recognizes that stopping for either superiority or futility should not be contemplated before at least one-half of the trial’s evidence has accumulated. The superiority guideline (p < 0.0002) reflects the spirit of only stopping when there is overwhelming evidence. It is interesting to note that to stop early, the hazard ratios for the major adverse CV event primary endpoint at the 2 interim looks would need to be <0.768 and <0.806, respectively, considerably more beneficial than the hazard ratio of 0.85 used in the power calculation. Given the tough stopping boundary, the final p value <0.05 for a positive outcome is not compromised, and with 1,600 primary events, an observed hazard ratio <0.906 would reach statistical significance.
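The hazard ratios needed to cross each boundary follow from the same approximation. The sketch below assumes 1:1 allocation and a standard error for the log hazard ratio of approximately 2 divided by the square root of the number of events; it is a back-of-envelope check rather than the calculation behind Table 2, and the function name is hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist

def critical_hazard_ratio(n_events, p_two_sided):
    """Largest hazard ratio that still reaches the given 2-sided p value."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    se_log_hr = 2 / sqrt(n_events)   # approximation for 1:1 allocation
    return exp(-z * se_log_hr)

hr_first_look = critical_hazard_ratio(800, 0.0002)    # 50% of events
hr_second_look = critical_hazard_ratio(1200, 0.0002)  # 75% of events
hr_final = critical_hazard_ratio(1600, 0.05)          # final analysis
print(f"{hr_first_look:.3f} {hr_second_look:.3f} {hr_final:.3f}")
```

The 3 values come out at approximately 0.769, 0.807, and 0.907, which agree closely with the thresholds quoted above.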
The stopping guidelines for futility in Table 2 are based on conditional power calculations. With 50% of the event data in (800 primary endpoint events), if the hazard ratio is only very slightly in a positive direction (hazard ratio >0.979) or in the opposite direction, then the trial may stop for futility. After a further 25% of events have accrued at the second interim analysis (1,200 events), one would need a somewhat stronger indication of treatment benefit to continue: a hazard ratio >0.931 is considered sufficient to stop for futility. Note that these are not intended as absolute rules. There may be other issues (secondary endpoints, safety concerns, subgroup findings, or external evidence) that could sway the totality of evidence in a positive or negative direction.
Last, note the lack of any formal stopping boundaries for safety. Experience dictates that it is impractical to capture all the scenarios and nuances of potential harms in statistical guidelines. Rather, the trial DMC will receive frequent safety reports every few months and will collectively make judgments on the strength of evidence and the absolute magnitude and seriousness of any safety signals.
The conventional wisdom in clinical trial design is that once the study protocol is finalized, the trial should proceed with no further changes to its intent. Protocol amendments are permitted under certain circumstances, but should be made without knowledge of interim results by treatment groups: that is, the DMC should have no involvement in such changes. Such amendments may be of a practical nature, for example, clarifications of patient eligibility, endpoint definitions, or drug dose modifications. Amendments in response to knowledge of ongoing blinded results for all treatments combined are also permitted. For instance, if the incidence of the primary endpoint pooled across randomized groups is substantially lower than anticipated, the target sample size might be increased, the eligibility criteria might be changed to recruit higher-risk patients, the duration of follow-up might be prolonged, or the primary endpoint might even be altered (e.g., by expanding a composite to include additional types of outcomes). In principle, such adaptations are acceptable and carry no statistical penalties, although they may prompt concerns that someone involved had an awareness of unblinded results. In particular, changing the primary endpoint often evokes suspicion, even if unwarranted.
An emerging and more controversial type of adaptive design is where protocol changes are made on the basis of the unblinded interim results (53,54). Both European and U.S. regulators have issued guidance on the use (and possible misuse) of such adaptations (55,56). It is key that any such potential changes should be pre-defined in an Adaptive Charter, that they should not affect the trial’s overall integrity, and that they should preserve statistical rigor: that is, an unbiased verdict is still reached on the treatments’ relative merits.
The most common adaptation using unblinded data concerns sample size re-estimation (57). Other types of proposed adaptive designs (54) include seamless phase II/III trial designs, whereby from multiple new treatments (e.g., different drug doses), one drops some arms at an interim analysis on the basis of a surrogate outcome, thereafter examining clinical outcomes (58); enrichment designs, in which after the adaptation, selected subgroups of patients are preferentially enrolled in whom the event rates were observed to be high or evidence of treatment effect appeared particularly robust (59); and “play the winner,” whereby the randomization ratio is adjusted to put a higher proportion of future patients on the treatment with better interim results (60). All have a methodological appeal, but introduce logistical and interpretive challenges.
Hence, we now concentrate on adaptive sample size re-estimation. The logic is that if the observed treatment difference for the primary endpoint at a pre-planned interim analysis is somewhat smaller than that assumed in the original power calculation, trial size may be increased to provide adequate power to detect such a more modest treatment effect. For this approach to be valid, the interim results need to be in a “promising zone,” that is: 1) the observed interim treatment difference, although smaller than hoped for, is still trending in the right direction and is big enough to be of clinical relevance; and 2) the expansion in sample size takes the conditional power from a current 50%+ to a desired 80% or higher. Then, the type I error may be preserved without any statistical adjustments. A sample size increase could also be considered if the effect size is preserved, but the endpoint rates at the interim analysis are lower than anticipated.
Figure 2 gives a conceptual outline of how adaptive sample size re-estimation could work. Suppose an interim analysis is performed after one-half of the original trial’s results are known and that you are prepared to increase the size (if necessary) up to double that originally planned. Then, whether to make any size increase depends on how the observed treatment difference compares with the pre-planned treatment difference used in the original power calculation. If the 2 differences are similar, then the trial is “on track” and there is no need to increase trial size. We call this the favorable zone; in Figure 2, this extends to point B, where the observed difference is approximately 90% of the pre-planned difference (point A).
The promising zone refers to the scenario where the observed difference is less than hoped for, but conditional power can still be raised to the desirable 90% by increasing the sample size. This works fine if the observed difference is at least 66% of the pre-planned difference (point C in Figure 2) for which a doubling of size is needed.
One can then extend the promising zone into less optimistic territory, where a doubling is still to be done, even though the conditional power cannot make it to the desired 90%. For instance, point D in Figure 2 occurs when the interim difference is only slightly more than one-half of the pre-planned difference. Doubling the trial size can raise the conditional power up to more than 50%, which is not hopeless, but a gamble as to whether the trial will end up positive. If the interim results are worse than that, then one is in the unfavorable zone. The trial then continues to its original size (unless findings are very unfavorable, in which case stopping for futility may be considered). Note that Figure 2 is just conceptual: precise statistical details would need to be calculated (57) and specified in an Adaptive Charter.
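The zones described above can be explored numerically. The following is a rough sketch under assumptions that we introduce and that are not taken from Figure 2: the original design has 90% power at a 1-sided 2.5% level, the interim look is at one-half of the original size, conditional power is computed assuming the observed interim difference continues (the “current trend”), and the final analysis uses the conventional unadjusted test. Function and parameter names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()

def conditional_power_after_resize(fraction_of_planned, size_multiplier,
                                   interim_fraction=0.5, design_power=0.90,
                                   alpha_one_sided=0.025):
    """Conditional power under the current trend after resizing the trial.

    fraction_of_planned : observed interim difference / pre-planned difference.
    size_multiplier     : final size relative to the original plan (1.0 to 2.0).
    interim_fraction    : interim information as a fraction of the original size.
    """
    z_crit = _norm.inv_cdf(1 - alpha_one_sided)
    # Expected final z-statistic at the original size under the design effect.
    design_drift = z_crit + _norm.inv_cdf(design_power)
    observed_drift = fraction_of_planned * design_drift
    m = size_multiplier
    # Score statistic: observed so far plus its expected future increment,
    # compared with the critical value rescaled to the enlarged trial.
    numerator = observed_drift * m - z_crit * sqrt(m)
    return _norm.cdf(numerator / sqrt(m - interim_fraction))

cp_point_b = conditional_power_after_resize(0.90, 1.0)  # no increase in size
cp_point_c = conditional_power_after_resize(0.66, 2.0)  # size doubled
cp_point_d = conditional_power_after_resize(0.50, 2.0)  # size doubled
print(f"{cp_point_b:.2f} {cp_point_c:.2f} {cp_point_d:.2f}")
```

Under these assumptions, the conditional power is approximately 0.91 when the observed difference is 90% of that planned and the size is unchanged, approximately 0.89 when it is 66% of that planned and the size is doubled, and approximately 0.65 when it is one-half of that planned and the size is doubled. These values are broadly consistent with the zones described, but the precise boundaries for any real trial would need to be calculated and specified in an Adaptive Charter.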
Pre-planned adaptive sample size re-estimation has been used in 2 trials of cangrelor versus clopidogrel in PCI patients. In the CHAMPION (Cangrelor versus Standard Therapy to Achieve Optimal Management of Platelet Inhibition) PCI trial, after 70% of patients were enrolled, an interim analysis of the 48-h primary endpoint was performed to determine whether the intended sample size (n = 9,000) needed expanding up to a maximum of 15,000 (61,62). The Adaptive Charter also considered potential enrichment with more diabetic, troponin-positive, or clopidogrel-naive patients if it would enhance statistical power. Unfortunately, there was no interim evidence that cangrelor was superior to clopidogrel, and the trial was stopped early for futility.
The more recent CHAMPION PHOENIX trial also planned for adaptive sample size re-estimation; but, in this instance, because the interim analysis showed clear evidence of cangrelor’s superiority, there was no need to expand beyond the original sample size target of 10,900 patients (63).
These 2 examples provide a reality check to the burgeoning enthusiasm some trialists express about adaptive designs. If a trial is well planned, with a realistic size and alternative hypothesis, then the “promising zone” needing actual expansion of trial size is a relatively narrow window of opportunity. We favor incorporating pre-planned adaptive sample size re-estimation into clinical trial designs, but investigators should realize that the likelihood of actively changing the study size or patient eligibility composition (enrichment) is modest. Thus, organizational simplicity, rather than complex statistical algorithms, is recommended. Also, these calculations can be nuanced, and a statistician experienced in adaptive design methodology should be involved.
Despite these caveats, small biotechnology or medical device companies that do not have the initial resources to plan an appropriately large trial upfront often consider an adaptive approach. Thus, they start with a smaller trial with a potentially unrealistic treatment-effect size, and then use “positive” interim data to persuade funders to expand the trial. This raises an important concern about adaptive designs: the implicit leaking of interim findings beyond the strict confidentiality of the DMC. Only the adaptive decision makers should be privy to interim results. If the rationale for a trial’s adaptive increase in size is known, people will infer the nature of the interim findings. It is a matter of debate as to whether such wider leakage compromises the trial’s integrity (e.g., by altering patient recruitment patterns).
The Central Illustration summarizes the key issues in the diverse collection of design topics we have tackled. In this series of 4 consecutive papers on clinical trials (2 on analysis and reporting, 2 on design) the aim has been to cover those statistical and scientific issues of importance, with a focus on practical insights of relevance to cardiologists.
There is a substantial number of published papers of a more technical nature that statisticians need to master, but such issues tend to be secondary in importance compared with grasping the essential nontechnical factors we have discussed, many of which represent the application of common sense to trial design and statistics. There were some topics we chose not to tackle. For example, Bayesian methods are absent, partly because it is hard to do them justice in a few pages, but also reflecting our view that they have a limited role: there is a paucity of examples where their use in cardiology trials achieved insights not reachable by conventional methods.
It is our hope that this series may help clinical trialists and sponsors to more effectively design studies, statisticians interfacing with study leadership to bring forward the most relevant issues to jointly address, and cardiologists to critically interpret and appraise published studies so as to effectively translate clinical trial evidence to patient care.
The authors have reported that they have no relationships relevant to the contents of this paper to disclose.
- Abbreviations and Acronyms
- ACS = acute coronary syndrome
- CI = confidence interval
- DMC = Data Monitoring Committee
- MI = myocardial infarction
- OMT = optimal medical therapy
- PCI = percutaneous coronary intervention
- 2015 American College of Cardiology Foundation
- Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER). Guidance for industry: non-inferiority clinical trials. Draft guidance. 2010. Available at: http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm202140.pdf. Accessed November 2, 2015.
- Brown M.J., Palmer C.R., Castaigne A., et al.
- Stone G.W., Rogers C., Hermiller J., et al., for the FilterWire EX Randomized Evaluation (FIRE) Investigators
- Hoffman-LaRoche. A study of roactemra/actemra (tocilizumab) in comparison to etanercept in patients with rheumatoid arthritis and cardiovascular disease risk factors. 2015. Available at: https://clinicaltrials.gov/ct2/show/NCT01331837. Accessed November 2, 2015.
- Food and Drug Administration, Center for Drug Evaluation and Research (CDER). Guidance for industry: Diabetes mellitus—evaluating cardiovascular risk in new antidiabetic therapies to treat type 2 diabetes. 2008. Available at: http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm071627.pdf. Accessed November 2, 2015.
- ISIS-4 (Fourth International Study of Infarct Survival) Collaborative Group
- Pocock S.J., Gersh B.J.
- New York University School of Medicine. International Study of Comparative Health Effectiveness With Medical and Invasive Approaches (ISCHEMIA). 2015. Available at: https://clinicaltrials.gov/ct2/show/NCT01471522. Accessed November 2, 2015.
- Stone G.W., Hochman J.S., Williams D.O., et al.
- Palmerini T., Benedetto U., Biondi-Zoccai G., et al.
- Windecker S., Stortecky S., Stefanini G.G., et al.
- Ellenberg S.S., Fleming T.R., DeMets D.L.
- Zannad F., Gattis Stough W., McMurray J.J.V., et al.
- Committee for Medicinal Products for Human Use (CHMP). Reflection paper on methodological issues in confirmatory clinical trials planned with an adaptive design. European Medicines Agency. 2007. Available at: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003616.pdf. Accessed November 2, 2015.
- Food and Drug Administration, Center for Drug Evaluation and Research (CDER), Center for Biologics Evaluation and Research (CBER). Guidance for industry: adaptive design clinical trials for drugs and biologics. Draft guidance. 2010. Available at: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM201790.pdf. Accessed November 2, 2015.
- Mehta C., Gao P., Bhatt D.L., et al.