# Nonproportional Hazards for Time-to-Event Outcomes in Clinical Trials

*JACC* Review Topic of the Week

- Received June 13, 2019
- Revision received August 20, 2019
- Accepted August 26, 2019
- Published online October 14, 2019.

## Author Information

- John Gregson, PhD^{a},^{∗} (John.gregson{at}lshtm.ac.uk), @GreggWStone
- Linda Sharples, PhD^{a}
- Gregg W. Stone, MD, PhD^{b},^{c}
- Carl-Fredrik Burman, PhD^{d}
- Fredrik Öhrn, PhD^{d}
- Stuart Pocock, PhD^{a}

- ^{a}Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, United Kingdom
- ^{b}Population Health Sciences and Policy, Icahn School of Medicine at Mount Sinai, New York, New York
- ^{c}The Cardiovascular Research Foundation, New York, New York
- ^{d}Statistical Innovation, Data Science and Artificial Intelligence, Research and Development, AstraZeneca, Gothenburg, Sweden

^{∗}**Address for correspondence:** Dr. John Gregson, Department of Medical Statistics, London School of Hygiene and Tropical Medicine, Keppel Street, WC1E 7HT, London, United Kingdom.

## Central Illustration

## Highlights

• Trials with time-to-event outcomes are usually analyzed using PH models, but with non-PH this may not be the best choice.

• Restricted mean survival time can be a useful alternative with an early treatment effect.

• Milestone analysis and RMST may be useful with an early effect that attenuates later. Accelerated failure time models are a further alternative.

• The design and analysis of trials should consider how to handle non-PH.

## Abstract

Most major clinical trials in cardiology report time-to-event outcomes using the Cox proportional hazards model so that a treatment effect is estimated as the hazard ratio between groups, accompanied by its 95% confidence interval and a log-rank p value. But nonproportionality of hazards (non-PH) over time occurs quite often, making alternative analysis strategies appropriate. This review presents real examples of cardiology trials with different types of non-PH: an early treatment effect, a late treatment effect, and a diminishing treatment effect. In such scenarios, the relative merits of a Cox model, an accelerated failure time model, a milestone analysis, and restricted mean survival time are examined. Some post hoc analyses for exploring any specific pattern of non-PH are also presented. Recommendations are made, particularly regarding how to handle non-PH in pre-defined Statistical Analysis Plans, trial publications, and regulatory submissions.

- clinical trials
- Cox proportional hazards
- nonproportional hazards
- statistics
- time-to-event outcomes
- trial design

Clinical trials in cardiovascular disease often involve a time-to-event outcome, whereby patients are followed up from randomization until the occurrence of a cardiovascular event or the end of the study. In such trials, a hazard ratio estimated from a Cox proportional hazards (PH) model is often reported as the main measure of treatment effect. The hazard ratio is the ratio of the event rate at any given time in the treatment group relative to the control group, and in the Cox model, the hazard ratio is assumed to remain the same throughout follow-up. However, in some instances, particularly where experimental and control treatments are very different, this is unlikely to be true. For example, in surgical trials, a more aggressive or invasive strategy is sometimes associated with a higher early procedural risk but a lower long-term risk (1,2). In drug trials, the effect of treatment may not materialize until several months, or even years, after treatment initiation (3,4). Non-PH describes situations such as these where the hazard ratio is not constant over time. In such cases, an overall hazard ratio may not be the most informative summary of the treatment effect and alternative methods of analysis may be more suitable. However, there is a lack of practical guidance as to the best methods to assess the PH assumption and which methods to use for analysis for the various types of non-PH. We therefore applied several methods for analysis to 4 clinical trials in cardiovascular disease. We describe the types of non-PH that occur and discuss the pros and cons of each method. We conclude with recommendations on how to prepare for and then tackle non-PH in future trials.

## Methods

### Data

We used data from 4 clinical trials with time-to-event outcomes, chosen as key examples of the different types of treatment effects over time that can occur.

The ASCOT (Anglo-Scandinavian Cardiac Outcomes Trial) was a randomized trial evaluating 2 experimental treatments using a 2 × 2 factorial design (5,6). At baseline, 19,257 patients were randomized to 1 of 2 antihypertensive treatments: an amlodipine-based regime or an atenolol-based regime. This part of the trial is called the ASCOT-BPLA (ASCOT Blood-Pressure Lowering Arm) (6), and our analysis of ASCOT-BPLA used the secondary endpoint of cardiovascular mortality. A subset of 10,305 patients with nonfasting total cholesterol concentrations of 6.5 mmol/l or less were also randomized to either atorvastatin or placebo. This part of the trial is called the ASCOT-LLA (ASCOT Lipid-Lowering Arm) (5). Our analysis of ASCOT-LLA used the secondary endpoint of total coronary events.

The CHARM (Candesartan in Heart Failure—Assessment of Reduction in Mortality and Morbidity) program randomized 7,599 patients with chronic heart failure to candesartan or placebo in 3 randomized trials assessing the impact of candesartan on time to first heart failure hospitalization or cardiovascular death (7). We analyzed data from the CHARM-Overall program, which included patients from all 3 trials, had a primary outcome of all-cause death, and a median follow-up of 3.1 years.

The EXCEL (Evaluation of Xience Versus Coronary Artery Bypass Surgery for Effectiveness of Left Main Revascularization) trial randomized 1,905 eligible patients with left main coronary artery disease to either the coronary artery bypass grafting (CABG) group or the percutaneous coronary intervention (PCI) group with fluoropolymer-based cobalt–chromium everolimus-eluting stents (1). The primary outcome, and the focus of our analyses, was a composite outcome of death, stroke, or myocardial infarction during the first 3 years of follow-up.

### Statistical methods

#### Identifying non-PH

To assess whether there is a statistically significant deviation from PH, we fit an interaction between the estimate of the (log) hazard ratio and time (modeled as a linear covariate), as suggested by Cox (8). Several further methods (e.g., Grambsch-Therneau test) have been suggested for detection of non-PH, but none are clearly superior (9). Graphically assessing the extent of non-PH can also be useful. We used plots of the smoothed scaled Schoenfeld residuals against follow-up time, which show a smoothed estimate of the log hazard ratio against follow-up time (10). If the PH assumption is true, then the underlying log hazard ratio is constant over time, and so an approximately horizontal line is expected in the plot.
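
The idea behind the residual plot can be sketched in a few lines of Python. This is a minimal illustration, not the Stata analysis used in this review: it fits a Cox model with a single binary treatment covariate by Newton-Raphson, returns the unscaled Schoenfeld residuals (the plots described above use scaled, smoothed residuals), and summarizes their trend with a least-squares slope against time. The function names and the toy data are hypothetical.

```python
import math


def _cox_pass(data, beta):
    """One pass over time-sorted (time, event, treated) rows at a given beta.

    Returns the score, the information, and the unscaled Schoenfeld residuals
    as (event_time, residual) pairs. Ties use the Breslow convention.
    """
    n1 = sum(1 for _, _, x in data if x == 1)  # treated patients still at risk
    n0 = len(data) - n1                        # control patients still at risk
    score, info, resid = 0.0, 0.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        j = i
        while j < len(data) and data[j][0] == t:
            j += 1
        # Expected treatment indicator among everyone at risk at time t.
        w1 = n1 * math.exp(beta)
        p = w1 / (w1 + n0)
        for _, event, x in data[i:j]:
            if event:
                score += x - p
                info += p * (1.0 - p)
                resid.append((t, x - p))
        # Everyone observed at time t now leaves the risk set.
        for _, _, x in data[i:j]:
            if x == 1:
                n1 -= 1
            else:
                n0 -= 1
        i = j
    return score, info, resid


def schoenfeld_residuals(times, events, treated, max_iter=50):
    """Fit the log hazard ratio, then return (beta, residuals) at the fit."""
    data = sorted(zip(times, events, treated))
    beta = 0.0
    for _ in range(max_iter):
        score, info, _ = _cox_pass(data, beta)
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return beta, _cox_pass(data, beta)[2]


def slope_against_time(resid):
    """Least-squares slope of the residuals on event time."""
    t_bar = sum(t for t, _ in resid) / len(resid)
    r_bar = sum(r for _, r in resid) / len(resid)
    sxy = sum((t - t_bar) * (r - r_bar) for t, r in resid)
    sxx = sum((t - t_bar) ** 2 for t, _ in resid)
    return sxy / sxx


# Hypothetical toy data: control events occur early, treated events occur late.
toy_times = [1, 2, 3, 4, 5] + [20] * 5 + [11, 12, 13, 14, 15] + [20] * 5
toy_events = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5
toy_treated = [0] * 10 + [1] * 10
toy_beta, toy_resid = schoenfeld_residuals(toy_times, toy_events, toy_treated)
toy_slope = slope_against_time(toy_resid)
```

Under PH the slope should be close to zero. In the toy data it is clearly positive, because the log hazard ratio rises over follow-up.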

#### Estimation methods

We used 4 statistical methods for the estimation of the overall treatment effect (Central Illustration, Figure 1). Our first analysis used the Cox PH model to estimate the hazard ratios associated with treatment.

Our second analysis used an accelerated failure time model to estimate time ratios associated with treatment. The time ratio describes the estimated delay until an event occurs with treatment relative to the control group. For example, a time ratio of 2 would mean that the time until an event occurs is twice as long in the treatment group relative to the control group, everything else being equal. There are several types of accelerated failure time models; we used the log-logistic model for the baseline hazard function, because unlike some other models (e.g., exponential) it is not restricted to PH.
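
To see why a log-logistic accelerated failure time model is not restricted to PH, consider the sketch below. It assumes the standard log-logistic survival function S(t) = 1 / (1 + (t/α)^γ), with treatment multiplying the scale α by the time ratio. The values of α and γ are hypothetical; only the time ratio of 1.51 is taken from this review.

```python
def loglogistic_survival(t, alpha, gamma):
    """Event-free probability at time t (scale alpha, shape gamma)."""
    return 1.0 / (1.0 + (t / alpha) ** gamma)


def loglogistic_hazard(t, alpha, gamma):
    """Hazard at time t > 0 for the same distribution."""
    u = (t / alpha) ** gamma
    return (gamma / alpha) * (t / alpha) ** (gamma - 1.0) / (1.0 + u)


def hazard_ratio(t, alpha, gamma, time_ratio):
    """Treatment vs. control hazard ratio implied by a constant time ratio."""
    treated = loglogistic_hazard(t, alpha * time_ratio, gamma)
    control = loglogistic_hazard(t, alpha, gamma)
    return treated / control


# Hypothetical baseline parameters; time ratio as reported for ASCOT-LLA.
ALPHA, GAMMA, TIME_RATIO = 10.0, 1.5, 1.51
for years in (0.5, 1.0, 3.0, 10.0, 30.0):
    print(years, round(hazard_ratio(years, ALPHA, GAMMA, TIME_RATIO), 3))
```

With a constant time ratio above 1, the implied hazard ratio starts near `TIME_RATIO ** -GAMMA` and drifts toward 1 as follow-up lengthens, so the model can accommodate a treatment effect that attenuates on the hazard scale.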

Our third analysis estimated the difference in the percentage of patients with an event in the treatment group compared with in the control group at a fixed time since baseline, known as the milestone time. We refer to this analysis as a milestone analysis. To have credibility, the chosen milestone time should be pre-specified before data analysis. The percentage of patients with an event in each group was estimated using the Kaplan-Meier method, and the Greenwood formula was used to estimate standard errors. This method is similar to using logistic regression or calculating the odds ratio associated with treatment, but also accounts for loss to follow-up. The choice of milestone time is an important aspect in such analyses. In the ASCOT-LLA, CHARM, and EXCEL trials, we chose a milestone time of 3 years, close to the median follow-up times in each of these studies (3.3 years, 3.1 years, and 3.0 years, respectively). A milestone time of 5.5 years was chosen in ASCOT-BPLA (median follow-up 5.5 years).
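
A milestone analysis of this kind can be sketched as follows. This is a minimal pure-Python illustration, not the Stata code used for the analyses here: it assumes each group is supplied as a pair of lists (follow-up times, event indicators), and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def km_at(times, events, milestone):
    """Kaplan-Meier event-free probability at `milestone` and its
    Greenwood standard error."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, greenwood = 1.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= milestone:
        t = data[i][0]
        j, d = i, 0
        while j < len(data) and data[j][0] == t:
            d += data[j][1]
            j += 1
        if d > 0:
            surv *= 1.0 - d / at_risk
            if at_risk > d:  # the term is undefined once survival hits zero
                greenwood += d / (at_risk * (at_risk - d))
        at_risk -= j - i  # events and censorings both leave the risk set
        i = j
    return surv, surv * math.sqrt(greenwood)


def milestone_difference(control, treated, milestone):
    """Difference in event proportions (treated minus control) at the
    milestone, with its standard error and a two-sided p value."""
    s0, se0 = km_at(control[0], control[1], milestone)
    s1, se1 = km_at(treated[0], treated[1], milestone)
    diff = (1.0 - s1) - (1.0 - s0)
    se = math.sqrt(se0 ** 2 + se1 ** 2)
    z = diff / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, se, p_value
```

Without censoring, the Greenwood standard error reduces to the usual binomial standard error, which is a convenient sanity check.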

Our fourth analysis estimated the difference in restricted mean survival time (RMST) between groups, up until a fixed milestone time. Survival here refers to event-free survival (i.e., the absence of an outcome event), rather than simply continuing to be alive. The RMST in each group is the mean time spent free from an outcome event in each group up until the milestone time, after adjusting for loss to follow-up. The RMST difference can be represented as the difference in areas under the Kaplan-Meier plots for each group (Figure 1). Following the advice of Royston and Parmar (11), we modeled event-free survival separately in each of the treatment and control groups using a flexible parametric survival model with 3 degrees of freedom (except in the EXCEL trial, wherein we used 2 degrees of freedom to achieve model convergence). RMST can also be calculated using nonparametric methods (12).
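
The area-under-the-curve interpretation can be illustrated with a nonparametric sketch. Note that this is not the flexible parametric approach used in this review: it simply integrates the Kaplan-Meier step function up to the milestone time, and it does not compute a standard error. The function names are hypothetical.

```python
def rmst(times, events, tau):
    """Restricted mean event-free time up to `tau`: the area under the
    Kaplan-Meier step function between 0 and tau."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, area, last_t = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] < tau:
        t = data[i][0]
        j, d = i, 0
        while j < len(data) and data[j][0] == t:
            d += data[j][1]
            j += 1
        area += surv * (t - last_t)  # rectangle since the previous time point
        last_t = t
        if d > 0:
            surv *= 1.0 - d / at_risk
        at_risk -= j - i
        i = j
    # Final rectangle up to tau. If follow-up ends before tau, the last
    # estimate is carried forward, which is an assumption of this sketch.
    area += surv * (tau - last_t)
    return area


def rmst_difference(control, treated, tau):
    """Treated minus control RMST; each group is (times, events)."""
    return rmst(treated[0], treated[1], tau) - rmst(control[0], control[1], tau)
```

With no censoring, `rmst` equals the sample mean of min(T, tau), which makes the "mean time spent event-free up to the milestone" reading explicit.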

For each method, we estimated the appropriate treatment effect, its 95% confidence interval (CI) and a p value from the corresponding hypothesis test.

In addition, we used some methods more suitable for post hoc analyses. We used piecewise hazards models, whereby time since baseline was split into segments, and hazard ratios were calculated separately for each period of time by applying a Cox PH model within each period.
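
The splitting of follow-up into periods can be sketched as below. As a simplification, this example swaps the within-period Cox model for a crude event-rate ratio (events per unit of person-time) in each period, which approximates the hazard ratio only if the hazard is roughly constant within a period. The function names are hypothetical, and the sketch assumes at least one event per group in every period.

```python
import math


def events_and_person_time(times, events, cut_points):
    """Events and person-time contributed to each follow-up period.

    Periods are (0, c1], (c1, c2], ..., (ck, infinity).
    """
    bounds = [0.0] + sorted(cut_points) + [math.inf]
    rows = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        person_time = sum(max(0.0, min(t, hi) - lo) for t in times)
        n_events = sum(1 for t, e in zip(times, events) if e and lo < t <= hi)
        rows.append((n_events, person_time))
    return rows


def piecewise_rate_ratios(control, treated, cut_points, z=1.96):
    """Treated vs. control rate ratio with an approximate 95% CI per period.

    Each group is a (times, events) pair.
    """
    c_rows = events_and_person_time(control[0], control[1], cut_points)
    t_rows = events_and_person_time(treated[0], treated[1], cut_points)
    results = []
    for (e0, pt0), (e1, pt1) in zip(c_rows, t_rows):
        ratio = (e1 / pt1) / (e0 / pt0)
        se_log = math.sqrt(1.0 / e1 + 1.0 / e0)  # SE of the log rate ratio
        lower = ratio * math.exp(-z * se_log)
        upper = ratio * math.exp(z * se_log)
        results.append((ratio, lower, upper))
    return results
```

Patients contribute person-time to a period only while still under follow-up, so later periods are based on those who remained event-free and uncensored through the earlier ones.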

Finally, we assessed the number of patients that would be required to achieve 80% power under each analysis method, assuming the observed time pattern of treatment difference is the truth. The standard error of each estimated treatment effect is approximately inversely proportional to the square root of the sample size, and using this relationship, we calculated the approximate sample size required to achieve 80% power (see the Online Appendix). Analyses were done in Stata version 15.1 (Stata Corp., College Station, Texas); flexible parametric models were implemented in the stpm2 package.
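
The scaling argument can be written out as a short sketch. It assumes a two-sided 5% significance level and a normally distributed test statistic, and it takes the observed effect as the truth. The exact calculation in the Online Appendix may differ, and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def z_from_two_sided_p(p_value):
    """Absolute z statistic corresponding to a two-sided p value."""
    return _STD_NORMAL.inv_cdf(1.0 - p_value / 2.0)


def required_sample_size(n_observed, z_observed, power=0.80, alpha=0.05):
    """Approximate sample size giving the requested power.

    The standard error scales with 1 / sqrt(n), so the z statistic scales
    with sqrt(n). The target z is the critical value plus the power quantile.
    """
    z_target = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) + _STD_NORMAL.inv_cdf(power)
    return n_observed * (z_target / abs(z_observed)) ** 2
```

For example, `required_sample_size(19257, z_from_two_sided_p(0.31))` applies the calculation to the RMST result reported for ASCOT-BPLA. Halving the observed z statistic quadruples the required sample size, which is why methods that give a weaker signal for a given pattern of treatment effect need many more patients.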

## Results

The cumulative incidence of events in each treatment group is shown for each of the 4 studies (Figure 2), with the pattern of treatment effect appearing to differ in each study. In the ASCOT-LLA trial (Figure 2A) there was a steady divergence between cumulative incidence curves over time. This pattern is typical when PH are a reasonable assumption. In the ASCOT-BPLA trial (Figure 2B), cumulative incidence curves in each group were very similar for the first 2.5 years of follow-up and then gradually diverged, an example of a delayed treatment effect. Conversely, in the CHARM trial (Figure 2C), the cumulative incidence curves diverged very early during follow-up but then ran parallel to one another after 6 months. This pattern is referred to as an early effect. Finally, in the EXCEL trial (Figure 2D), the curves diverged early on, but the early effect of treatment was not maintained, with the cumulative incidence curves converging later, a pattern we term a diminishing treatment effect.

We applied each of our estimation methods to each of these 4 studies.

### ASCOT-LLA

In the ASCOT-LLA trial, total coronary events were compared among patients receiving atorvastatin or placebo. The treatment effect observed was consistent with PH: there was an approximately horizontal line throughout follow-up in the plot of Schoenfeld residuals, and the test for a treatment-time interaction was nonsignificant (p = 0.90) (Online Figure 1).

We found a statistically significant reduction in total coronary events regardless of the method used for analysis (Table 1). The hazard ratio of 0.71 from a Cox PH model and the time ratio of 1.51 from an accelerated failure time model have similar interpretations. In the former, the hazard for coronary events is 29% lower with atorvastatin, and in the latter, the time until a coronary event is delayed by 51%. In the milestone analysis, the 3-year event rate was 1.3% lower with atorvastatin and the RMST difference estimated that on average a patient was event-free 7.6 days longer with atorvastatin in the 3 years after randomization.

Both the milestone analysis and RMST difference estimate an absolute effect of treatment, whereas the Cox and accelerated failure time models estimate a relative effect (hazard ratios or time ratios, respectively). However, when the PH assumption is valid, milestone analysis and RMST have disadvantages. Data collected after the milestone time are ignored, resulting in a loss of power, and so more patients would be required in a trial using these methods as the primary analysis (Table 2). The choice of milestone time is somewhat arbitrary, but it is an important feature of the study design that can influence both clinical interpretation and statistical power. If set too early, many events will be excluded. But if set too late, when few patients remain at risk, events occurring near the end of the study may exert an unduly large influence on the study results.

### ASCOT-BPLA

Amlodipine was compared with atenolol in the ASCOT-BPLA trial. There was clear evidence of non-PH (p = 0.0013) (Online Figure 1). A highly significant reduction in cardiovascular death was seen with all methods except for RMST. The hazard ratio comparing amlodipine to atenolol was 0.76 (p = 0.0012), the time ratio was 1.29 (p = 0.0010), and the estimated reduction in cardiovascular deaths by 5.5 years was 0.8% (p = 0.0019) (Table 1). However, the estimated RMST difference of 2.9 days was not statistically significant (p = 0.31). When a delayed treatment effect is present, there is a large reduction in statistical power if RMST is chosen as the primary analysis. A trial using RMST would require many more patients than a trial using any of the other methods (Table 2). This is because RMST analyses are most strongly influenced by survival early during follow-up, when there is little difference between treatment groups. Furthermore, a careful interpretation is warranted of the estimated RMST difference of 2.9 days because it is plausible that the differences in event-free survival would continue to accrue beyond the milestone time. For example, suppose we were to take a longer-term perspective and consider RMST difference at 10 years. Suppose also that the risk of cardiovascular deaths between 5.5 years and 10 years were the same in both groups so that there is no further benefit or harm related to treatment after the milestone time. We would then expect the cumulative incidence curves for each group to run approximately parallel to one another between 5.5 and 10 years. The RMST difference (which can be visualized as the area between the cumulative incidence curves) would therefore continue to accrue and would be greater at 10 years than it was at 5.5 years (∼16.2 days vs. 2.9 days) (Online Figure 2).
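
The 10-year figure can be checked with rough arithmetic. Assuming the curves stay exactly parallel after 5.5 years, separated by the reported 0.8% absolute difference, the additional area between them is a rectangle:

```latex
\Delta\mathrm{RMST}(10\ \text{y})
  \approx \Delta\mathrm{RMST}(5.5\ \text{y}) + 0.008 \times (10 - 5.5) \times 365.25\ \text{days}
  \approx 2.9 + 13.1
  = 16.0\ \text{days}
```

This is close to the ∼16.2 days quoted above; the small gap plausibly reflects rounding of the 0.8% difference.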

### CHARM

In the CHARM-Overall study, in which the effect of candesartan was compared with placebo on all-cause death, there was strong evidence against PH (p value for treatment-time interaction = 0.009). There was an apparent early effect that lasted only for the first 6 to 18 months following randomization (Figure 2C, Online Figure 1). Evidence against PH was even stronger when comparing the hazard ratio before 6 months to the hazard ratio thereafter (0.59 vs. 1.00; p value for interaction = 0.0001). In analyses of the effect of candesartan on mortality, the p value was close to 0.05 for all methods except for the RMST difference, where there was strong evidence that mean 3-year survival was longer with candesartan (21.0 days, p = 0.0008). The Cox model (hazard ratio: 0.91; p = 0.055) failed to demonstrate a treatment benefit, whereas results from an accelerated failure time model (time ratio: 1.11; p = 0.032) and a milestone analysis (1.95% 3-year reduction in deaths, p = 0.044) were both just statistically significant. RMST difference will generally be more statistically powerful than the other methods with an early treatment effect (Table 2). In general, any treatment difference in early events has a greater influence on the RMST difference than events occurring later. This can be visualized by considering the Kaplan-Meier plot in Figure 2C, in which a gap between the curves opens up before 6 months, and the area between the 2 curves continues to accumulate with time. In contrast, the relative importance of early and late events is broadly similar with the other 3 methods so that the similar event rate later in follow-up had a greater diluting influence on the apparent early treatment benefit.

Data from the CHARM study also demonstrate the sensitivity of findings to the choice of milestone time. Although there was some evidence of a difference in 3-year survival, had we chosen a different milestone time, for example, 34 or 38 months, the difference would not have been statistically significant (p = 0.093 and p = 0.096, respectively). On the other hand, if one takes a short-term perspective with the milestone set at 6 months, then the mortality difference is highly significant (4.9% vs. 2.9%, p < 0.0001). It is also worth noting that for treatment effects that decrease over time, the exclusion of data after the milestone time does not lead to a loss in statistical power, because including later deaths would further dilute the significance of the early treatment effect.

### EXCEL

The EXCEL study compared a composite outcome of stroke, myocardial infarction, or death in patients with left main coronary artery disease treated with PCI or CABG. There was a much higher procedural risk of the composite outcome with CABG (7.9% within 30 days) than with PCI (4.9%), but by 3-year follow-up, the proportions of patients with the composite outcome were similar for CABG (14.7%) and PCI (15.4%) (Figure 2D). Unsurprisingly, therefore, there was very strong evidence against the PH assumption with the estimated log hazard ratio differing markedly over follow-up (p for treatment-time interaction = 0.003) (Online Figure 1).

None of the main methods for estimation demonstrated a clear treatment benefit for either intervention, although there were notable differences between the methods. Results from a Cox model (hazard ratio for PCI vs. CABG: 1.01; p = 0.97) or from an accelerated failure time model (time ratio: 1.06; p = 0.88) do not provide much insight. Naively interpreted, these estimates indicate a lack of difference between groups, whereas the 2 treatments clearly differ in the timing of the risk of outcomes occurring with each intervention. The underlying assumption of PH used in a Cox model and assumption of a constant time ratio in the accelerated failure time model were clearly not satisfied (treatment-time interaction p < 0.001 for both). The milestone analysis at 3 years for the percentage with the primary outcome (treatment difference +0.5%; 95% CI: −2.7 to 3.7) is readily interpreted and does not make any modeling assumptions, but it fails to take into account the difference in the timing of events during follow-up. The RMST difference is perhaps the most useful of the 4 methods for summarizing the data from the EXCEL study. It takes into account the fact that although the total number of events was similar in the 2 groups, they tended to occur later in the PCI group, thereby lengthening the time a patient was event-free. The estimated gain in event-free survival up to 3 years is 18.3 days (95% CI: −11.1 to 47.8), but the difference is not statistically significant.

To further understand the results from the EXCEL trial, we performed 3 sets of post hoc analyses. Because the primary outcome was a composite of clinically heterogeneous events, we present cumulative incidence curves separately for each component (Online Figure 3). The lower procedural risk with PCI is largely due to a reduction in myocardial infarction and stroke, rather than death. If considered alongside the patterns of the individual events through time, this analysis may be helpful in suggesting how future event rates might differ in the 2 groups. For example, it may suggest whether future Kaplan-Meier curves will continue to converge, cross over, or progress in parallel. We next used piecewise hazards models, where we split follow-up time into 3 segments representing procedural, mid-term, and long-term follow-up, calculating hazard ratios separately within each segment. The hazard ratios for PCI versus CABG were 0.61 (95% CI: 0.42 to 0.88) within 30 days of randomization, 1.05 (95% CI: 0.64 to 1.70) from 30 days to 1 year, and 1.93 (95% CI: 1.25 to 2.97) from 1 year to 3 years (Figure 3A). This simple approach can provide useful insight into the underlying patterns of risk. Finally, we generated additional diagnostic graphical displays (Figures 3B and 3C). Figure 3B shows the difference in event-free survival estimates and 95% CI throughout follow-up, which is equivalent to performing a milestone analysis at each day during follow-up. This visually demonstrates that the early benefit of PCI is gradually eroded over time by an increased post-procedural risk. Figure 3C shows the difference in mean event-free survival time over study follow-up. The upward trend of the curve shows that the early benefit due to the reduced procedural risk with PCI has an effect on RMST out to nearly 3 years. For any choice of milestone time in the range up to 2 years, the treatment difference in RMST is statistically significant. The greater number of primary events after PCI thereafter reduces the apparent benefit, whereas the CI increases in width so that the treatment effect is no longer statistically significant.

One caution in all these post hoc analyses is that no correction is made for multiple testing, so they need to be perceived as exploratory analyses.

## Discussion

For clinical trials of time-to-event outcomes, it has become standard practice to use Cox PH models both for trial design (e.g., power calculations) and statistical analysis. However, this may not be the best approach when the effect of treatment varies over time. Our analyses of 4 cardiology trials demonstrate some alternative approaches and outline some of their advantages and disadvantages under various patterns of treatment effect.

When PH are satisfied, the Cox PH model is the most statistically powerful method, and hazard ratios are readily understood by clinicians. We therefore see little practical reason to use alternative analysis strategies as the pre-specified primary analysis when deviation from PH is not expected, despite recent critiques of the hazard ratio for estimating treatment effects (13). However, when major deviations are anticipated, it may be possible to adapt the design. In studies where an early treatment effect is anticipated, it may be possible to recruit fewer patients while maintaining adequate power by using RMST differences as the primary method of analysis rather than the Cox model. In contrast, when a delayed treatment effect is likely, RMST difference is best avoided. In addition, with a delayed effect, the sample size may need to be inflated to allow for the extra variability caused by events occurring at the beginning of the trial when there is no difference between treatment groups, as was done in the CORONA (Controlled Rosuvastatin Multinational Trial in Heart Failure) trial of rosuvastatin (14).

In most cases the type of treatment effect is unknown in advance, but the analysis method needs to be pre-specified. Unfortunately, there is no clear “best” method across all types of treatment effects. Although we are aware of several test-based methods that maintain good statistical power to detect differences between treatments across a range of types of non-PH (15,16), these methods only provide a p value without an accompanying estimation method linked to the test. An example is a test based on a series of weighted log-rank tests, where some of the tests counterintuitively weight events occurring later in follow-up as more important than those occurring earlier (16). The p values from such tests indicate whether the pattern of survival differs between treatment groups, but these p values do not identify which treatment is “better” nor quantify how the difference between groups affects patient outcomes. Therefore, we do not recommend methods based only on hypothesis testing.

Non-PH can have important implications for trial design beyond the choice of analysis strategy. When treatment is associated with a lower (or higher) short-term risk that later reverses, it is important that the trial continues for sufficient duration so that the long-term effects of the treatment can be fully understood. Longer-term (i.e., 5-year) results from the EXCEL study will therefore be helpful to further understanding of the risks and benefits of PCI relative to CABG in patients with left main coronary disease. A second implication for trial design is that the stopping criteria used by data monitoring committees should take into account potential non-PH patterns of treatment effect. For instance, the CHARM program Data Safety Monitoring Board did not recommend stopping early even though a planned interim analysis of short-term mortality showed a highly significant reduction in mortality on candesartan (17). Conversely, caution would be required when stopping a trial early for futility if a delayed effect was anticipated.

Post hoc analyses of trials with non-PH can sometimes provide useful insights. A first step is to assess whether the PH assumption is reasonable. Formal statistical testing of PH is sometimes useful, but may miss clinically important deviations from PH in small studies while detecting clinically unimportant deviations from PH in large studies. Graphical displays, including Kaplan-Meier curves and Schoenfeld residuals, may therefore be more useful for this purpose. The key is to determine whether a single hazard ratio captures the effect of treatment with a reasonable degree of accuracy across the entirety of patient follow-up. Clearly this is not the case when the effect of treatment reverses during follow-up or is present for only a minority of follow-up. However, it is difficult to provide precise guidance on when alternative analysis strategies are likely to be useful because this will depend on many factors, including the variation in the hazard ratio over time, the frequency of events over time, the pattern of censoring, as well as clinical considerations.

When the PH assumption is not reasonable, using a piecewise hazards model can be useful. In our analysis of the EXCEL trial, it helped identify periods during which the hazard with PCI was less than, similar to, or greater than the hazard with CABG. A limitation of this methodology is that the post hoc selection of time periods that appear visually different may exaggerate the real differences in hazard ratios over time. The hazard ratios calculated for later time periods also only include survivors of earlier time periods and so are not truly randomized comparisons.

A further post hoc analysis not considered here is to explore whether non-PH has arisen by the combination of clinically distinct subgroups of patients in whom the effect of treatment is different. It is possible to have non-PH overall even though the PH assumption is satisfied within subgroups. Examples where this may have occurred are present in the medical literature (18,19), although we are unaware of any convincing examples in cardiovascular trials to date. In such scenarios, subgroup analyses or stratified Cox PH models or a combination of both may be useful. Studies where patients come off treatment or cross over between treatments during follow-up can lead to non-PH because treatment groups become more similar to one another over time. It is a particular problem for noninferiority trials where a dilution of the treatment effect may lead to an incorrect declaration of noninferiority. In such cases, use of per-protocol analyses, or statistical adjustment for noncompliance and crossover (complier average causal effect analyses) may help to restore PH for the true treatment effect (20). A further issue in noninferiority trials occurs when a noninferiority margin is based on a hazard ratio, but the assumption of PH is clearly inappropriate. In such cases, the noninferiority margin may become inappropriate or difficult to interpret and so it is important to explore alternative analysis strategies.

One major concern is how one incorporates potential non-PH into the pre-defined statistical analysis plan for a major trial. It can be hard to anticipate the existence and pattern of non-PH in a trial, so in most circumstances, the Cox PH model and associated log-rank test will be the pre-defined primary analysis. However, we would encourage statistical analysis plans to document contingency plans for an alternative primary analysis should clear evidence of non-PH be detected when the trial is unblinded for the final analysis. For instance, if clear evidence of a pattern of early treatment effect is reported, then the PH assumption is violated. An analysis using RMST could then be performed (as could have been applied to the CHARM program). Another example is in a meta-analysis of trials for oseltamivir treatment in influenza, where the pre-defined intent of Cox models was replaced by accelerated failure time models, because the former did not “fit the data,” whereas the latter did (21). However, if not pre-specified, to change the primary analysis methodology of the primary outcome in light of lack of model fit is a radical step. Further debate is needed as to when such a step is truly acceptable in the primary publication or regulatory submission or both of a major clinical trial. “How great does the departure from the PH assumption need to be?” and “how can this be done in a way that does not result in an increased probability of false positive findings?” are key questions.

Our review has limitations. First, with only 1 study for each pattern of treatment effect, the generalizability of our study may be questioned. However, our analyses are meant to be illustrative, with some of the findings, for example, those relating to the power of RMST relative to a Cox model under various patterns of treatment effect, already established in statistical publications (22). Second, we did not present data on a fifth type of treatment effect, wherein the Kaplan-Meier curves cross during follow-up (“crossover pattern”), as was observed in the STICH (Surgical Treatment for Ischemic Heart Failure) trial of CABG versus medical therapy (23). In such situations, a single effect estimate is unlikely to accurately capture the effect of treatment so the choice of appropriate statistical analyses would require careful consideration. One would need to consider the relative importance of later versus earlier events and ensure that the study continues for long enough to allow a full understanding of the effect of treatment over time.

In conclusion, serious attention needs to be given to appropriate analysis strategies when non-PH are evident in time-to-event outcomes. It is important to detect the type of non-PH that is present and select the analytical technique most appropriate to that situation. The consequences for more thorough statistical analysis plans, trial publications, and regulatory submissions need a further collective clarity of thought.

## Acknowledgments

The authors thank Daniel Jackson and Jonathan Wessen for their helpful comments on this review.

## Footnotes

Drs. Burman and Öhrn are employees of and hold stock in AstraZeneca. Dr. Pocock has served on the Data and Safety Monitoring Boards of the ASCOT and CHARM trials. Drs. Gregson, Stone, and Pocock are authors of the EXCEL trial. All other authors have reported that they have no relationships relevant to the contents of this paper to disclose.

**Listen to this manuscript's audio summary by Editor-in-Chief Dr. Valentin Fuster on JACC.org.**

**Abbreviations and Acronyms**

- CABG: coronary artery bypass grafting
- CI: confidence interval
- PCI: percutaneous coronary intervention
- PH: proportional hazards
- RMST: restricted mean survival time

- 2019 American College of Cardiology Foundation

## References

- (1) Stone G.W., Sabik J.F., Serruys P.W., et al., for the EXCEL Trial Investigators
- (5) Sever P.S., Dahlöf B., Poulter N.R., et al., for the ASCOT Investigators
- (6) Dahlöf B., Sever P.S., Poulter N.R., et al., for the ASCOT Investigators
- (8) Cox D.R.
- (13) Stensrud M.J., Aalen J.M., Aalen O.O., Valberg M.
- (15) Royston P., Parmar M.K.
- (17) Pocock S., Wang D., Wilhelmsen L., Hennekens C.H.
- (19) Ford I., Norrie J., Ahmadi S.
- (20) Mostazir M., Taylor R.S., Henley W., Watkins E.
