- Received May 18, 2007
- Revision received July 19, 2007
- Accepted August 6, 2007
- Published online November 6, 2007.
- Esther S.H. Kim, MD, MPH⁎,
- Hemant Ishwaran, PhD†,
- Eugene Blackstone, MD, FACC†,‡ and
- Michael S. Lauer, MD, FACC, FAHA⁎,†,§,⁎
- ⁎Reprint requests and correspondence:
Dr. Michael S. Lauer, 6701 Rockledge Drive, Room 10122, Bethesda, Maryland 20892.
Objectives The purpose of this study was to externally validate the prognostic value of age- and gender-based nomograms and categorical definitions of impaired exercise capacity (EC).
Background Exercise capacity predicts death, but its use in routine clinical practice is hampered by its close correlation with age and gender.
Methods For a median of 5 years, we followed 22,275 patients without known heart disease who underwent symptom-limited stress testing. Models for predicted or impaired EC were identified by literature search. Gender-specific multivariable proportional hazards models were constructed. Four methods were used to assess validity: Akaike Information Criterion (AIC), right-censored c-index in 100 out-of-bootstrap samples, the Nagelkerke Index R2, and calculation of calibration error in 100 bootstrap samples.
Results There were 646 and 430 deaths in 13,098 men and 9,177 women, respectively. Of the 7 models tested in men, a model based on a Veterans Affairs cohort (predicted metabolic equivalents [METs] = 18 − [0.15 × age]) had the highest AIC and R2. In women, a model based on the St. James Take Heart Project (predicted METs = 14.7 − [0.13 × age]) performed best. Categorical definitions of fitness performed less well. Even after accounting for age and gender, there was still an important interaction with age, whereby predicted EC was a weaker predictor in older subjects (p for interaction <0.001 in men and 0.003 in women).
Conclusions Several methods describe EC accounting for age- and gender-related differences, but their ability to predict mortality differs. Simple cutoff values fail to fully describe EC’s strong predictive value.
Exercise capacity (EC) is a strong independent predictor of death among men and women (1–5), but its widespread adoption in exercise test interpretation has been hindered by its well-known correlation with age and gender (6,7). For example, a 35-year-old man who achieves 8 metabolic equivalents (METs) on exercise treadmill testing would not be considered to have the same EC as a 64-year-old woman who achieves the same number of METs. Several age- and gender-specific nomograms and categorical definitions have been proposed to describe normative values for predicted EC (3,6–12), but it is not known whether there are substantial differences in their prognostic power or their ability to adequately adjust for age-related effects.
We sought to externally validate the prognostic ability of previously published age- and gender-based nomograms and categorical definitions of EC and test their ability to fully account for age-related differences in a population of consecutive patients without known coronary artery disease who were referred for exercise testing. We deliberately focused only on externally derived models, and none of the models tested were derived from patients from our own institution. All-cause mortality was used as an unbiased and objective end point (13).
Consecutive patients (Table 1) referred for symptom-limited treadmill exercise testing between January 1, 1995, and December 31, 2002, were potentially eligible for study. To minimize possible bias due to training effects, we included only the first test performed for patients who had more than 1 exercise test during this time period. We excluded patients with known coronary artery disease (including silent Q-wave myocardial infarction), heart failure, clinically significant arrhythmias, valvular or congenital heart disease, cardiomyopathy, or end-stage renal disease; patients with a history of prior organ transplantation or pacemaker implantation; patients with abnormal resting electrocardiograms (including left bundle branch block, right bundle branch block, intra-ventricular conduction delay, pre-excitation, pathological Q waves, and >1 mm of ST-segment deviation); and patients without a U.S. Social Security number. We included only patients undergoing testing after January 1, 1995, because before then height and weight were not routinely measured.
Permission to analyze the routinely obtained electronic data from our stress laboratory was given by the Cleveland Clinic Institutional Review Board. The requirement for written informed consent was formally waived.
As described in detail elsewhere (14–16), before exercise testing all patients in our laboratory undergo a structured interview and chart review. Type of diabetes was defined according to treatment (i.e., whether or not insulin was being used). Hypertension was defined as a systolic blood pressure ≥140 mm Hg, diastolic blood pressure ≥90 mm Hg, or use of medications specifically for treating hypertension. Patients were considered cigarette abusers if they regularly smoked currently or within the past year. All patients had height and weight directly measured (not self-reported) before testing. Body mass index was calculated as weight in kilograms divided by height in meters squared.
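The two numeric definitions in this paragraph can be written out as a minimal Python sketch. The function and parameter names are hypothetical and are not taken from the laboratory's software; only the thresholds and the body mass index formula come from the text.

```python
def body_mass_index(weight_kg, height_m):
    """Weight in kilograms divided by height in meters squared."""
    return weight_kg / (height_m ** 2)


def has_hypertension(systolic_mm_hg, diastolic_mm_hg, on_antihypertensive_drugs):
    """Systolic >= 140 mm Hg, diastolic >= 90 mm Hg, or treated hypertension."""
    return (
        systolic_mm_hg >= 140
        or diastolic_mm_hg >= 90
        or on_antihypertensive_drugs
    )
```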
Methods for exercise treadmill testing in our laboratory have been described in detail elsewhere (14,15). Briefly, standard protocols (usually Bruce, modified Bruce, and Cornell) were chosen with a goal test duration between 8 and 12 min. All patients were exercised to exhaustion irrespective of heart rate achieved; however, tests were terminated in case of severe chest discomfort (≥7/10 on self-rating scale), significant arrhythmia, hypotension with evidence of clinical compromise, severe ST-segment changes, systolic blood pressure >250 mm Hg, or patient request. Patients were explicitly told not to grip handrails.
At rest and during each stage of exercise, data were prospectively recorded on-line regarding heart rate, blood pressure, symptoms, ST-segment changes, rhythm, and rating of perceived exertion (on a 1-to-10 scale, where 10 is maximum exertion).
Exercise capacity in METs (where 1 MET is 3.5 ml/kg/min of oxygen consumption) (17) was estimated on the basis of protocol, speed, and grade (11,17). If patients only achieved a portion of the final stage of exercise, credit for EC was “pro-rated” according to how much of the stage was completed. For example, if a patient exercised 1 min of a 3-min stage, they were credited with one-third of the increment increase. Chronotropic response to exercise was defined as the percent of heart rate reserve used (18,19). Heart rate recovery was defined as the change in heart rate between peak exercise and 1 min of recovery. For patients undergoing standard exercise testing or testing with nuclear imaging, a value of ≤12 beats/min was considered abnormal (14,15). For patients undergoing exercise echocardiography, a value of ≤18 beats/min was considered abnormal (20). Frequent ventricular ectopy in recovery was defined as frequent ventricular premature complexes, frequent couplets, bigeminy, trigeminy, ventricular tachycardia (nonsustained and sustained), torsades de pointes, and ventricular fibrillation (16). The ST-segment changes were considered ischemic if there was at least 1 mm of horizontal or down-sloping depression at least 80 ms after the J-point.
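The pro-rating rule and the heart rate recovery thresholds described above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and the stage MET values in the usage note are invented for the example rather than taken from any protocol table.

```python
def prorated_mets(previous_stage_mets, final_stage_mets,
                  minutes_completed, stage_minutes):
    """Credit a partially completed final stage in proportion to time completed."""
    fraction = minutes_completed / stage_minutes
    increment = final_stage_mets - previous_stage_mets
    return previous_stage_mets + fraction * increment


def abnormal_heart_rate_recovery(peak_hr, hr_at_1_min, exercise_echo=False):
    """Apply the cutoffs quoted in the text: <=12 beats/min for standard or
    nuclear testing, <=18 beats/min for exercise echocardiography."""
    recovery = peak_hr - hr_at_1_min
    threshold = 18 if exercise_echo else 12
    return recovery <= threshold
```

For instance, with hypothetical stage values of 7 and 10 METs, a patient who completes 1 min of the 3-min final stage would be credited with `prorated_mets(7.0, 10.0, 1, 3)`, that is, 7 METs plus one-third of the 3-MET increment.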
Prediction of EC
We systematically searched the literature for age- and gender-based regression equations of predicted EC (or oxygen consumption) and for dichotomous age- and/or gender-based definitions of impaired EC. Where multiple models were obtained from the same institution, we chose the one based on the largest patient sample. Names and detailed descriptions of these models are given in Tables 2 and 3. For example, among men the “VA [Veterans Affairs] referral model” (7) predicts peak METs as: 18 − 0.15 × age (Table 2, top row). For women, the “St. James model” (5) predicts peak METs as: 14.7 − 0.13 × age (Table 3, top row). The Cooper models (3,10) define low EC as being below certain values for different age groups (Table 2, row 7; Table 3, row 3). The Mayo models (9) define impaired EC as <7 METs in men and <5 METs in women irrespective of age. It should be noted that none of the models we tested were derived from Cleveland Clinic patients; we deliberately focused only on externally derived models.
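As a minimal illustration of how these equations translate into percent of predicted EC, the Python sketch below encodes only the two regression equations and the Mayo cutoffs quoted above. The function names are hypothetical, and the other models listed in Tables 2 and 3 are not reproduced here.

```python
def predicted_mets_va_men(age_years):
    """VA referral model for men: 18 - 0.15 x age."""
    return 18.0 - 0.15 * age_years


def predicted_mets_st_james_women(age_years):
    """St. James model for women: 14.7 - 0.13 x age."""
    return 14.7 - 0.13 * age_years


def percent_predicted(achieved_mets, predicted_mets):
    """Achieved EC expressed as a percentage of the model's predicted EC."""
    return 100.0 * achieved_mets / predicted_mets


def mayo_impaired(achieved_mets, is_male):
    """Mayo categorical definition: <7 METs in men, <5 METs in women."""
    cutoff = 7.0 if is_male else 5.0
    return achieved_mets < cutoff
```

Applied to the example from the introduction, a 35-year-old man who achieves 8 METs has a predicted EC of 12.75 METs and reaches roughly 63% of predicted, whereas a 64-year-old woman who achieves 8 METs has a predicted EC of 6.38 METs and reaches roughly 125% of predicted.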
The primary outcome was all-cause death up to July 11, 2006. We ascertained deaths by using the Social Security Death Index (21). We have previously shown that this measure has approximately a 97% sensitivity for “detecting death” in our laboratory (15); others have documented a specificity of >99% (21).
All analyses were gender-specific. For descriptive purposes, we constructed Kaplan-Meier plots of cumulative mortality according to whether or not 85% of predicted EC was achieved. This 85% cutoff was chosen on the basis of suggestions in prior literature (6).
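For readers unfamiliar with the Kaplan-Meier method, the following Python sketch shows the product-limit calculation of cumulative mortality for one group. It is a textbook illustration, not the code used in this study (the analyses were run in SAS and R); the function name and input layout are assumptions. To mirror the plots described here, patients would first be split by whether they achieved 85% of predicted EC, and the function applied to each group separately.

```python
def kaplan_meier_mortality(times, events):
    """Return [(time, cumulative mortality)] at each distinct death time.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, 1.0 - survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= leaving
    return curve
```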
We constructed nonparsimonious multivariable Cox proportional hazards models for predicting time to death according to percent predicted EC achieved. Covariates included age, race, body mass index, diabetes (insulin-treated and noninsulin-treated), hypertension, current or recent cigarette smoking, medications (beta-blockers, nondihydropyridine and dihydropyridine calcium blockers, angiotensin-converting enzyme inhibitors, aspirin, nitrates, and statins), resting heart rate, resting systolic and diastolic blood pressures, chronotropic response, peak systolic blood pressure, heart rate recovery, ST-segment changes, and frequent ventricular ectopy in recovery. Thus, all models had the same number of covariates. For dichotomous models of EC (e.g., Cooper and Mayo) we used a term for low EC rather than percent predicted EC achieved.
The Cox proportional hazards assumption was confirmed by calculation of Schoenfeld residuals. Non-linearity assumptions were relaxed by consideration of restricted cubic splines (22). In supplementary models, we tested for pre-specified interactions including age, body mass index, and race. For illustrative purposes, we constructed plots of adjusted, predicted 10-year survival as a function of percent of predicted EC achieved. In these plots age-stratified predictions are shown; all other covariates were held to either median or modal values.
We used 4 methods to compare different models for prediction of time to death. First, we calculated a modified Akaike Information Criterion (AIC) as: LR chi-square − 2p, where LR chi-square is the model likelihood ratio chi-square and p is the number of model parameters (22). By this formulation, higher values imply models that are closer to the truth.
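The modified AIC defined above is a one-line calculation, sketched here in Python with a hypothetical function name. Because every model in this study had the same number of covariates, the penalty term 2p is identical across models and the ranking is driven by the likelihood ratio chi-square.

```python
def modified_aic(lr_chi_square, n_parameters):
    """Modified AIC as defined in the text: LR chi-square minus 2p.

    Higher values indicate a model that is closer to the truth.
    """
    return lr_chi_square - 2 * n_parameters
```

With purely illustrative numbers, two models with 25 parameters each and likelihood ratio chi-squares of 500 and 480 would have modified AIC values of 450 and 430, so the first would be preferred.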
Second, we tested for discrimination by calculating a c-index for right-censored data (23) in 100 out-of-bootstrap resamples. The c-index, which is analogous to the area under the receiver-operating characteristic curve for a purely dichotomous outcome, is calculated by comparing outcomes among patients who died with patients who did not die and had at least as much follow-up as those who did (23). It has a potential value of between 0.5 and 1.0, where 1.0 would imply perfect discrimination. To test the discriminative power of percent predicted EC, we calculated c-indexes only among patients who were not included in each bootstrap sample (i.e., “out-of-bootstrap” test sample). We randomly permuted each variable to see what impact this would have on the total model c-index. For a variable that strongly discriminates risk, this value would be large (i.e., converting that variable to noise in that out-of-bootstrap sample would result in a marked decrease in model discrimination). Third, as a measure of calibration we calculated the Nagelkerke Index R2 (24). Finally, as an arguably better assessment of calibration, we performed 100 bootstrap resamplings in which patients were divided into quintiles of predicted risk. Within each quintile actual versus predicted survival rates were calculated, and the differences were averaged to derive a weighted calibration error (25). An example of a calibration plot is shown in Figure 1; the difference between actual and predicted Kaplan-Meier 10-year death rates was small across all levels of risk.
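The discrimination measure and the permutation step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the study's code: it uses the usual pairwise formulation of the c-index for right-censored data (a pair is usable when the patient with the shorter follow-up died), it skips pairs with tied follow-up times, and the function names, the `score_fn` argument, and the tuple-per-patient layout are all hypothetical. The bootstrap resampling itself is omitted.

```python
import random
from itertools import combinations


def c_index(times, events, risk_scores):
    """Concordance for right-censored data; higher score means higher risk."""
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let 'a' be the patient with the shorter follow-up.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Usable only if that patient died; tied times are skipped here.
        if not events[a] or times[a] == times[b]:
            continue
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


def permutation_drop(times, events, rows, score_fn, column, seed=0):
    """Fall in c-index after shuffling one covariate across patients.

    rows     : one tuple of covariates per patient
    score_fn : hypothetical fitted model mapping a covariate tuple to a risk score
    """
    rng = random.Random(seed)
    baseline = c_index(times, events, [score_fn(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [
        r[:column] + (value,) + r[column + 1:]
        for r, value in zip(rows, shuffled)
    ]
    permuted = c_index(times, events, [score_fn(r) for r in permuted_rows])
    return baseline - permuted
```

A covariate that the model does not use produces a drop of zero when permuted, whereas a strong discriminator produces a large drop; this is the "importance" value reported for age and predicted EC.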
Statistical analyses were performed with SAS version 9.1 (SAS Institute, Cary, North Carolina) and R version 2.3.1 (The R Foundation for Statistical Computing, Vienna, Austria). Regression analyses and plots were performed with Harrell’s Design and Hmisc libraries (22).
Baseline and exercise characteristics according to gender are summarized in Table 1. There were 13,098 men and 9,177 women who met inclusion and exclusion criteria. Compared with men, women were older, more likely to be African-American, and somewhat more likely to have hypertension, but there were equivalent frequencies of diabetes and smoking. Women had a higher resting heart rate but similar body mass index. As expected, the median peak EC was lower among women.
Predicted EC and mortality
During follow-up there were 646 and 430 deaths among men and women, respectively. Both for men and women, failure to achieve 85% of predicted EC predicted substantially higher death rates. For example, Figure 2 shows Kaplan-Meier death rates for men according to ability to achieve 85% of predicted EC on the basis of the VA referral model. Similarly, Figure 3 shows that women who failed to achieve 85% of predicted EC on the basis of the St. James model were at markedly increased risk for death.
Comparison of EC equations as predictors of death
Tables 2 and 3 show the predictive values of gender-specific multivariable models that used different EC equations. The first column describes the name of the measure, whereas the second column indicates the number of subjects upon which that particular equation was derived. The third column gives the actual equation for peak predicted EC or, in the case of categorical descriptions, the definition of low EC. For all models, percent of predicted EC achieved was a strong independent predictor of death (adjusted p <0.0001).
The fourth column presents the modified AIC for multivariable models according to which measure of predicted EC was used. It is important to note that each model used the exact same covariates (listed in the Methods section). The fifth column presents the relative importance of age and predicted EC on the basis of the change in the right-censored c-index from out-of-bootstrap samples. For all models, age was the strongest predictor of risk, whereas predicted EC was the second strongest (p < 0.0001 in all cases). The sixth column presents the Nagelkerke Index R2. Finally, the right-most column shows the bootstrapped corrected calibration error (that is, the weighted difference [in percent] for predicted versus actual Kaplan-Meier death rates at 10 years).
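The idea behind the calibration error column can be illustrated with a deliberately simplified Python sketch. It assumes complete follow-up, so that the observed death rate in each risk group is a crude proportion rather than a Kaplan-Meier estimate, and it omits the bootstrap correction; the function name and inputs are hypothetical.

```python
def weighted_calibration_error(predicted_risk, died, n_groups=5):
    """Size-weighted mean absolute gap between predicted and observed death
    rates across groups of increasing predicted risk (quintiles by default)."""
    order = sorted(range(len(predicted_risk)), key=lambda i: predicted_risk[i])
    n = len(order)
    total = 0.0
    for g in range(n_groups):
        group = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        mean_predicted = sum(predicted_risk[i] for i in group) / len(group)
        observed = sum(died[i] for i in group) / len(group)
        total += len(group) * abs(mean_predicted - observed)
    return total / n
```

A perfectly calibrated model yields an error of 0, and a model that predicts 20% mortality in a cohort with no deaths yields an error of 0.20 (20 percentage points).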
In men (Table 2), the VA referral model had the highest AIC and the highest Nagelkerke Index R2. All models showed low calibration error. For discrimination, however, the sedentary Air Force model seemed best. Specifically, for the VA referral model, age had an importance value of 0.20, whereas predicted EC had an importance value of 0.04. This means that randomly permuting age in out-of-bootstrap samples caused the model c-index to fall by 0.20 (an enormous change), whereas similarly permuting predicted EC caused the c-index to fall by a moderate 0.04. In the Air Force model, age was a less important discriminator (change in c-index 0.097), whereas predicted EC became a more important discriminator of risk (0.076); in other words, this model was more successful in using predicted exercise capacity to discriminate risk of death after accounting for age and other confounders.
Corresponding results for women are shown in Table 3. The St. James model performed best by all 4 model validation methods. Again, all models yielded a low calibration error.
For both the VA referral model in men and the St. James model in women, we found an important age interaction (p for interaction <0.001 in men and 0.003 in women) whereby percent of predicted EC achieved behaved differently for predicting death in different age groups (Figs. 4 and 5). These interactions were significant even after adjusting for all confounders. For illustrative purposes, in Figure 4, the multivariable adjusted 10-year survival probability according to percent predicted EC achieved (VA referral model) is shown stratified by different ages. Among the youngest subjects, EC did not predict decreased survival until it had fallen to approximately 60% to 70% of predicted. Below these values, the association between survival and percent predicted EC achieved became fairly steep. In contrast, the association between predicted EC achieved and mortality was less pronounced in older subjects, and there was no clear “hinge point.” Among women (Fig. 5), the age-related differences were less pronounced, although the slope of the survival curve is steeper with each decade of life at percent predicted METs <100%.
Exercise capacity is known to be one of the most important predictors of death for men and women alike (1–5). In fact, the prognostic ability of the Duke treadmill score might in large part be driven by EC alone (26). Thus, defining normative values for EC is of utmost importance in accurate risk prediction after stress testing. The purpose of our study was not to develop a new nomogram or improve a risk score for predicting death on the basis of exercise testing; we sought to compare the prognostic abilities of previously existing definitions of EC in a large external population through statistical measures of fit (AIC), discrimination (c-index in out-of-bootstrap resamples), and calibration (R2 and bootstrapped calibration plots).
Although all models performed well, some models clearly predict mortality better than others. Of particular interest is that in both men and women, the categorical descriptors of EC do not predict death as well as age- and gender-based nomograms for predicted EC. An explanation for worse fit of these categorical models in male populations can be derived from inspection of Figure 4, whereby percent predicted METs predicts mortality in different manners on the basis of age by decade. There is a linear relationship between percent predicted METs and mortality in older patients, but in younger patients, there is a “hinge point” where approximately <85% predicted METs predicts increased mortality. In women, however, the slope of the survival curves on the basis of percent predicted METs becomes steeper with increasing age, but there is no value at which there is a “hinge point” for predicting increasing mortality on the basis of EC. This finding indicates that categorical descriptions of EC might correctly predict survival in younger men, but because of the age interaction and linear relationship between EC and survival in other groups, nomograms specific to decade of age might better predict mortality.
Insights from model validations and comparisons
A major strength of our study is that all the models we tested were derived from external data sets. In this regard, our study is the first to perform a series of external validations for previously published descriptions of EC. Model validation is a complex issue, however, because the universal truth is never known and the best investigators can do is identify which models are likely to be closer to the truth (27). Furthermore, validation involves different types of comparisons, with some considering calibration and others discrimination (28). Regarding calibration, which refers to differences between predicted and actual event rates at different levels of risk, we found that all multivariable models worked well irrespective of what measure of predicted EC was considered. Regarding discrimination, which refers to the ability to distinguish higher- from lower-risk subjects, there were more marked differences in model performance. It is noteworthy, however, that for nearly all models considered EC was the second strongest discriminator of risk, with only age performing better. This finding stresses the high clinical value of EC in routine risk stratification of patients with suspected coronary disease.
Our study used METs as an estimate of the EC, because direct measurement of oxygen consumption is not routinely performed during exercise stress testing. Direct measurement of oxygen consumption might have provided a more precise measurement of the effect of EC on mortality; however, the use of METs to describe EC during stress testing is common, and its use has been well established (17). Other potential limitations of our study include the fact that our test population was derived from 1 referral center, patients with known coronary disease were excluded, and nomograms/definitions of impaired EC were derived from diverse populations (healthy U.S. Air Force crewmen in comparison with sedentary veterans referred for stress testing for clinical reasons, for example). Finally, we do not have follow-up data on our study population. Specifically, we do not know whether the clinical management of patients was altered as a result of exercise testing and reporting of EC (e.g., changes in medication prescribing practices or counseling for smoking cessation).
Although exercise testing is traditionally thought of as a diagnostic test for detection of obstructive coronary lesions, its strength lies in its prognostic ability to identify patients who are at increased risk for death (29). Arguably, EC is the strongest test predictor and should be reported and incorporated into routine clinical practice for risk prediction. We have shown that existing models of EC that account for age and gender are, as expected, strong independent predictors of risk. Recent work has focused on gender-specific nomograms with a clear temptation to define easy-to-remember cutoff values (e.g., 85%) for identifying patients with prognostically important impairment of EC (6). Although this relatively simple approach might work (Figs. 2 and 3), it does not fully capture EC’s prognostic value. Continuous measures of percent predicted EC achieved better describe risk than simple dichotomization; this is an observation consistent with that of other risk factors like blood pressure and cholesterol (30). More importantly, however, none of the models fully account for age-related effects. Figures 4 and 5 illustrate that even with the age- and gender-based nomograms, EC behaves differently as a risk predictor in older subjects. This behavior is also similar for “classic” risk factors; for example, smoking and cholesterol are weaker predictors of risk in older subjects (30).
Although none of the models completely account for the strong interaction between age and EC in predicting all-cause mortality, of the nomograms available for use, we would recommend either the St. James model or the University of Washington model in women and the VA referral model in men. In an age when clinicians can use complex prediction models by entering variables into a computer during a stress test or clinical visit, we would recommend the routine incorporation of all available clinical and exercise test findings in a global prediction of risk rather than focusing on simplistic normal and abnormal cut-points, even for a prognostic variable as powerful as EC. We have previously derived and validated complex computer-based models for predicting mortality in patients undergoing exercise testing (31). Such comprehensive, integrated, computer-based models need to be improved to better account for age- and gender-related differences in EC as a next critical step for accurately assessing an individual patient’s risk and directing preventive care.
This work was funded by National Institutes of Health grants R01 HL-66004-2, R01 HL-072771-02, P50 HL-77107, and K12 HD049091.
- Abbreviations and Acronyms
- AIC = Akaike Information Criterion
- EC = exercise capacity
- MET = metabolic equivalent
- VA = Veterans Affairs
- American College of Cardiology Foundation
- Gulati M.,
- Pandey D.K.,
- Arnsdorf M.F.,
- et al.
- Morris C.K.,
- Myers J.,
- Froelicher V.F.,
- Kawaguchi T.,
- Ueshima K.,
- Hideg A.
- Wolthuis R.A.,
- Froelicher V.F. Jr.,
- Fischer J.,
- Triebwasser J.H.
- Barlow C.E.,
- LaMonte M.J.,
- Fitzgerald S.J.,
- Kampert J.B.,
- Perrin J.L.,
- Blair S.N.
- Lauer M.S.,
- Blackstone E.H.,
- Young J.B.,
- Topol E.J.
- Watanabe J.,
- Thamilarasan M.,
- Blackstone E.H.,
- Thomas J.D.,
- Lauer M.S.
- Newman T.B.,
- Brown A.N.
- Harrell F.E.
- Burnham K.P.,
- Anderson D.R.