*Reprint requests and correspondence:
Dr. Daniel B. Mark, Professor of Medicine, Duke Clinical Research Institute, 2400 Pratt Avenue, Room 0311, Durham, North Carolina 27705 OR PO Box 17969, Durham, North Carolina 27715, USA.
The medical literature overflows with articles offering to help clinicians make better predictions. These studies trumpet the virtues of prediction models and associated algorithms to help triage patients in the emergency room, to decide between medical or surgical therapy, and to identify asymptomatic patients who should take medications to prevent a future health catastrophe (1). Most of these heterogeneous (and sometimes conflicting) models have been the products of multivariable statistical analyses. Formerly an esoteric mathematical technique of interest only to a small group of statisticians, multivariable analysis is now widely used for the development of new prediction rules.
This burgeoning research focus on prediction rules is driven by growing recognition of two issues: 1) the central activity of clinical medicine is making predictions in the setting of uncertainty; and 2) doctors are not very good at this task. Health care providers do not usually think in terms of explicit risks or probabilities, but probabilistic reasoning underlies almost every clinical decision. Although we often associate predictions with prognostication, formulating a diagnosis also relies on probabilistic estimates. Furthermore, diagnostic and prognostic assessments are major determinants of subsequent management decisions.
Unfortunately, the unassisted human brain is poorly adapted to making and updating precise quantitative predictions. In the most common tests of clinical predictive abilities, doctors are given written summaries of key patient data and asked to make estimates of specific outcomes (e.g., operative mortality, five-year survival). These predictions are then compared with those from validated multivariable models and actual patient outcomes. In one careful head-to-head comparison of 49 cardiologists given detailed case summaries and a computer-based statistical model, the model was significantly more accurate predicting three-year survival and infarction-free survival in coronary artery disease (CAD) patients (2). In addition, there was substantial variability among physicians when predicting for the same patient. The level of physician experience (measured as years since completion of cardiology fellowship) did not alter predictive accuracy. A second study found that a prognostic treadmill score was better than both cardiologists and internists for predicting angiographic findings in patients with chest pain, and similar to expert cardiologists at predicting prognosis (3). A recent study of the predictions of operative mortality from coronary bypass surgery by cardiac surgeons found that even when these physicians were given access to a predictive rule, their estimates of outcome did not improve (4). The study investigators concluded that the surgeons trusted their own intuitive judgment over the statistical model. Although clinicians often contend that they acquire certain ineffable impressions from the direct clinical encounter that would be inaccessible to a computer and are critical in guiding their assessments, it remains debatable whether these qualitative impressions improve medical decisions or merely bias them.
On the other hand, statistical models offer significant inherent advantages over clinician predictors. First, they can correctly register the simultaneous importance of a dozen or more factors, whereas most clinicians are able to handle far fewer pieces of information at one time. Second, models assign identical predictions when presented with identical data, whereas clinicians are less consistent. The growing popularity of clinical guidelines suggests that consistent application of evidence-based medicine is a desirable feature of contemporary medicine.
Assuming that valid statistical models can be created that are better than most physicians at predicting outcomes, why are they not widely used clinically? Although some prediction rules, such as the Thrombolysis in Myocardial Infarction trial risk scores and the Goldman non-cardiac surgery risk models, have achieved a measure of acceptance into practice, many other models and scores have not. Several barriers to implementation exist. First, the resistance of clinicians to the use of these tools may reflect tacit acknowledgment that they do not know how to take advantage of the incremental predictive value provided. Underlying the proliferation of prediction rules is the seductive but unproven assumption that better predictions will translate into better management. At present, coupling between outcome predictions and management decisions remains unclear. The best, most accurate predictions of outcome do not guarantee consensus on what should be done next for the patient.
Second, physicians who struggle to calculate creatinine clearance on the basis of the four factors of the Cockcroft-Gault equation may find models that incorporate eight or more factors unwieldy even with a pocket card, nomogram, or calculator. Third, the enormous number of models available for specific questions makes it difficult for busy clinicians to find and implement the right model at the right time. Although one or two particularly useful models may be implemented in a busy practice, clinicians do not have the time to find a model for every situation.
Fourth, many current models do not account for the dynamic, iterative way in which patient information becomes available. Often, the initial wave of information presented to the clinician consists of demographics, history, physical examination, and a few laboratory tests. Additional information accrues from subsequent diagnostic tests and therapeutic trials, informed by the results of preceding management decisions. This serial updating of predictions is the essence of Bayesian statistics: the post-test probability is determined by the pretest probability, the test result, and the accuracy of the test. But Bayes' Rule becomes difficult to apply in a repeated sequential evaluation, partly because the information from each new test is usually at least partially correlated with data that are already known. Multivariable statistical models can account for this redundancy more effectively, but at the cost of added computational complexity.
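The single-test update described above can be sketched in a few lines. This is a minimal illustration, assuming a binary test summarized by its sensitivity and specificity; the function name and the numeric inputs are hypothetical and are not taken from any study cited here.

```python
def post_test_probability(pretest: float, sensitivity: float,
                          specificity: float, positive: bool) -> float:
    """Update a disease probability after one binary test result (Bayes' Rule).

    All probabilities must lie strictly between 0 and 1.
    """
    for name, value in (("pretest", pretest),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")

    # Convert the pretest probability to odds.
    pretest_odds = pretest / (1.0 - pretest)

    # Likelihood ratio for the observed result.
    if positive:
        likelihood_ratio = sensitivity / (1.0 - specificity)
    else:
        likelihood_ratio = (1.0 - sensitivity) / specificity

    # Post-test odds, converted back to a probability.
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Illustrative (made-up) inputs: 30% pretest probability, a test with
# 70% sensitivity and 80% specificity, and a positive result.
after_first_test = post_test_probability(0.30, 0.70, 0.80, positive=True)
print(round(after_first_test, 3))
```

Feeding `after_first_test` back in as the pretest probability for a second test is valid only if the two tests are conditionally independent given disease status. When the second test is partially redundant with the first, this naive chaining overstates the evidence, which is the difficulty with sequential application noted above.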
The exercise treadmill test provides a valuable example of the benefits and challenges of making accurate clinical predictions. Although cardiologists increasingly prefer imaging stress tests, “plain old treadmills” are still performed frequently by internists and family physicians, and remain the initial test of choice for many patients according to the American College of Cardiology/American Heart Association Exercise Testing Guidelines (5). Even without imaging, the exercise treadmill test provides several useful prognostic measures, some only recognized recently (6). For many clinicians, the treadmill test provides two major data elements, a “positive” or “negative” ST-segment response and “adequate” or “inadequate” exercise level based on the maximum exercise heart rate achieved. In 1978, McNeer et al. (7) proposed a simple prediction rule for identifying high-risk patients, the “early positive treadmill,” defined as exercise ST-segment depression ≥1 mm and an exercise time of 6 min or less on the standard Bruce protocol. In 1987, McNeer's work was extended using multivariable analysis of a broader range of candidate variables in a cohort of inpatients undergoing both exercise testing and cardiac catheterization, resulting in the Duke Treadmill Score (DTS) (8). Importantly, the DTS was validated in a population separate from the one used to derive the score. The predictive ability of the DTS was further tested in an outpatient cohort at Duke (9) and at other medical centers (6,10). As a result of its robust prognostic value, the DTS has been incorporated into several clinical practice guidelines. In addition, many research studies assessing imaging techniques in exercise testing have used the DTS as a benchmark for prognostic value.
In this issue of the Journal, Morise and Jalisi (11) compare the incremental prognostic value of two new Treadmill Scores (one for men and one for women) with the DTS in a sample of 4,640 patients with suspected CAD. Using the end point of all-cause death, these new scores stratified CAD risk well and had better incremental value than the DTS for risk stratifying patients grouped by their pretest score.
Have Morise and Jalisi (11) built a better mousetrap? What are the criteria for replacing one predictive model with another? From the plethora of predictive models for cardiovascular medicine, which ones should be incorporated into practice? In our view, at least three factors should be considered in such a decision. First, does the new model or predictive rule provide a significant and reproducible increment in predictive information? Second, will the added information from the new model translate into important changes in patient management? And finally, will the new model actually be used by practitioners at least as often (and hopefully more often) than the older model?
How do the new Treadmill Scores stack up by these criteria? The DTS contains only three exercise test variables: exercise ST deviation, exercise time, and exercise angina. The new models contain three exercise variables (exercise ST depression, exercise angina, and maximum exercise heart rate) plus four (for males) or five (for females) clinical variables. Is the apparent improvement in predictive performance of the new models due to better use of exercise data, or the addition of clinical data to the exercise variables? Because two of the three exercise variables are the same in the DTS and the new models, is exercise heart rate a better prognostic variable than exercise intensity? In the original work to derive the DTS, we found that exercise time (presumably a surrogate for peak metabolic equivalent level attained) was a far superior prognostic variable, whereas peak exercise heart rate was a superior diagnostic variable (8). The superiority of exercise time as a prognostic variable continues to be evident in our more recent data sets, as well as in data from other centers (12). Based on these observations, we believe that by using models optimized for the diagnosis of CAD, Morise and Jalisi (11) likely understated the prognostic value of exercise test data. Thus, the improved prognostic strength of their new models appears to derive primarily from their clinical rather than exercise test elements.
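The text names the three DTS inputs but does not give the formula. As a minimal sketch, assuming the commonly cited form of the score (Bruce-protocol exercise time in minutes, minus 5 times the ST deviation in millimeters, minus 4 times an angina index of 0, 1, or 2) and the usual low, moderate, and high risk cut points, the calculation might look like the following. The function names and the treatment of non-integer scores at the category boundaries are illustrative assumptions.

```python
def duke_treadmill_score(exercise_minutes: float,
                         st_deviation_mm: float,
                         angina_index: int) -> float:
    """Compute the DTS under the commonly cited formula (an assumption here).

    angina_index: 0 = no exercise angina, 1 = non-limiting angina,
                  2 = exercise-limiting angina.
    """
    if angina_index not in (0, 1, 2):
        raise ValueError("angina_index must be 0, 1, or 2")
    if exercise_minutes < 0 or st_deviation_mm < 0:
        raise ValueError("exercise time and ST deviation must be non-negative")
    return exercise_minutes - 5.0 * st_deviation_mm - 4.0 * angina_index


def dts_risk_group(score: float) -> str:
    """Map a score to the conventional risk groups (assumed cut points)."""
    if score >= 5:
        return "low"
    if score >= -10:
        return "moderate"
    return "high"


# Hypothetical patient: 6 minutes of exercise, 2 mm ST deviation,
# non-limiting angina.
score = duke_treadmill_score(6, 2, 1)
print(score, dts_risk_group(score))
```

With only three inputs and integer weights, the score can be worked out by hand, which is the kind of computational simplicity that a model with seven or eight variables gives up.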
Whether a “Treadmill Score” should include pretest clinical variables is presently a matter left to the discretion of the score designer. In general, one would expect predictive power to increase with the addition of a moderate number of clinical variables (overfitting with too many variables can actually degrade predictive performance). The major trade-off is increasing computational complexity. We suspect that clinicians would accept the substitution of a more complex score for a simple one if they perceived the new score to be a substantially more powerful tool in the management of patients. Proving incremental utility in practice involves more than demonstrating a modest increase in the area under the receiver operating characteristic curve. At the present time, models are being derived for use by clinicians, not by computers. Thus, the relevant test of incremental value is not relative to another model, but rather to the judgment of the target clinician audience. Examining the incremental value of a predictive rule or score in this way is difficult, but it can be done (13,14).
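For readers unfamiliar with the metric, the area under the receiver operating characteristic curve equals the probability that a randomly chosen patient who had the outcome received a higher predicted risk than a randomly chosen patient who did not. The sketch below computes it by direct pairwise comparison. The function name and the toy data are hypothetical, and the pairwise approach is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(risk_scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise concordance.

    outcomes: 1 if the patient had the event, 0 otherwise.
    Ties in predicted risk count as half-concordant.
    """
    if len(risk_scores) != len(outcomes):
        raise ValueError("risk_scores and outcomes must have the same length")
    events = [s for s, y in zip(risk_scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(risk_scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Toy comparison of two hypothetical models on the same four patients.
outcomes = [0, 0, 1, 1]
model_a = [0.10, 0.40, 0.35, 0.80]
model_b = [0.10, 0.30, 0.35, 0.80]
print(roc_auc(model_a, outcomes), roc_auc(model_b, outcomes))
```

A higher value for one model shows only that it ranks patients better than the other model does. It says nothing about whether either model outperforms the clinicians who would use it, or whether the difference would change management.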
If the models of Morise and Jalisi (11) improve on the DTS by adding in pretest clinical variables, what determines whether clinicians would view the added predictive accuracy as an acceptable trade-off against the added computational complexity? To be competitive for adoption into clinical practice in the current era, prediction rules must not only demonstrate acceptable validity and satisfy a perceived clinical need, they must also possess intuitive appeal and be computationally simple. The progressive computerization of medical practice will eventually alleviate the latter two constraints as future medical information systems will transparently integrate complex statistical models into routine practice, requiring no special effort from the doctor. In such a future practice environment, predictions of relevant outcomes will be accurately updated every time important new data are collected. As with today's clinical guidelines, practitioners will need to learn the strengths and limitations of these new tools and discover settings where they can be used most appropriately. Until then, many well-constructed predictive models will remain ignored by the very clinicians for whom they were created.
☆ Editorials published in the Journal of the American College of Cardiology reflect the views of the authors and do not necessarily represent the views of JACC or the American College of Cardiology.
- American College of Cardiology Foundation
- Gibbons R.J., Balady G.J., Bricker J.T., et al.
- McNeer J.F., Margolis J.R., Lee K.L., et al.
- Morise A.P., Jalisi F. Evaluation of pretest and exercise test scores to assess all-cause mortality in unselected patients presenting for exercise testing with symptoms of suspected coronary artery disease. J Am Coll Cardiol 2003;42:842–50.
- Goldman L., Cook E.F., Mitchell N., et al.