Author Information
- Donald E. Cutlip, MD, FACC†,‡,
- Kalon K.L. Ho, MD, MSc, FACC†,‡,
- Richard E. Kuntz, MD, MSc, FACC†,§ and
- Donald S. Baim, MD, FACC†,§,*
- *Reprint requests and correspondence:
Dr. Donald S. Baim, Professor of Medicine, Harvard Medical School, and Director, Center for Integration of Medicine and Innovative Technology (CIMIT), Brigham and Women's Hospital, Boston, Massachusetts 02115, USA.
Human beings have an intrinsic need to extract predictability from apparent chaos, like the weather. Weather reports consequently take various forms: 1) based on all available current meteorological data, what is the estimated probability of rain next Saturday morning (when I am supposed to play golf)? 2) summarizing and comparing trends and variances, how does last month's rainfall compare to what we usually expect for July in Boston? In the world of coronary intervention, the analogous questions are as follows: 1) what is the chance that Mrs. Jones is going to die during her planned procedure or before hospital discharge? 2) how do the outcomes of her interventionalist (Dr. Smith) compare with those of other good interventionalists treating a similar mix of cases (the latter question now known as "scorecard" medicine)? Once these models are established and an increased risk is anticipated, it is reasonable to also ask, "given the increased risks in this case, what can I do differently to prevent or attenuate the anticipated complications?"
Even in the earliest years of coronary angioplasty, Andreas Gruntzig established the precedent of collecting detailed demographic and angiographic data as well as short- and long-term procedural outcomes of his patients. As the number of coronary interventions increased in the mid- and late 1980s, such data collection continued and was used to populate databases whose summary outcomes could be examined. The availability of personal computers and powerful statistical tools then allowed these databases to be examined in greater detail to identify correlates of particular outcomes of interest. In fact, the very evolution of modern percutaneous intervention has been driven by the information gained from these clinical databases, whether large single-center experiences, regional registries, medical society or government-sponsored initiatives, or multi-center clinical trials. The extent to which we are able to properly measure this type of data and use our analysis to drive improvements in practice will determine whether the quality of care and the outcome after percutaneous coronary intervention (PCI) will continue to improve.
The study by Qureshi et al. (1) in this issue of the Journal is the latest in a series of efforts to define the most important clinical variables for predicting the single most important adverse outcome: in-hospital death. The facts are familiar: the overall mortality is 1.3%, but there are some factors (acute myocardial infarction [MI], age, multi-vessel disease, and baseline renal dysfunction) whose presence separately or in combination increases the likelihood of death to 30% and whose absence reduces the risk of mortality to 0.2%. What use can we expect to make of this model, and how does it differ from earlier models?
The role of outcomes databases
The value of any such model depends on many factors: the number of patients, the detail and quality control of data collection (baseline, procedural, and outcome), the completeness of ascertainment (particularly of short-term and follow-up outcome events), referral biases that strongly affect outcome (e.g., a regional cardiogenic shock center vs. an elective-only center), the quality of the operators, and the interventional tools available during the data collection period. In a field with rapidly evolving device and pharmacologic treatment strategies, such as interventional cardiology over the past decade, these considerations are of great importance. In fact, our current expectations with drug-eluting stents, distal embolic protection, and glycoprotein IIb/IIIa blockers (success >98%, major complications <3%, late recurrence <7%) would seem utterly fantastic to an interventionalist in 1990 or even 1995.
Of equal importance are the statistical tools used to analyze the dataset and a clear understanding of their robustness for the particular forecasting uses that are planned. Published estimates of risk for an individual patient may aid the patient and family in the consenting process or assist the operator in selecting or avoiding specific devices or adjunctive pharmacotherapy. The certainty of this type of prediction, however, is limited if there are differences between the model set and the patients to whom the model is being applied (some differences not being fully captured in standard angiographic and clinical variables) or if there are other statistical issues such as sampling variability and random variability in operator performance. These limitations are of particular concern when the model is going to be used to provide a performance "scorecard" for other operators.
Developing a risk prediction model
The general strategy of the risk-prediction process includes having access to a large, detailed, and relatively contemporaneous dataset and understanding its intrinsic limitations (which variables were collected, whether they were ascertained in all patients, whether data coding used uniform definitions and unbiased collection agents, whether angiographic data were evaluated by the operators or by a core laboratory, and so forth). A classic multivariable model requires that a relatively small number of pre-specified potential risk factors be selected based on clinical logic (too many candidate variables, or post hoc selection of such variables, increases the risk of type 1 [false positive] results). Because these variables may be related to each other (e.g., congestive heart failure, left ventricular function, previous MI), careful multivariable modeling should then be used to identify which remain independent predictors after adjustment is made for all other variables in the model.
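As a concrete (and purely hypothetical) illustration of this modeling step, the sketch below fits a pre-specified set of candidate variables in a logistic regression on simulated patient-level data; the variable names, coefficients, and event rates are invented for the example and do not come from any of the cited databases.

```python
# Minimal sketch of a pre-specified multivariable logistic model for
# in-hospital death. All data, variable names, and effect sizes are
# simulated for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "age": rng.normal(65, 12, n),
    "acute_mi": rng.binomial(1, 0.15, n),
    "multivessel": rng.binomial(1, 0.40, n),
    "creatinine": rng.lognormal(0.1, 0.3, n),
})
# Simulate an outcome with roughly 1-2% overall mortality so the example runs end to end.
logit_p = (-7.5 + 0.04 * df["age"] + 1.2 * df["acute_mi"]
           + 0.5 * df["multivessel"] + 0.6 * df["creatinine"])
df["death"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Fit only the pre-specified candidates and report adjusted odds ratios,
# which indicate whether each factor remains an independent predictor
# after adjustment for the others.
model = smf.logit("death ~ age + acute_mi + multivessel + creatinine", data=df).fit()
print(np.exp(model.params))      # adjusted odds ratios
print(np.exp(model.conf_int()))  # 95% confidence intervals
```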
The currently available models have good utility, but each has some level of limitation (2–9). Several were developed within single centers (6,8,9), specialized centers (4), or particular geographic regions (2,5) and thus may have limited generalizability to other populations. Many years may elapse between the collection of data, analysis, and eventual publication, compromising applicability to contemporary practice by the time the results are available. Moreover, robust models require a large sample size to predict outcomes that occur infrequently. For in-hospital mortality after PCI, a 10,000-patient database with a 1.5% overall mortality has only 150 events, which limits the number of variables it can test (roughly 20 events are required for each variable tested). Most of the databases used for model development and validation have thus included far too few patients for complex models of mortality prediction.
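A back-of-the-envelope check of that constraint, using the figures quoted above:

```python
# Roughly how many candidate variables can a 10,000-patient database with
# 1.5% overall mortality support at ~20 outcome events per variable?
patients, mortality, events_per_variable = 10_000, 0.015, 20
events = patients * mortality                  # 150 deaths
print(events, events // events_per_variable)   # about 7 variables
```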
The quality of any model must also be measured carefully. A model constructed using a given population (the test set) is then validated by testing it either in another portion of the same database (by jack-knifing or bootstrapping) or in a separate external database (the validation set). These procedures reduce the chance that a detected predictor reflects a unique property of the test set rather than being a robust predictor. The statistical quality of the proposed models relies on two measures: discrimination and goodness-of-fit. Discrimination is usually measured by the c statistic, which reflects the area under the receiver-operating curve and thus the model's ability to distinguish true-positive outcomes from false positives. Models with a c statistic approaching 1 have near-perfect discrimination, with a false-positive rate of 0% and a true-positive rate (sensitivity) of 100%. Logistic regression models with c statistics in the range of 0.80 are usually considered to have high discriminatory ability, but this means that the model will still miss 20% of the patients with that adverse event. In fact, it would be only slightly better than a model with no discriminatory ability, whose receiver-operating curve is a straight diagonal line with an area of 0.50! Goodness-of-fit is frequently assessed using the Hosmer-Lemeshow test, which determines the difference between the event rate predicted by the model and the observed rate. A p value >0.1 usually indicates that the model provides a good fit for the data and that the differences between observed and predicted rates are not statistically significant, but it does not exclude potentially clinically significant differences between observed and predicted outcomes. Therefore, high scores for discrimination and goodness-of-fit do not necessarily mean that the model has high predictive accuracy for individual patients. A mortality prediction model for a population of 10,000 patients may thus predict a mortality of 10% for the highest decile, but even with a perfect fit, we do not know which 100 of those 1,000 patients will die.
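To make the two measures concrete, the sketch below computes a c statistic with scikit-learn and a Hosmer-Lemeshow statistic by hand for a set of hypothetical predicted risks and outcomes; it illustrates the definitions above rather than any published model.

```python
# Sketch: discrimination (c statistic / area under the ROC curve) and
# goodness-of-fit (Hosmer-Lemeshow) for hypothetical predicted risks.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.metrics import roc_auc_score

def hosmer_lemeshow(y, p, groups=10):
    """Chi-square comparing observed and expected events across deciles of predicted risk."""
    d = pd.DataFrame({"y": y, "p": p})
    d["decile"] = pd.qcut(d["p"], groups, labels=False, duplicates="drop")
    obs = d.groupby("decile")["y"].sum()
    exp = d.groupby("decile")["p"].sum()
    n = d.groupby("decile")["y"].count()
    chi2 = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    return chi2, stats.chi2.sf(chi2, d["decile"].nunique() - 2)

rng = np.random.default_rng(1)
p_hat = rng.beta(1.5, 100, 10_000)   # hypothetical predicted mortality risks (~1.5% mean)
y = rng.binomial(1, p_hat)           # outcomes drawn from those risks (perfectly calibrated)

print("c statistic:", roc_auc_score(y, p_hat))
print("Hosmer-Lemeshow chi2, p:", hosmer_lemeshow(y, p_hat))
```

Because the simulated outcomes are drawn from the predicted risks themselves, the fit is good by construction; the point is only how the two quantities are computed.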
The good news is that the available PCI mortality models provide some reassuring features despite these inherent limitations. First, each of the models included in Table 1 does have high discriminatory and goodness-of-fit scores. Second, even though the models represent patients from different eras and various populations, their strongest predictors are remarkably consistent and relate mostly to patient rather than technical variables. This has been substantiated in a recent report from the National Heart, Lung, and Blood Institute (NHLBI) dynamic registry, in which three of the five tested models developed in the pre-stent era (New York State, Northern New England Cooperative Group, and Cleveland Clinic Foundation) showed excellent correlation between predicted and observed mortality among patients in the NHLBI database treated between 1997 and 1999 (10). This is somewhat less certain for angiographic lesion factors, however, because many of the lesion characteristics in the American College of Cardiology/American Heart Association classification scheme (e.g., lesion eccentricity) have been eliminated as technology has improved. Also, some of the remaining angiographic variables are actually surrogates for basic clinical variables (e.g., recent total occlusion is a surrogate for acute MI) (8).
The model presented by Qureshi et al. (1) is simple enough to use at the bedside, appears to be useful for forecasting procedural risk for some important patient groups, and has high discriminatory ability and calibration. Knowing that a patient is in the highest risk group, whose expected mortality is more than 10 times that of the lowest risk group, may be useful in giving patients and families a more refined risk estimate than the routinely quoted "1% mortality" and may assist the operator in making decisions regarding the use of certain therapeutic options.
But knowledge of the most reliable predictors should also allow the outcomes observed for different operators or hospitals to be compared with the outcomes expected from the predictive model. This would ideally allow appropriate and complete adjustment for significant differences in the baseline risk of the treated population, and thus fair comparisons between operators or hospitals (the rainfall-in-July question). There are several concerns, however, with using this model to compare different operators and hospitals in a scorecard fashion. The boundary selected by Qureshi et al. for each of the four variables is arbitrary and does not delineate among various levels of increased risk. For example, although patients over age 65 are at higher risk, this risk is certainly higher for an 86-year-old than for a 66-year-old patient. Likewise, no one would question a higher overall risk for patients with MI within 14 days, but the highest-risk patients are those being treated for acute MI within 24 h, particularly if they have hemodynamic instability. Similar arguments can be made for the other two variables, creatinine >1.5 mg/dl and multi-vessel disease, which the model treats as unqualified binary variables. The considerable smoothing of the overall risk curve produced by these dichotomous cutoffs may lead to systematic underestimation of risk for the truly high-risk patient and significant overestimation of risk for many other patients, and thus to inadequate adjustment for higher- or lower-risk cohorts across operators and hospitals whose distribution of variable values differs from that of the test set.
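The dichotomization concern can be illustrated with a hypothetical comparison: a binary age >65 term assigns identical risk to a 66-year-old and an 86-year-old, whereas a continuous age term separates them. The coefficients below are invented for illustration and are not those of the published model.

```python
# Illustrative only: identical predicted risk under a dichotomized age term
# versus graded risk under a continuous age term. Coefficients are invented.
import numpy as np

def risk_binary(age, base_logit=-4.6, coef_over65=1.0):
    return 1 / (1 + np.exp(-(base_logit + coef_over65 * (age > 65))))

def risk_continuous(age, base_logit=-7.5, coef_age=0.045):
    return 1 / (1 + np.exp(-(base_logit + coef_age * age)))

for age in (66, 86):
    print(age, f"binary: {risk_binary(age):.1%}", f"continuous: {risk_continuous(age):.1%}")
```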
Risk adjustment models, scorecards, and quality improvement
Although such simplified risk-predictor models for scoring individual patients may have limited utility beyond what an experienced clinician can surmise using even less complex methods of clinical assessment, the future for true risk-adjustment models appears much brighter. Keeping performance scores for individual operators or, more commonly, for institutions has been of increasing interest over the past 10 to 15 years, following the lead of cardiac surgery. Even though such scoring systems are initiated for the purpose of quality assessment and improvement, public reporting of results and the dissemination of provider rankings add an element of fear and anxiety for many providers, particularly if data unadjusted for risk are published, as they were by Medicare for coronary artery bypass graft surgery in 1987. This underscores the importance of using the most refined and scientifically valid methods for risk adjustment.
Unfortunately, even the best and most sophisticated multiple regression models developed for the purpose of risk adjustment have serious deficiencies and limitations, as discussed previously. Moreover, even the best models cannot compensate for the smaller sample sizes present at the institution or operator level and the associated statistical uncertainty. The resulting wide confidence intervals make it virtually impossible to provide any meaningful estimate of the appropriateness of outcomes for a low-volume operator or institution: a low-volume operator may look very good or very bad depending on how his or her last case went, and such models cannot fully correct for all confounders of risk in a small sample. There are additional problems, such as failure to account for sampling variability, unmeasured confounding, and random variability (noise) between operators, that are not fully correctable by any model. When the resulting data are disclosed publicly, therefore, any risk-adjustment effort must be viewed as imperfect rather than as a true leveling of the playing field.
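A hypothetical calculation makes the point: an exact (Clopper-Pearson) confidence interval around the observed mortality of a low-volume operator spans an order of magnitude, whereas the same rate observed over a much larger volume is far more tightly bounded. The case counts below are invented for illustration.

```python
# Exact binomial confidence intervals for observed mortality at two case volumes.
from scipy.stats import beta

def exact_ci(deaths, cases, alpha=0.05):
    """Clopper-Pearson interval for an observed event rate."""
    lo = 0.0 if deaths == 0 else beta.ppf(alpha / 2, deaths, cases - deaths + 1)
    hi = 1.0 if deaths == cases else beta.ppf(1 - alpha / 2, deaths + 1, cases - deaths)
    return lo, hi

for deaths, cases in [(1, 50), (15, 1000)]:   # both observed rates near 1.5-2%
    lo, hi = exact_ci(deaths, cases)
    print(f"{deaths}/{cases}: {deaths / cases:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```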
Given these significant problems with multiple regression models, Shahian et al. (11) have suggested the use of hierarchical or random-effects models for risk adjustment among cardiac surgery providers. Hierarchical models reduce overly optimistic precision estimates by attempting to adjust for the variance in treatment decisions among physicians and patients in the predictor dataset. Accounting for random operator effects dampens variability toward the mean and thereby provides more reliable estimates (12). Although such models are much more complex, they are not beyond the capacity of groups involved in the risk-adjustment exercise.
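The shrinkage behavior such models provide can be illustrated with a simple empirical-Bayes (beta-binomial) sketch, in which each operator's raw mortality is pulled toward the pooled rate in proportion to how little data that operator contributes. This is a stand-in for, and is much simpler than, the hierarchical models discussed by Shahian et al.; the counts and prior strength are invented for illustration.

```python
# Illustration of shrinkage toward the mean: small-volume operators' raw rates
# are pulled strongly toward the pooled rate, large-volume operators' barely.
deaths = [0, 2, 3, 30]        # hypothetical per-operator deaths
cases = [40, 60, 150, 2000]   # hypothetical per-operator case volumes

overall = sum(deaths) / sum(cases)           # pooled mortality rate
prior_strength = 400                         # assumed prior weight; larger = more shrinkage
a0, b0 = overall * prior_strength, (1 - overall) * prior_strength

for d, n in zip(deaths, cases):
    shrunk = (a0 + d) / (a0 + b0 + n)        # posterior mean under a Beta prior
    print(f"{d}/{n}: raw {d / n:.1%} -> shrunk {shrunk:.1%}")
```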
In summary, the objective of any risk-prediction or risk-adjustment tool should be to foster continuous quality improvement. Although simple bedside scoring as proposed by Qureshi et al. (1) may be of some use for classifying patients into broad risk categories, the ramifications of bona fide risk adjustment demand more complex systems. Public presentation of the results must be undertaken cautiously and with adequate explanation of limitations to avoid unnecessary punitive components that might lead to gaming of the system (e.g., by avoiding high-risk cases, which may deny benefit to the patients with the most to gain from a high-quality procedure). Although the tracking of performance scores within individual centers and comparisons with regional or national standards are desirable, those centers should also implement the minimum volume standards that have been shown to be reasonable, if not perfect, surrogates for performance quality (4,13). Finally, it is not clear whether mortality is the appropriate outcome measure, given its low frequency and the increasing difficulty of predicting risk as the frequency of the studied event diminishes. Other outcome measures that reflect procedural success and sound judgment, rather than the natural history of an acute illness, may be more useful. Physician-led continuous quality improvement initiatives that include the reporting of such specified procedural measures have been effective in cardiac surgery (14). Regardless of the statistical methods used, however, the goal of continuous quality improvement is essential to delivering the brightest forecast for the safety of our interventional cardiology patients.
* Editorials published in the Journal of the American College of Cardiology reflect the views of the authors and do not necessarily represent the views of JACC or the American College of Cardiology.
- Qureshi MA, Safian RD, Grines CL, et al. Simplified scoring system for predicting mortality after percutaneous coronary intervention. J Am Coll Cardiol 2003;42:1890–5.
- Kimmel SE, Berlin JA, Strom BL, Laskey WK.
- Ellis SG, Weintraub W, Holmes D, Shaw R, Block PC, King SB 3rd.
- O'Connor GT, Malenka DJ, Quinton H, et al.
- Moscucci M, Kline-Rogers E, Share D, et al.
- Ellis SG, Guetta V, Miller D, Whitlow PL, Topol EJ.
- Holmes DR, Selzer F, Johnston JM, et al.