Author information
- Received December 31, 1998
- Revision received July 13, 1999
- Accepted August 12, 1999
- Published online November 15, 1999.
- *Reprint requests and correspondence: Dr. John S. Gottdiener, Cardiology Research, St. Francis Hospital, 100 Port Washington Boulevard, Roslyn, New York 11576.
This study sought to determine whether statistical analysis of a computerized clinical diagnostic database can be used as a tool for quality assessment by determining the contribution of reader bias to variance in diagnostic output.
In industry, measurement of product uniformity is a key component of quality assessment. In echocardiography, quality assessment has focused on review of small numbers of cases, or prospective determination of reader variability in selected and relatively small subsets. However, diagnostic biases in clinical practice might be discerned utilizing large computerized databases to determine interreader differences in diagnostic prevalence and, with use of appropriate statistical methods, to determine the association of reader selection with diagnostic prevalence independently of other covariates.
We analyzed 6,026 echocardiograms in a computerized database, read by one of three level 3 (American Society of Echocardiography) readers, for differences in frequency among four coded echocardiographic diagnoses: mitral valve prolapse, valvular vegetations, left ventricular (LV) thrombus, and LV regional wall-motion abnormality.
Significant differences (up to fourfold) were found between readers, which persisted after statistical adjustment for the population characteristics that differed slightly between readers. The low population prevalence of these conditions would have made it unlikely that these interreader differences could be detected by nonstatistical methods. Additionally, chamber dimensions differed between readers and were not normally distributed.
Statistically based quality assessment analysis of computerized clinical databases facilitates ongoing monitoring of interreader bias despite low diagnostic prevalence, and targets opportunities for subsequent quality improvement.
In industry, attainment of product uniformity is a key component of quality control. A similar standard is appropriate in medicine, where the interpretation of diagnostic studies has a strong impact on patient management. Although epidemiological techniques have long been used to study disease characteristics, these methods have not found widespread use in diagnostic quality control. Specifically, the concept of product uniformity as a measure of quality, which has enjoyed long-standing industrial application (1,2), has not been extrapolated to diagnostic cardiac imaging. Hence, we sought to determine whether statistical sampling of echocardiographic diagnoses could be utilized as a method of quality assessment.
Our hypothesis was based on an input-output model, whereby given the assumption of equal distribution of case mix over time (input), consistency of diagnostic statements and measurements (output) between readers can be utilized as a measure of quality. Patients (the clinical input) are processed by the diagnostic system, and diagnostic statements (the output) are generated. In a population of sufficient size, the distribution of clinical conditions interpreted by the individual readers should be similar. Hence, the variables that may influence output are equipment, sonographers, and differences in diagnostic styles among interpreting physicians. If the equipment and sonographer assignments do not differ between readers, then differences in the prevalence of diagnostic statements and in the distribution of quantitative measurements among readers must indicate interreader variability in the physicians’ interpretations. Even where there are confounders that potentially affect diagnostic output, such as inequalities in patient characteristics or differences in sonographer and echocardiograph machine assignments between physicians, multivariate analyses can be used to statistically adjust for the input variation and thereby determine the independent effect of physician reader as a predictor of variation in diagnostic output.
The study sample was drawn from the echocardiography database at Georgetown University Hospital. We queried the database for transthoracic echocardiograms, performed either electively or emergently, between November 1993 and June 1996, which were read by one of three level 3 echocardiographers (American Society of Echocardiography) (3).
Echocardiograms were performed primarily by technician sonographers at Georgetown University Hospital. The studies were recorded on videotape and were read on the same day as they were performed. Each physician was assigned at least one day of the week to read studies; the day assignments varied over the study period. Studies were not assigned to a reader based on particular expertise or clinical interest. All sonographers contributed randomly to the case mix of the interpreting physician. The echocardiographic results were recorded on a standardized form at the time of the study, and data were entered into an electronic database (FoxPro 2.6, Microsoft, Redmond, Washington).
The database contains a wide selection of variables, including demographic, clinical, and echocardiographic parameters. We collected both categorical and continuous data elements. Measured variables included chamber dimensions, estimated pressures, calculated valve areas, and ejection fraction (EF). Conditions such as left ventricular (LV) hypertrophy, valvular stenosis and regurgitation, and chamber dilation were graded from mild to severe when present. The database also permitted an equivocal response (“suspected”) for conditions where the diagnosis was uncertain.
We arbitrarily chose four conditions of clinical interest. The presence of mitral valve prolapse (MVP), valvular vegetations, LV thrombus, and regional wall-motion abnormality were determined by reader judgment. For diagnoses other than regional wall-motion abnormality, readers had the option of identifying each finding as “present,” “absent,” or “suspected.” For the purposes of this analysis, we merged “suspected” and “present” diagnoses into a single group. For MVP, interreader diagnostic prevalence was compared for both “present” and “suspected.”
Categorical variables were labeled as either “present” or “absent.” A blank response for a categorical variable was considered to indicate the absence of a particular finding. In the case of a continuous variable, a blank response was considered to represent the inability to measure the finding. Clinically relevant data were missing in a small proportion of cases. Given the large size of the database, we chose not to impute data.
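The coding convention above can be sketched with a few hypothetical records (the field names and values here are illustrative, not taken from the study's database):

```python
# Hypothetical records illustrating the coding convention described above:
# a blank categorical entry is read as "absent"; a blank continuous entry
# means the measurement could not be made and is left missing (not imputed).
records = [
    {"mvp": "present",   "la_dim_cm": 4.1},
    {"mvp": "",          "la_dim_cm": None},   # both fields left blank
    {"mvp": "suspected", "la_dim_cm": 3.6},
]

cleaned = [
    {"mvp": r["mvp"] or "absent",   # blank categorical -> "absent"
     "la_dim_cm": r["la_dim_cm"]}   # blank continuous stays missing
    for r in records
]

print([r["mvp"] for r in cleaned])  # ['present', 'absent', 'suspected']
```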
The prevalence of each of the qualitative clinical diagnoses was compared between interpreting cardiologists using the chi-square test. Pairwise comparisons were then made with a chi-square test for each possible pair of cardiologists.
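The overall and pairwise chi-square comparisons can be sketched as follows, using hypothetical counts (the study totals per reader are taken from the Results section, but the "present" counts here are invented for illustration):

```python
from itertools import combinations
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table: rows = finding present/absent,
# columns = Readers 1-3. "present" counts are illustrative only.
present = [51, 8, 6]
total = [2702, 2101, 1223]
table = [present, [t - p for t, p in zip(total, present)]]

# Overall comparison of diagnostic prevalence across all three readers
chi2, p, dof, expected = chi2_contingency(table)
print(f"overall chi-square p = {p:.4g}")

# Pairwise comparisons: a separate chi-square test for each reader pair
for i, j in combinations(range(3), 2):
    pair = [[present[i], present[j]],
            [total[i] - present[i], total[j] - present[j]]]
    _, p_pair, _, _ = chi2_contingency(pair)
    print(f"Readers {i + 1} vs {j + 1}: p = {p_pair:.4g}")
```

Running many pairwise tests inflates the type I error rate, so in practice a multiple-comparison correction (e.g., Bonferroni) may be warranted.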
We utilized the one-sample Kolmogorov-Smirnov test (4) to evaluate the normality of the distribution of the continuous variables—left atrial size, LV end-diastolic dimension, and LVEF. For variables with an underlying normal distribution, we employed the parametric analysis of variance (ANOVA) test to compare mean values for each of the physicians. For parameters with a nonnormal distribution, we compared the medians for each reader using a nonparametric analog of one-way ANOVA (5). Differences in the distribution of continuous measurements between readers were assessed with the Mood median test (6), and differences in variance among the readers were analyzed by the Levene test (7).
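This testing sequence has direct counterparts in `scipy.stats`. A minimal sketch, using simulated (skewed) LVEF-like data rather than the study's measurements, and the Kruskal-Wallis test as the nonparametric analog of one-way ANOVA:

```python
import numpy as np
from scipy.stats import kstest, kruskal, median_test, levene

rng = np.random.default_rng(0)
# Simulated, deliberately skewed "LVEF" samples for three readers;
# these stand in for the study's actual measurements.
r1, r2, r3 = (rng.beta(5, 2, 500) * 100 for _ in range(3))

# One-sample Kolmogorov-Smirnov test of the pooled, standardized data
# against the standard normal distribution
pooled = np.concatenate([r1, r2, r3])
standardized = (pooled - pooled.mean()) / pooled.std()
ks_stat, p_norm = kstest(standardized, "norm")

# If nonnormal: compare readers with the Kruskal-Wallis test
# (nonparametric analog of one-way ANOVA), the Mood median test
# for differences in medians, and the Levene test for variances
_, p_kw = kruskal(r1, r2, r3)
_, p_mood, grand_median, _ = median_test(r1, r2, r3)
_, p_levene = levene(r1, r2, r3)

print(f"KS p = {p_norm:.3g}, Kruskal-Wallis p = {p_kw:.3g}, "
      f"Mood p = {p_mood:.3g}, Levene p = {p_levene:.3g}")
```

Note that estimating the mean and standard deviation from the same sample makes the standard KS critical values conservative; a Lilliefors-type correction is stricter, but the basic workflow is the same.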
Despite absences of known biases in patient population between reading physicians, different patient characteristics may have nonetheless accrued. Moreover, change in diagnostic standards or other time-based factors (one physician began reading later in the study than the others) may have also influenced echocardiographic diagnosis. Therefore, we performed multiple logistic regression to control for differences in patient age, gender, LVEF, and scan date on the predictive value of reader selection for the outcome variable studied. Both the SPSS 7.5 (SPSS, Chicago, Illinois) and SAS 6.12 (SAS Institute, Cary, North Carolina) software packages were utilized for data analysis.
A total of 6,026 echocardiograms were reviewed over the period of the study. Reader 1 interpreted 2,702 studies, representing 44.8% of all cases. Reader 2 evaluated 2,101 studies (34.9%) and Reader 3 evaluated 1,223 (20.3%) echocardiograms. The clinical characteristics of the three reader groups were similar (Table 1). The mean age of the population was 58.1 years, and there was no significant difference between readers in the mean patient age, proportion of women, or the proportion of patients with LV dysfunction (i.e., EF ≤ 35%).
The prevalence of each of the conditions under study in our entire population is illustrated in Figure 1. Mitral valve prolapse was noted in 4.4%, vegetations were identified in 0.4%, regional wall-motion abnormalities were found in 12.7%, and LV thrombus was identified in 1.1% of cases.
Regional wall-motion abnormality was commonly found by all readers (Table 2), although readers differed significantly in prevalence of regional wall-motion abnormality (p = 0.007), with the variability predominantly explained by differences between Readers 1 and 3 (p = 0.003).
The prevalence of LV thrombus (Table 2) differed between readers (p < 0.001). There was no significant difference in the prevalence of thrombus between Readers 2 and 3, who identified clot in 0.4% and 0.5% of cases, respectively (p = 0.80). However, Reader 1 recognized clot nearly five times more often than did Reader 2 (1.9% vs. 0.4%, p < 0.001) and four times more often than Reader 3 (1.9% vs. 0.5%, p = 0.001).
The prevalence of MVP varied by interpreting reader (Table 2). Readers 1, 2, and 3 identified MVP in 5.3%, 3%, and 4.8% of the cases, respectively. The p value for the combined group was 0.001, indicating a significant difference in interpretations for the three readers. Pairwise comparisons of the readers indicated that Reader 2 identified MVP significantly less frequently than did either colleague. No significant difference occurred in interpretations between Reader 1 and Reader 3 (p = 0.57). Notably, the interreader difference in the prevalence of MVP was due to differences in the prevalence of “suspected” MVP (2.3%, 0.1%, 2.4% for Readers 1 through 3, respectively, p < 0.001); there were no interreader differences in the prevalence (3%, 2.9%, 2.4% for Readers 1 through 3, respectively, p = NS) of “present” (unequivocal) MVP.
The interobserver variability was not significant with regard to the identification of valvular vegetations (Table 2). As with the aforementioned diagnoses, Reader 1 identified vegetations most frequently, in 0.7% of cases. Readers 2 and 3 identified vegetations in a similar proportion of cases, 0.3% and 0.2%, respectively. There was a trend toward a difference in interpretations between Reader 1 and Reader 2, but this did not reach statistical significance.
The effects of temporal and population factors on differences in reader output, determined by multiple logistic regression analyses, are described for each of the diagnostic outcome variables in Table 3. The date of the study had no impact on the interpretation of MVP, thrombus, and vegetations. However, scan date was a predictor of the presence of wall-motion abnormalities. There was a decreasing trend over time, from 15.7% to 12.1%, in wall-motion abnormalities. After adjusting for covariates, reader assignment remained an independent predictor of the diagnostic prevalence of MVP, thrombus, and wall-motion abnormalities. On pairwise comparison, Readers 1 and 3 differed in diagnostic prevalence of mural thrombus and wall-motion abnormalities; Readers 1 and 2 differed in the prevalence of MVP and mural thrombus.
The distributions of left atrial diameter, LV end-diastolic diameter, and LVEF for the entire population are plotted in Figure 2. In each case the distributions were not normal. The median values for each of these parameters, stratified by echocardiographic reader, are shown in Table 4. There was a statistically significant difference between readers in the measurement of left atrial dimension and LV end-diastolic dimension. However, the shape of the distribution of left atrial measurements was similar among the physicians. Although the median EF was the same among the physicians, the distribution of the measurement varied significantly.
“Quality control” refers to those techniques and activities used to assess, improve, and maintain the value of a product—that is, its quality (1). Physicians utilize various standards to measure quality in clinical practice. Using “clinical reasonability,” physicians frequently legitimize a test based on their knowledge of the patient and alternative data that may be available. In some circumstances, a physician may perform an additional test, one that has a higher sensitivity and specificity. This “gold standard,” when available, often entails more expense and risk. Another measure of quality is “reproducibility.” This may simply involve repeating the same test on the same patients and identifying whether the results are, indeed, precise.
Alternatively, one may choose to identify intraobserver and interobserver variability among a selected series of patients. Random sampling of a patient population can result in failure to evaluate conditions of low, or even moderate, prevalence. Moreover, these quality-control techniques, akin to product inspection in industrial practice, are limited by either small sample size or by prohibitive cost if widely applied. Targeting specific conditions may confer bias by alerting readers to the process of quality assessment. Each of these strategies has its flaws, and at present there is no ideal method for measuring quality.
We developed an innovative approach for quality assessment in the echocardiography laboratory. Utilizing statistical sampling of echocardiographic diagnoses, we demonstrated differences in the prevalence of diagnostic statements and differences in the measurement of various parameters among readers. Moreover, in the case of potential diagnostic ambiguity (MVP), we were able to define the diagnostic level of certainty at which readers varied in their assessments. An understanding of the types and sources of variation in echocardiographic diagnosis becomes critical in a quality-control analysis. We would like to differentiate variability in physicians’ interpretations from random variability intrinsic to an observational database. Such analysis falls under the category of “statistical quality control.”
Although reduction in the variation of any process is beneficial, its elimination is impossible because of the many inevitable small, unobservable, and random effects that will influence the output. Quality-control theory (1), developed primarily to describe industrial processes, has named this random variation “controlled variation.” It is measurable and should be equal between echocardiography readers. In contrast, “uncontrolled variation” is due to special systematic causes that arise sporadically and for reasons outside the normally functioning procedure. Several factors may have accounted for the uncontrolled variation in reader interpretation.
First, in the absence of standardized definitions for various entities, physicians use different sets of criteria for diagnoses. Although three readers may identify the same visual finding, one may note a firm echocardiographic diagnosis, another may describe it as “equivocal,” while the third dismisses it entirely. Also, the assignment of a diagnosis may be biased by the implications carried by the condition, particularly when the diagnosis is questionable. Finally, variability in interpretation skill or concentration may have accounted for the systematic interreader differences.
There are several advantages of an epidemiological approach to quality assessment. The use of an existing database for quality assessment is less cumbersome and labor-intensive than reader reproducibility studies, or test replication trials. This is of particular importance in an increasingly stringent medical economy that nonetheless requires demonstration of product quality. Of equal or greater importance, even substantial differences in diagnostic interpretation are likely to be missed by reader reproducibility assessments if the prevalence of the assessed condition is low. The use of large, statistically robust samples maximizes the likelihood of detecting “uncontrolled variation” of diagnostic output whether due to reader, sonographer, or machine variability. Moreover, the utilization of appropriate statistical methods allows identification of, and statistical adjustments for, assignment biases and differences in patient characteristics.
There were several limitations to our study design. Assignment bias may have existed in the distribution of echocardiographic studies among the individual physician readers, resulting in differences in prevalence. However, even after adjustment for demographics, time factors, and LV function, the physician reading the study remained an independent predictor of diagnostic prevalence and quantitative measurement. Failure to find interreader diagnostic differences may have occurred because a sample of 6,000 patients may be inadequately powered for conditions of low prevalence (type II error). However, in the present study, the discovery of uncontrolled variation in diagnostic output in conditions of low population prevalence such as LV thrombus is unlikely to have occurred using conventional quality assessment approaches in clinical echocardiography.
The findings of interreader differences may vary according to how diagnoses are coded and grouped. For example, our readers differed in the prevalence of “suspected” but not in unequivocal mitral prolapse. Hence, merging suspected with unequivocal diagnoses provided different information than separate analyses. Nonetheless, the flexibility of this approach to quality assessment allows physicians to design queries that are pertinent to their clinical practice, and to modify diagnostic styles as appropriate. In this example, to achieve uniformity, laboratory readers would have had to decide whether to eliminate reporting of equivocal cases of prolapse, or request that the second reader reassess his or her threshold for consideration of the presence of mitral prolapse. Such decisions need to be made within the context of appropriate medical practice. If it were considered that borderline evidence of mitral prolapse has little prognostic importance, and increases the likelihood for undue concern on the part of the patient, then the first choice would be appropriate. In the case of vegetations and thrombi, we considered suspected and definite diagnoses to have similar clinical implications. Hence, grouping these to achieve statistical power was deemed appropriate.
Finding variation between readers in measurements or categorical diagnosis does not detect which readers are correct. Independent verification of diagnoses with a “gold standard” would be required. That could be done by comparison with another, better test or by comparison with pathologic anatomy at surgery or autopsy. However, redundant testing is not encouraged by the current economic climate, and anatomic confirmation is only rarely available.
Readers may differ in diagnostic thresholds for labeling echocardiographic findings as representing pathology. For example, an increase in echocardiographic density and thickness of a valve noted equally by two physicians may be described as “suspected” vegetation by one, and “nonspecific” valvular thickening by the other, even though both observe the same characteristics of the image. However, whether interreader differences in diagnostic prevalence result from criteria differences, or differences in perceptions of the image data, they are nonetheless causes of product variability as perceived by patients and by referring physicians. Hence, they are problems of quality control.
Our study indicates that database monitoring permits efficient quality assessment and hence opportunities for quality improvement. One such approach to diminish uncontrolled variation in echocardiographic diagnostic output would be selection of diagnostic problems identified by database analysis, followed by entrainment of parallel reading styles among the interpreting physicians by joint reading. Efficacy could then be determined by reanalysis of the database subsequent to reader training. A statistical approach could also be utilized (i.e., how does one lab compare to another, or to all others?). Perhaps the future use of shared database formats may allow benchmarks to be determined for uncontrolled variation.
In conclusion, substantial interreader differences may exist in clinical practice. Assuming assignment biases are absent or identifiable, computerized databases facilitate ongoing monitoring of interreader bias despite low diagnostic prevalence. Industrial models of quality control may be important not just for echocardiography but also for other diagnostic and therapeutic aspects of medical care.
1 Dr. Berger’s present address is Division of Cardiology, Yale–New Haven Medical Center, New Haven, Connecticut. Dr. Gottdiener’s present address is Cardiology Research, St. Francis Hospital, 100 Port Washington Boulevard, Roslyn, New York 11576.
Abbreviations
- ANOVA = analysis of variance
- EF = ejection fraction
- LV = left ventricle, left ventricular
- MVP = mitral valve prolapse