Author information
- The Academic Medical Center-University of Amsterdam, Naarden, the Netherlands (Emeritus Professor of Clinical Epidemiology and Biostatistics)
- Biostatistics, Christiana Care Health System and the Christiana Care Center for Outcomes Research, Wilmington, Delaware
Recently, the American Statistical Association (ASA) released a statement on the use and misuse of p values in scientific research in response to a growing concern regarding the misunderstanding of what a p value is and common misuses and misinterpretations in the published scientific data (1). The ASA statement informally defines a p value as “the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value” (1). They identify and discuss 6 principles related to the use of p values in scientific research.
• Principle 1. p values can indicate how incompatible the data are with a specified statistical model.
• Principle 2. p values do not measure the probability that the studied hypothesis is true or the probability that the data were produced by random chance alone.
• Principle 3. Scientific conclusions and business or policy decisions should not be based only on whether a p value passes a specific threshold.
• Principle 4. Proper inference requires full reporting and transparency.
• Principle 5. A p value, or statistical significance, does not measure the size of an effect or the importance of a result.
• Principle 6. By itself, a p value does not provide a good measure of evidence regarding a model or hypothesis.
In light of this statement from the ASA, we comment on the use and reporting of p values.
Statistical and Clinical Significance
Almost every statistical methods paragraph includes the statement (or some variation), “A p value ≤0.05 was considered statistically significant.” Entrenched in tradition as the value 0.05 is, it is an arbitrary cutoff value for statistical significance. A p value ≤0.05 does not confer importance or meaningfulness (nor does a p value <0.001). In many results sections, however, it becomes apparent that 0.05 was meant to be the “cut-off” for variable importance, that is, statistical significance equals importance or meaningfulness. Statements such as “the difference was highly significant (p = 0.001),” “there was a trend toward significance (p = 0.06),” or “variables that were significant at p ≤ 0.05 in univariable analysis were included in a multivariable model” indicate a value judgment regarding variable importance. A p value of 0.001 would be statistically significant if the designated cutoff was 0.05, but it would not be statistically significant if the designated cutoff was 0.0001. A “trend toward significance” is probably intended to mean that the observed p value was “close” to (define close) but a little larger than the stated level of statistical significance—thus, the result is “close to being important.” Selecting variables for a multivariable model on the basis of the variable’s p value when analyzed separately implies that the p value confers “importance.” Not only that, but it is bad statistical practice as well. Variables that are not statistically significant when analyzed separately may, in fact, contribute significantly to the model when included with other variables. It is also true that variables that are significant in univariable analysis may not contribute statistically to the model when included with other variables.
p Values and Confidence Intervals
In a randomized trial, the p value is driven by the magnitude of the treatment benefit (expressed as a relative risk or a risk difference) and the sample size. A small benefit (a relative risk of 0.90 or a risk difference of 0.5%) is associated with a small p value (p < 0.001) in a very large trial, whereas it is associated with a nonsignificant p value in a small trial. For this reason, p values should be used only as a secondary consideration, after the evaluation of treatment benefit, preferably in terms of both relative risk and risk difference. The statistical uncertainty in the estimated treatment benefit is primarily quantified via the 95% confidence interval (CI), which represents a range of treatment benefits compatible with the observations of the trial. If the 95% CI excludes 1 for the relative risk (or 0 for the risk difference), the outcome of the trial is incompatible with the null hypothesis of no treatment benefit. The p value follows the 95% CI: if the 95% CI excludes 1 for relative risk or 0 for risk difference, the associated p value falls below 0.05. In other words, the p value adds little or nothing to the 95% CI. We, therefore, promote the use of effect estimates together with CIs, supplemented with p values if needed.
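The dependence of the p value on sample size can be illustrated with a minimal sketch. The trial counts below are hypothetical (a 5.0% event rate in the control arm and 4.5% in the treated arm, i.e., a relative risk of 0.90 and a risk difference of 0.5 percentage points), and the function name `summarize_trial` is ours. The calculation assumes a large-sample (Wald) approximation on the log relative risk; other methods would give slightly different intervals.

```python
import math


def summarize_trial(events_treated, n_treated, events_control, n_control):
    """Return RR, risk difference, a Wald 95% CI for the RR, and a two-sided p value."""
    risk_treated = events_treated / n_treated
    risk_control = events_control / n_control
    rr = risk_treated / risk_control
    rd = risk_treated - risk_control

    # Large-sample standard error of log(RR).
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    z = math.log(rr) / se_log_rr

    # Two-sided p value under a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # 95% CI built on the log scale, then transformed back.
    ci_low = math.exp(math.log(rr) - 1.96 * se_log_rr)
    ci_high = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return {"rr": rr, "rd": rd, "ci_low": ci_low, "ci_high": ci_high, "p_value": p_value}


# Hypothetical data: the same 4.5% vs. 5.0% event rates in both trials.
small_trial = summarize_trial(45, 1000, 50, 1000)
large_trial = summarize_trial(9000, 200000, 10000, 200000)

print("small trial:", small_trial)
print("large trial:", large_trial)
```

Under these assumptions, both trials estimate the same relative risk of 0.90, but the small trial yields a wide CI that includes 1 and a p value well above 0.05, whereas the very large trial yields a narrow CI that excludes 1 and a p value far below 0.001. The effect estimate and its CI carry the information; the p value only echoes the CI.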
Tables (e.g., the typical Table 1) comparing ≥2 groups with respect to a large number of variables often appear in papers submitted for publication as well as in published papers. Along with summary statistics, a p value accompanies each variable, indicating that a hypothesis with respect to differences between groups has been tested. In this practice, the p value, perhaps unintentionally, is being used to identify supposedly important or meaningful differences. The statement in the statistical methods section regarding the statistical significance of p ≤ 0.05 does not raise any eyebrows, but if stated more accurately, such as “We tested 25 (or more) hypotheses and set p ≤ 0.05 as the cutoff for statistical significance for each one,” it would likely cause more concern. For long lists of variables where comparisons are made between groups, a better strategy for comparison would be to replace the p value column with a column of standardized differences. This would speak more to the importance of the difference, and although there may still be controversy over what is an important difference, the controversy can be debated among experts in the field without recourse to a p value that does not convey importance—and is arbitrary as well. In studies with very large sample sizes, the inclusion of a column of p values is even less informative because small, even trivial differences may be statistically significant.
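As a rough sketch of the alternative described above, a standardized difference can be computed from the summary statistics that a typical baseline table already reports. The formulas below are one common definition (difference divided by the square root of the average of the two group variances); the function names and the example values are hypothetical. The last function shows why 25 tests at p ≤ 0.05 deserve concern: assuming the tests are independent and every null hypothesis is true, the chance of at least one "significant" result is 1 − 0.95^25.

```python
import math


def std_diff_continuous(mean_1, sd_1, mean_2, sd_2):
    """Standardized difference for a continuous variable (means and SDs per group)."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


def std_diff_binary(prop_1, prop_2):
    """Standardized difference for a binary variable (proportions per group)."""
    pooled_sd = math.sqrt((prop_1 * (1 - prop_1) + prop_2 * (1 - prop_2)) / 2)
    return (prop_1 - prop_2) / pooled_sd


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """P(at least one p <= alpha) when all nulls are true and tests are independent."""
    return 1 - (1 - alpha) ** n_tests


# Hypothetical baseline characteristics for two groups.
age_diff = std_diff_continuous(64.0, 10.0, 63.5, 10.0)   # mean age (SD)
diabetes_diff = std_diff_binary(0.30, 0.25)              # proportion with diabetes

print(f"age: {age_diff:.3f}, diabetes: {diabetes_diff:.3f}")
print(f"25 tests at 0.05: {chance_of_any_false_positive(25):.2f}")
```

Unlike a p value, a standardized difference does not shrink or grow with the sample size, so the same column can be read the same way in a registry of 100,000 patients and in a study of 100. With 25 independent tests, the chance of at least one false-positive result is about 0.72 under the stated assumptions.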
On the other hand, p values that are greater than the stated cutoff value cannot be interpreted to mean that there is no real difference between study groups. Nonsignificant p values only mean that evidence was not found for rejecting the null hypothesis. Maybe there is no real difference between groups, but maybe the p value is nonsignificant because the study lacked power (i.e., large enough sample size) to detect the difference the investigator thought was important or meaningful from a clinical perspective.
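The role of power can be sketched with a simple normal approximation for comparing two proportions. The event rates (20% vs. 15%) and sample sizes are hypothetical, the function name `approx_power` is ours, and the formula ignores the negligible contribution of the opposite tail; dedicated sample size software would use more refined methods.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approx_power(risk_control, risk_treated, n_per_group, z_alpha=1.96):
    """Approximate power of a two-sided test (alpha = 0.05) for two proportions."""
    se = math.sqrt(
        risk_control * (1 - risk_control) / n_per_group
        + risk_treated * (1 - risk_treated) / n_per_group
    )
    return normal_cdf(abs(risk_treated - risk_control) / se - z_alpha)


# Hypothetical scenario: a true reduction from 20% to 15%.
power_small = approx_power(0.20, 0.15, n_per_group=100)
power_large = approx_power(0.20, 0.15, n_per_group=1000)

print(f"n = 100 per group:  power = {power_small:.2f}")
print(f"n = 1000 per group: power = {power_large:.2f}")
```

Under these assumptions, a study with 100 patients per group has a power of about 0.15, so a nonsignificant p value is the expected result even though the difference is real; with 1,000 patients per group the power is about 0.84.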
Some years ago, it was common practice in several scientific journals to do a post hoc calculation of power if the results were not statistically significant. Essentially, the power calculation was used to make further inferences about the meaningfulness of the results. Results, even nonsignificant results, may be used to estimate sample size and calculate power for a future study, but a post hoc power calculation cannot be used to make inferences about the current study (2). We think this practice has never been promoted by JACC journals, although it has appeared a few times as a suggestion for investigators.
p Values and Small Sample Size
For a number of legitimate reasons, some studies can include only a small number of patients/subjects (e.g., rare disease). In these studies, as well as in studies that could not, for one reason or another, enroll enough patients to meet the required sample size calculation, application of statistical hypothesis testing and interpreting the results may be misleading—statistically nonsignificant results could likely be attributed to lack of power, and statistically significant results may be spurious. The question of whether the sample is representative of the population is always present, even if the sample size is “adequate” on the basis of a power calculation, but it is even more of a concern with small samples. Depending on how small the study, a descriptive study with standardized differences may be more appropriate than an inferential approach with p values.
Statistical hypothesis testing begins with the assumption that the null hypothesis (e.g., no difference between means, zero correlation) is true. Given this assumption, a p value is the probability that the test statistic (e.g., t or chi-square) would be as large as or larger than the value actually calculated if the play of chance were the only explanation for the observed differences. A p value less than or equal to the pre-determined cutoff value, whether it be 0.05 or some other value, implies that we have evidence to reject the null hypothesis; it does not mean that the null hypothesis is false or that the alternative hypothesis is true. A p value greater than the pre-determined cutoff value implies that we do not have enough evidence to reject the null hypothesis; it does not mean that the null hypothesis is true. A p value is one piece of evidence for assessing the adequacy, usefulness, or importance of a statistical model, whether it be a simple comparison of means or a multivariable model predicting an outcome. But, by itself, it is not a good indicator of model adequacy, usefulness, or importance. Statistics play an important role in scientific research when sampling from a population. Understanding that role and appropriately applying statistical reasoning is necessary for good science.
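The definition of a p value as a probability under "the play of chance" can be made concrete with an exact permutation test, used here only as an illustration and not as a method recommended in this commentary. The data are hypothetical, and the function name `exact_permutation_p_value` is ours. If the null hypothesis is true, the group labels are arbitrary, so the sketch enumerates every way of relabeling the observations and counts how often the difference in means is at least as extreme as the one observed.

```python
from itertools import combinations


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    as_extreme = 0
    n_relabelings = 0
    # Every choice of n_a observations is one possible relabeling of group A.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            as_extreme += 1
        n_relabelings += 1
    return as_extreme / n_relabelings


# Hypothetical measurements in two groups of four subjects.
group_a = [12, 15, 14, 16]
group_b = [9, 10, 11, 8]
p_value = exact_permutation_p_value(group_a, group_b)
print(f"p = {p_value:.4f}")
```

With these data, only 2 of the 70 possible relabelings produce a difference as large as the observed one, so p = 2/70 ≈ 0.029. That number describes how unusual the data are if chance alone is at work; it is not the probability that the null hypothesis is true.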
Profs. Tijssen and Kolm serve as statistical reviewers for the JACC journals.
- American College of Cardiology Foundation