While statistical significance is often mistaken for an indication of practical importance or scientific relevance, an even greater mistake is to believe that statistical non-significance indicates equivalence or "no difference". It does not. Statistical non-significance reflects uncertainty, which may simply indicate that the sample size was too small.
Jonas Ranstam PhD
https://ranstam.eu @jonasranstam
Not all medical scientific publications present evidence-based research. Many, if not all, hypothesis presentations, non-systematic reviews, and case reports are authority based rather than evidence based. Such publications may also have a role to play in the progress of science. It should, however, always be made clear to the reader whether the author's ambition has been to present a personal opinion or an objective and reproducible research finding. From an editorial point of view, it may be difficult to distinguish between these two types of manuscripts; the author's ambition is seldom declared, and statistical inference is often misused. Personal opinions should, of course, not be statistically reviewed, because statistical reviewing can make a good expert opinion appear to be bad and a poor one appear to be good.
A manuscript that is entirely based on assumptions presents a hypothesis. Manuscripts written for presentation of empirical findings must be based on data. As W. Edwards Deming said, "In God we trust; all others must bring data". However, in order to analyse data, assumptions must be made. When presenting the analyses and their results, the author must clearly distinguish between observations, assumptions, and analysis outcomes. Confusing assumptions with outcomes is not a good thing.
Modern medical research claims to be objective and reproducible. The reproducibility has, however, recently been questioned. One explanation for this could be that while the findings may appear to be based on sound objective research, they actually just represent subjective opinion.
In contrast to well-performed clinical trials, many laboratory studies, including those based on statistically correct methods, do not have a well-defined, pre-specified study design linking the investigated study hypothesis with the many statistical null hypotheses tested by the investigator. Instead of letting the experiment directly provide the outcome of the study, as in a randomised trial, the investigator uses his or her expert knowledge to interpret an abundance of p-values and to formulate an expert opinion about the study hypothesis.
Apart from the subjectivity and fallibility of this experimental strategy, another drawback is that statistically oriented reviewer comments, for example regarding the consequences of misinterpreted p-values and unfulfilled methodological assumptions, tend to be perceived as a questioning of the investigator's biomedical expertise, which does not facilitate methodological improvement.
Statistical significance has nothing to do with practical importance or scientific relevance; statistical significance reflects sampling uncertainty. Moreover, the number of statistically significant findings that can be expected in a study is related to the study design, not least sample size, the number of statistical tests performed, and the strategy used for addressing multiplicity issues.
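As a toy illustration of this point (the numbers are mine, not from any reviewed manuscript): when every null hypothesis is true, p-values are uniformly distributed, so a study performing m tests at level α can expect about m·α spurious "significant" findings.

```python
import random

random.seed(7)
m, alpha = 200, 0.05                      # 200 tests, all null hypotheses true
ps = [random.random() for _ in range(m)]  # under the null, p-values are uniform on (0, 1)
hits = sum(p < alpha for p in ps)
print(hits)  # about m * alpha = 10 "significant" findings by chance alone
```

Nothing here is a real effect; the "discoveries" are produced by the test count alone.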
Successful investigators develop the design of their experiments in a way that enables detection of practically important and scientifically relevant differences or effects. Parts of such a development are a sample size calculation based on a reasonable estimate of what is practically important, a procedure for data collection that prevents selection bias and confounding, and a strategy for addressing multiplicity issues. In observational research, similar problems have to be resolved in the statistical analysis instead of in the study design. However, entirely disregarding these problems and just interpreting statistical significance as an indication of practical importance and statistical non-significance as an indication of equivalence reflects a fundamental misunderstanding. P-values are not, and have never been, a substitute for scientific reasoning.
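The sample size calculation mentioned above can be sketched with the standard normal-approximation formula for comparing two means, n = 2((z₁₋α/₂ + z₁₋β)·σ/δ)² per group. This is a simplified sketch only; a real trial may need a t-distribution refinement, dropout inflation, or a different endpoint type.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for detecting a difference
    delta between two means with common standard deviation sd."""
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha / 2), z(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

print(n_per_group(delta=5, sd=10))  # 63 per group for a 5-unit difference, SD 10
```

Note that the answer is driven entirely by the pre-specified practically important difference delta; halving it quadruples the required sample size.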
Using correct terminology is important for avoiding misunderstandings. For example, the terms univariate and multivariate are often misunderstood. The terms refer to the type of probability distribution a model is based on. A univariate statistical model is based on a univariate probability distribution, i.e. it has one outcome variable, and a multivariate analysis is based on a multivariate probability distribution, i.e. the model has multiple outcome variables. An ANOVA model, for example, is univariate and has one outcome variable, but a MANOVA model is multivariate because it has more than one outcome variable.
A regression model can have one or more regressors. A regression analysis with one outcome variable and one regressor is known as a simple regression analysis; with multiple regressors it is a multiple regression analysis. In order to change the common misuse of the description "multivariate" for univariate multiple regression models, the term "multivariable" has been coined. This term just says that the statistical model includes multiple variables. By analogy, a simple regression model should have been called bivariable, but it is described as a univariable model.
In summary, even if it is possible to analyse a multivariate multivariable statistical model, most multivariable models are univariate.
Many authors are confused about whether the purpose of writing a research report is to describe or to generalise. Some authors also seem to believe that p-values and confidence intervals are descriptive measures that must be used to describe the importance of what has been observed in a studied group of subjects. This is not the case; p-values and confidence intervals describe generalisation uncertainty. However, the question is more complicated than this. Generalising with the help of p-values and confidence intervals must sometimes be performed differently depending on the type of population studied.
Most randomised trials, cohort studies, and case-control studies are not performed for the participating subjects themselves but for the benefit of future patients, an infinite population. A survey, on the other hand, is usually performed to learn about a finite population defined in time and space. While an infinite population can only be studied using samples, a finite population can be studied both with samples and with censuses. Analysing a sample from a finite population may require different calculations than a sample from an infinite population.
A sample drawn from an infinite population is usually considered to be a simple random sample. Surveys usually have a more complicated sampling design, and this needs to be accounted for in the analysis. A finite population correction (FPC) may also be necessary. Ignoring the sampling design and analysing survey data as if they had been collected as a simple random sample is likely to yield too small standard errors, too narrow confidence intervals, and too low p-values.
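A minimal sketch of the finite population correction (the function name and data are illustrative, not a library API): when a large fraction of a finite population has been sampled, the standard error of the mean shrinks by the factor √((N − n)/(N − 1)).

```python
import math
import statistics as st

def mean_se(sample, N=None):
    """Standard error of the sample mean; applies the finite population
    correction sqrt((N - n) / (N - 1)) when the population size N is given."""
    n = len(sample)
    se = st.stdev(sample) / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

sample = [52, 48, 50, 55, 47, 49, 53, 51, 50, 46]
print(mean_se(sample))         # SE assuming an infinite population
print(mean_se(sample, N=40))   # smaller SE: 10 of only 40 units were sampled
```

With a census (n = N) the correction drives the standard error to zero, which is the intuition behind it: there is no sampling uncertainty left.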
Another, more philosophical, difference is that the analysis of a sample from an infinite population prioritises internal validity (i.e. an unbiased description of cause-effect relationships between variables). For survey data, the aim is instead to achieve as high external validity (i.e. an unbiased description of the population's properties) as possible.
In observational clinical research, statistical models are primarily used for two purposes: developing algorithms for individual prediction, and estimating average effects of treatments or of exposure to hazardous agents. Confusingly for many authors, these two modelling purposes require different methodological approaches.
While the best prediction model is the model that predicts best (whether or not the parameter estimates are biased is irrelevant), and this is evaluated using the area under the ROC curve, the best explanatory model is the one with the least biased parameter estimates (prediction accuracy is irrelevant). This requires considerations regarding cause-effect relationships. For example, confounders are included in the model to reduce confounding bias, but including a factor on the pathway between cause and effect would be a mistake because this would induce adjustment bias.
It is usually wise to avoid presenting risk estimates from prediction models and predictions from explanatory models.
The new version of Stata (release 16) includes LASSO regression. This is excellent because LASSO is one of the better methods for developing prediction and classification models. However, like stepwise regression, it is unfit for producing parameter estimates with adjustment for confounding bias. The adjustment must be based on considerations regarding cause-effect relations (i.e. confounders must be included in, and mediators and colliders excluded from, the statistical model used for the estimation). This information cannot be derived from data.
I fear that we will soon see publications with LASSO regression being used for the wrong purpose.
Addition July 7, 2019
Citation from Stata:
"Lasso is intended for prediction and selects covariates that are jointly correlated with the variables that belong in the best-approximating model. Said differently, lasso estimates the variables that belong in the model. Like all estimation, this is subject to error. However you put it, the inference methods are robust to these errors if the true variables are among the potential control variables that you specify."
The condition "if the true variables are among the potential control variables that you specify" is crucial. The last sentence should be read with the emphasis on "you specify". Don't expect that the Lasso method can help you.
Some manuscripts are based on a detailed description of a series of patients and have a conclusion restricted to what has been observed. This is what could be expected of a case-series report. However, the same patients could also have been considered a random sample drawn from and representing a greater population of patients, perhaps including future ones. In this case, the findings cannot be directly generalised to the greater population because of sampling uncertainty. When attempting to describe the underlying parameters of the population (including the sample), the uncertainty needs to be presented. This is what p-values and confidence intervals are used for. The inclusion of these measures in a descriptive report, written without any ambition to evaluate underlying mechanisms or effects, indicates methodological confusion.
A study hypothesis can usually not be evaluated by a statistical test of a single null hypothesis. The study may be based on comparisons of more than two groups and of more than one endpoint. Several, perhaps hundreds of, statistical tests can then be found in a manuscript, and when multiple null hypotheses are tested, the false positive risk increases with the number of tested hypotheses. In confirmatory studies, the significance level may need to be corrected for this multiplicity. One often-used method is named after an Italian statistician, Bonferroni.
One common misunderstanding is that all multiplicity problems are solved by correcting the significance level for the number of group comparisons. This leads, however, to an insufficient correction when multiple endpoints are ignored.
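A small simulation (illustrative, not from any reviewed manuscript) shows the size of the problem: with 20 independent tests of true null hypotheses, the chance of at least one p < 0.05 is about 1 − 0.95²⁰ ≈ 64%, while testing at the Bonferroni-corrected level 0.05/20 restores the family-wise error rate to roughly 5%. Note that the 20 tests must count all group comparisons across all endpoints, not the comparisons of a single endpoint only.

```python
import random

random.seed(0)
m, alpha, runs = 20, 0.05, 10_000
naive = bonferroni = 0
for _ in range(runs):
    ps = [random.random() for _ in range(m)]  # null p-values are uniform
    naive += any(p < alpha for p in ps)       # at least one false positive?
    bonferroni += any(p < alpha / m for p in ps)

print(naive / runs)       # roughly 0.64: family-wise error far above 5%
print(bonferroni / runs)  # roughly 0.05: corrected level restores it
```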
It is a common belief that a statistically significant finding always is practically important and that statistical nonsignificance is a good indication of "no difference". This belief is a major mistake. Statistical significance is a measure of uncertainty, not of importance; practical importance has to be shown by other means than p-values. Equivalence and non-inferiority can only be statistically tested when an equivalence or non-inferiority margin, specifying the practical importance, has been defined.
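Once a margin has been defined, equivalence is conventionally tested with a two one-sided tests (TOST) procedure. Here is a minimal normal-approximation sketch (the function name and input values are my own illustration, not a library API):

```python
from statistics import NormalDist

def tost_equivalent(mean_diff, se, margin, alpha=0.05):
    """Two one-sided tests: declare equivalence only when the
    (1 - 2*alpha) confidence interval for the difference lies
    entirely inside the pre-specified margin (-margin, margin)."""
    z = NormalDist().inv_cdf(1 - alpha)
    lower, upper = mean_diff - z * se, mean_diff + z * se
    return -margin < lower and upper < margin

print(tost_equivalent(mean_diff=0.5, se=1.0, margin=3.0))  # True
print(tost_equivalent(mean_diff=0.5, se=2.0, margin=3.0))  # False
```

The second call fails not because the difference is larger but because the uncertainty is: a non-significant difference with a wide confidence interval is exactly the situation that must not be reported as "no difference".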
When evaluating the effect of a treatment, it may be tempting to perform the treatment on a group of subjects that have scored extremely on some measurement, and then measure the subjects again after the treatment. The difference in the measured values would provide a good estimate of the treatment effect. Or wouldn't it?
The answer is that if measurement errors and accidental variation affect the measurements randomly, more subjects will be included with too high values than with too low, and at the next measurement they will in general not be as unlucky; their measured values will tend to be less extreme. This statistical phenomenon is known as regression to the mean. The only practical way to properly account for such regression effects is to include a control group selected using the same criteria as the treated group. Treatment effects can then be separated from regression effects in the statistical analysis.
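A toy simulation (my own illustrative numbers) makes the phenomenon visible: subjects selected for an extreme first measurement score markedly closer to the population mean the second time, with no treatment given at all.

```python
import random
import statistics as st

random.seed(42)
n = 100_000
true = [random.gauss(100, 10) for _ in range(n)]      # stable true values
first = [t + random.gauss(0, 10) for t in true]       # measurement 1 + error
second = [t + random.gauss(0, 10) for t in true]      # measurement 2 + error

# Select "extreme" subjects on the first measurement; apply no treatment
extreme = [i for i in range(n) if first[i] > 120]
m1 = st.mean(first[i] for i in extreme)
m2 = st.mean(second[i] for i in extreme)
print(m1, m2)  # the second mean is clearly lower: regression to the mean
```

A naive before-after comparison would credit this entire drop to the "treatment"; a control group selected by the same cutoff would show the same drop and expose it as a regression effect.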
Given that only three quartiles are defined, the middle one also known as the median (see The International Statistical Institute. The Oxford Dictionary of Statistical Terms. Oxford University Press, New York 2003), it is surprisingly common to see results presented with four quartiles. The explanation is, of course, that the term is misunderstood. The misunderstanding is actually so common that the Merriam-Webster dictionary states that the four quartiles are the same as the four quarters defined by the three quartiles. Confusing? While the exact definition may be of minor importance when writing fiction, avoiding misunderstandings is a crucial part of scientific writing. Stick to the statistical definition of statistical terms.
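Python's standard library follows the statistical definition: statistics.quantiles(data, n=4) returns exactly three cut points, the middle one being the median (the sample data here are my own example).

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12, 13, 15, 18]
q = statistics.quantiles(data, n=4)  # three quartiles, not four
print(q)                             # [4.0, 9.0, 13.0]
print(q[1] == statistics.median(data))  # True: Q2 is the median
```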
Statistical reviewing is performed for the benefit of the reader. The main purpose is to make sure that the limitations imposed upon a study's findings by the authors' data collection, study design, and statistical analysis are clearly presented to the reader. Spending several hours on writing rebuttal letters in order to avoid addressing problematic issues in the manuscript is a bad idea.
The common claim that a study "proves" or "demonstrates" the correctness of a study hypothesis usually reveals the authors' readiness to exaggerate the importance of their findings. The words "indicates" or "suggests" are often more appropriate, and their usage may give the reader a better impression of the authors' judgement.
During the last 25 years I have reviewed over 6000 research reports submitted to more than 75 different scientific journals, most of them medical. Some of these reports have presented reliable and groundbreaking research results, but far too many have just represented a serious waste of resources that could have been used to find better medical treatments, reduce suffering, and prolong life.
This experience has changed my life; trying to identify empirical support for presented findings but finding misunderstandings, inconsistencies, and obscurities, and trying to explain the discovered problems to mostly non-statistically oriented authors, have developed me as a statistician. Statistics is a difficult subject.
The experience has also made me sceptical. The recent irreproducibility crisis in medical research did not come as a surprise, and I expect that it soon will spread into many other scientific fields. An inadequate statistical education in combination with a rapid development of computational resources and a widespread "publish or perish" culture, have not only resulted in frequent mistakes and errors; an even worse problem is a systematic exaggeration of the presented findings and the diminishment of their limitations and uncertainty.
There is only one solution to these problems: a better understanding of fundamental methodological principles among authors and readers. To that end, I have started this blog with comments inspired by my daily reviewing. These comments do not reveal any content of the reviewed reports; manuscripts submitted for publication are confidential.
Randomised trials and other clinical studies are performed to learn about the effects of a drug or treatment among all patients, not least future ones and not just the ones included in the study. Generalisation from a small sample to an infinite population can, however, not be made without uncertainty, and the magnitude of this uncertainty is often presented in terms of p-values. These values depend on sample size and variability and have, in themselves, nothing to do with clinical relevance or scientific importance. Statistically significant findings are therefore not necessarily clinically important. The evidence for a statistically significant finding's clinical importance remains to be shown, and the interpretation of statistical nonsignificance as evidence of "no difference" is a common mistake. Furthermore, comparing the statistical significance of a factor studied by several researchers, some finding significance, others not, is not meaningful without considerations regarding the effects of sample size and heterogeneity of the studied patients.
It is important to be specific when writing a scientific manuscript. For example, are you presenting a descriptive study, or do you wish to test a hypothesis? If you have a study hypothesis, what is it? How has it been operationalised into one or more statistically testable null hypotheses? Is the study design cross-sectional or longitudinal? In the latter case, is it a cohort study or a case-control study? What associations are you studying? Cause-effect relations are crucial for the analysis. If they are unclear, you have to work with assumptions. Can your estimates be interpreted clinically? This is not always the case with odds ratios and hazard ratios. If the estimates cannot be meaningfully interpreted, they are not useful. Change the statistical method!