
Jantine Broek

Computational and AI/ML Scientist




Statistical Inference & Reproducibility
Part III: Statistics vs Reproducibility

This blog series addresses the relationship between reproducibility and statistical inference. In the previous posts, I introduced statistical inference and explained the two main approaches to it: Frequentist and Bayesian. With this insight, we can continue our discussion of how statistical inference strategies contribute to reproducibility, and how proper reporting of the statistical inference method used adds to the transparency of research.

When you search for “reproducibility crisis and statistical inference” on the internet, you will see that this theme has been a center of debate for quite some years. You will also find that different people often point to one of the inference methods as either the cause of or the solution to the reproducibility problem. I explained Frequentist and Bayesian inference in the previous post, and you probably saw that these inference methods are not totally separable. Discussions about choosing one over the other are therefore not entirely fair, as they share many commonalities. In addition, the inference type depends on the question asked. Generally, when analyzing data and designing experiments, you will use a variety of techniques. Some of these are Frequentist tools, some are Bayesian, some are both, and some don't use probability at all! A casino will do just fine relying exclusively on Frequentist statistics, but a sports team is better served by a Bayesian method that helps it avoid overpaying for players who were simply lucky to score a goal.

Using Frequentist inference

Understanding that choosing an inference method is based on reasoning, argumentation, and stated hypotheses is the first step towards improving the reproducibility of statistical inference. After choosing the inference type, specifying the methods related to the chosen technique further improves the reproducibility of a given study.

When using a Frequentist method, it is important not to see effects associated with a p-value of 0.05 and immediately take them as robust evidence of an effect. To avoid the incorrect inferences that come from such a hard threshold, you could, for example, adopt a more stringent significance level, or report exact p-values together with effect sizes and confidence intervals.

While a more stringent p-value threshold will help guard against false inferences, it may be that neither approach provides much value to a hypothesis test.
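One way to move beyond a binary significant/not-significant verdict is to report the estimated effect together with a confidence interval. The sketch below uses hypothetical, simulated data for two groups and a simple normal-approximation interval; the group names and numbers are illustrative assumptions, not from the post.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical data: two groups of 50 measurements (illustrative values).
control = [random.gauss(10.0, 2.0) for _ in range(50)]
treated = [random.gauss(10.8, 2.0) for _ in range(50)]

def mean_diff_ci(a, b, z=1.96):
    """Mean difference between two samples with a ~95% normal-approximation CI."""
    diff = statistics.mean(b) - statistics.mean(a)
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = mean_diff_ci(control, treated)
print(f"effect estimate: {diff:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

Reporting the interval conveys both the size of the effect and the uncertainty around it, which a lone comparison against 0.05 hides.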

Probability of truth

A claim is likely to be true when the finding can be reliably repeated. Therefore, it would help if we could assign a probability of truth to a hypothesis. The Frequentist approach does not allow us to assign such a probability to a hypothesis. In Bayesian inference, by contrast, we saw that such a probability is assigned to the hypothesis: the posterior probability combines the prior, which expresses how likely the hypothesis was to be true before the experiment, with the likelihood, which captures the strength of the new experimental evidence. Therefore, looking into the prior probability of various kinds of hypotheses may matter, although these can be more difficult to tease apart and model.
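The idea can be made concrete with Bayes' rule: given a prior probability that a hypothesis is true, the power of the test, and the false-positive rate, we can compute the probability that the hypothesis is true after observing a "significant" result. The numbers below (prior 0.1, power 0.8, alpha 0.05) are illustrative assumptions, not from the post.

```python
def posterior_truth(prior, power=0.8, alpha=0.05):
    """P(H true | significant result) via Bayes' rule.

    prior: P(H true) before the experiment;
    power: P(significant | H true);
    alpha: P(significant | H false). All values here are illustrative.
    """
    evidence = power * prior + alpha * (1.0 - prior)
    return power * prior / evidence

# A long-shot hypothesis (prior 0.1) yields only a modest posterior
# even after a "significant" result:
print(round(posterior_truth(0.1), 3))  # 0.64
```

This is why the prior plausibility of a hypothesis matters: a significant result for an unlikely hypothesis still leaves a sizeable chance that the claim is false.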

Using Bayesian inference

When using the Bayesian method, determining the value of the prior can be very controversial. You can choose an uninformative prior, under which every model has equal probability and the posterior is therefore driven almost entirely by the likelihood, or an informative prior, which encodes previously collected information about the world and expresses specific, definite beliefs about a variable. With lots of data, any reasonable prior does a good job of estimating the variable, because the likelihood dominates; with fairly little data, all priors do a pretty poor job. With an intermediate amount of data, a strong informative prior can influence the posterior too much and produce an estimate that does not approach the actual parameter value. In this case, the uninformative prior is more useful, because the resulting estimate lies much closer to the actual value.
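A standard way to see this effect is the conjugate Beta-Binomial model, where the posterior mean has a closed form. The sketch below assumes a hypothetical experiment with 16 successes in 20 trials and compares a flat Beta(1, 1) prior with a strong Beta(50, 50) prior centred on 0.5; the numbers are illustrative.

```python
def beta_posterior_mean(a, b, successes, n):
    """Posterior mean of a Binomial proportion under a Beta(a, b) prior."""
    return (a + successes) / (a + b + n)

# Hypothetical data: 16 successes out of 20 trials (observed rate 0.8).
successes, n = 16, 20

uninformative = beta_posterior_mean(1, 1, successes, n)   # flat Beta(1, 1) prior
informative = beta_posterior_mean(50, 50, successes, n)   # strong prior centred on 0.5

print(round(uninformative, 3), round(informative, 3))  # 0.773 0.55
```

With this intermediate amount of data, the strong informative prior pulls the estimate far below the observed rate, while the uninformative prior stays close to it — exactly the situation described above.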

As you might begin to see, implementing and specifying priors is a difficult task, especially when working with more complicated models with high-dimensional parameter spaces. Choosing an appropriate model is key for both Frequentist and Bayesian inference: in the Frequentist setting, the model determines the degrees of freedom, and in the Bayesian setting, it influences the outcome of the posterior. Therefore, being clear about model selection procedures will help in establishing reproducible research.

The importance of openness

Openness and transparency about model selection, and complete reporting of all relevant aspects of the analysis, also help to prevent questionable statistical practices such as multiplicity. Multiplicity occurs when you test many hypotheses simultaneously, test one hypothesis many times or in multiple ways, draw inferences about only a subset of parameters (selected after seeing the observed values), or use any other method that all but guarantees a chance observation will appear to strongly support a hypothesis. There is a diverse nomenclature for these activities: you might have heard of “file-drawer publication bias”, which is publishing only “interesting” results while nonsignificant results remain unpublished; “HARKing”, which is defining a hypothesis after the results are known; and “p-hacking”, which is applying multiple statistical analyses and reporting only a statistically significant result without fully reporting how it was obtained. Completely reporting the analytical procedure and the choices made will therefore also help prevent multiplicity, in addition to making it possible to reproduce, and thereby support, a statistical claim.
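A minimal simulation shows why multiplicity matters: even when every null hypothesis is true, testing many of them at alpha = 0.05 will, on average, produce spurious "discoveries" in about 5% of the tests. The sketch below runs 100 hypothetical null experiments and compares the raw threshold with a Bonferroni-corrected one (one standard remedy); all numbers are illustrative assumptions.

```python
import math
import random

random.seed(1)

def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, assuming known sd = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# Simulate 100 independent "experiments" in which the null is true everywhere.
n_tests = 100
pvals = [two_sided_p([random.gauss(0.0, 1.0) for _ in range(30)])
         for _ in range(n_tests)]

raw_hits = sum(p < 0.05 for p in pvals)                    # spurious "discoveries"
bonferroni_hits = sum(p < 0.05 / n_tests for p in pvals)   # corrected threshold

print(raw_hits, bonferroni_hits)
```

Reporting every test that was run, rather than only the significant ones, lets readers apply exactly this kind of correction.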

The way forward

Although statistics is a very important ingredient in the reproducibility crisis, many factors other than statistical methodology play a role. Currently, most data analysis is performed by people who are not trained as statisticians. In addition, these people belong to a scientific community in which the entire reward structure of the scientific enterprise does not work in favor of reproducibility. To be seen as “productive” in terms of number of publications, scientists may feel pressured to cut corners at the expense of rigorous methodology. Having said that, considerable improvement in reproducibility can be achieved by clearly and comprehensively describing the statistical analysis methodology in research articles. Often, the methodology is described incompletely, code is not supplied or the appropriate references to code or software are missing, and very little is written about the steps undertaken to perform the analysis and statistical inference. To create a reproducible study, it is therefore important to describe the methodology completely, to supply the code or appropriate references to the code and software used, and to document the steps undertaken to perform the analysis and statistical inference.

The trickiest part of the current reproducibility crisis is the pressure to produce fast results without reflection on the methods used. Educating scientists about the variety of methods, and promoting open research practices such as those of the Center for Open Science (COS), are important steps forward.

July 2018
 
