Jantine Broek

Computational and AI/ML Scientist



Statistical Inference & Reproducibility
Part I: Statistical Inference

Scientists have struggled to reproduce experiments. Science's ability to reproduce experiments is a core value: when experiments cannot be replicated, their conclusions cannot be confirmed, and the theories on which those conclusions rest are hard to support. Reproducibility refers to drawing a qualitatively similar conclusion from an entire study, and can be separated into different areas, such as methodological, empirical, computational, and statistical or inferential. Science thrives when researchers verify the results of others, and reproducibility is a way to establish the truth of a research claim.

In other words, when a finding can be reliably reproduced, it is likely to be true; when it cannot be reproduced, the claims based on it will be disputed. One cause of poor reproducibility is said to be the increased complexity of experiments and statistics, and the lack of proper understanding of the limitations of statistical significance as a criterion for knowledge claims. Another cause is the incomplete or selective reporting of analytical procedures and choices. The latter, however, can also be seen as a secondary effect of the former: a proper understanding of statistical inference is needed to know what to report, which models were considered, and which inference steps were taken to reach the outcome.

In questions of science, the authority of a thousand is not worth the humble reasoning of a single individual.

- Galileo Galilei

This blog series will address some key issues of reproducibility and provide knowledge about statistical inference strategies that are important to take into account in your own work. To discuss the relation between reproducibility and statistical inference, we first need to understand what statistical inference is and how the two main inference approaches, Frequentist and Bayesian, differ. This insight will not only allow you to increase the reproducibility of your own work, but also to quickly evaluate the weight of evidence in any other study. When more scientists are informed about strategies to improve reproducibility, research transparency will increase, and a higher proportion of studies will address knowledge gaps instead of exploring blind alleys caused by a lack of transparency and reproducibility.

Statistical inference

In general, to make an inference is to reach a conclusion by taking steps in reasoning that are based on premises. When we make a statistical inference, we reach conclusions about some population for which we haven't collected, or can't collect, all of the data. The population distribution is assumed to represent the distribution of all units, or "total information". To reach a conclusion about the population, we learn from a subset of cases, or a sample. For example, we might infer the average height of giraffes in Botswana overall by sampling from that population and looking at the average height of giraffes in our smaller sample. Naturally, we won't assume that the average height of giraffes in the overall population is the same as the average height of giraffes in our sample. But we use the sampled data and the formal rules of inference to characterize exactly what we can say about the larger population. We may be interested in different properties of the population distribution, such as the mean, median, or standard deviation of some measurement like height, years of formal education, or prevalence of a genetic mutation. These descriptive properties of a model for the population distribution, such as the mean, are called parameters. A parameter is simply a summary characteristic of a population distribution.

[Figure: sampleDistr]
The top panel shows the population distribution for a continuous variable, with a mean parameter, μ, of 50 and a standard deviation, 𝛔, of 10. From this population we draw sample data (second panel; note how the population mean μ compares with the sample mean x̄ of one data sample), whose distribution is called the sample distribution. The third panel shows the sampling distribution of the sample mean; since just one sample was drawn in this case, there is no standard deviation (std dev) in that sampling distribution yet. The gray line in the third panel is an overlay of the normal distribution, which the sampling distribution is expected to follow after sampling more data from the population distribution. These graphs are from Sampling Distribution for the Sample Mean from Art of Stats. Click on the link and play around with the many different population distributions and explore the sampling distribution. For example, draw multiple data samples and see how the sampling distribution approaches the normal distribution.
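The behaviour the figure describes can be reproduced with a short simulation. The sketch below (plain Python, standard library only; the seed, sample size, and number of samples are arbitrary choices) draws many samples from a Normal(50, 10) population and shows that the sample means cluster around μ with a spread of roughly 𝛔/√n:

```python
import random
import statistics

random.seed(42)

MU, SIGMA = 50, 10   # population parameters, as in the figure
N = 30               # observations per sample
N_SAMPLES = 2000     # number of repeated samples

# Draw many samples and record each sample mean.
sample_means = [
    statistics.mean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(N_SAMPLES)
]

# The sampling distribution of the sample mean centres on mu, with a
# spread (the standard error) of roughly sigma / sqrt(n).
print(round(statistics.mean(sample_means), 1))   # close to 50
print(round(statistics.stdev(sample_means), 1))  # close to 10 / sqrt(30), about 1.8
```

Increasing N_SAMPLES makes the histogram of `sample_means` hug the normal overlay more tightly, which is exactly what the Art of Stats app demonstrates interactively.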

Unless we have every piece of possible data regarding what we are looking at, we cannot know the parameters exactly. We can, however, estimate the parameters by sampling data from the population. These samples from the population also form a distribution, called the sample distribution, which is the distribution of the observations that we actually make. Because we are only interested in sample data that is representative of the population from which it was drawn, we assume that the sample distribution is a reflection of the population distribution. Therefore, the statistical model chosen to represent the sample distribution is similar to the presumed statistical model for the population distribution.

The parameters are estimated from sample data using quantities called statistics. A statistic is a random variable, which means it takes different values for different samples. The meaning of the term "statistic" is parallel to the meaning of the term "parameter": both characterize distributions, but only the values of the statistics are known. Let's go back to our giraffe example. Here we took a sample of the total population of giraffes in Botswana, as there is a good chance that we are not able to measure all the giraffes in Botswana. When we calculate our sample mean, x̄, and standard deviation, s, we are calculating statistics. We assume that the sample is fairly representative of the population, so with the statistics calculated from the sample we can approximate the population mean, μ, and standard deviation, 𝛔. These population parameters are the true values we want to know, and the parameter μ represents the average height inferred for all giraffes in Botswana. To make it easier to remember: population and parameter both start with a p, while sample and statistic both start with an s.
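As a rough sketch of the giraffe example (the population values μ = 5.0 m and 𝛔 = 0.5 m, the seed, and the sample size are made up for illustration), we can simulate measuring a sample and computing the statistics x̄ and s as estimates of the parameters:

```python
import random
import statistics

random.seed(1)

# Hypothetical population: heights (in metres) of all giraffes in
# Botswana, assumed here to be Normal with mu = 5.0 and sigma = 0.5.
MU, SIGMA = 5.0, 0.5

# We can only measure a sample, not the whole population.
sample = [random.gauss(MU, SIGMA) for _ in range(40)]

# Statistics computed from the sample:
x_bar = statistics.mean(sample)   # sample mean, estimates mu
s = statistics.stdev(sample)      # sample standard deviation, estimates sigma

print(f"x̄ = {x_bar:.2f} m (true μ = {MU})")
print(f"s  = {s:.2f} m (true σ = {SIGMA})")
```

With a different seed you get different values of x̄ and s, which is precisely what it means for a statistic to be a random variable.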

There are two main difficulties with using sample data sets. First, you have to generalize from a finite data set, as you draw from just a subset of the population. Second, the observed data is corrupted by noise. The noise is an error in the model, as the model abstracts from the real world by simplifying things. To account for that simplification, we add an error term to the model of the sample distribution that captures the noise of the measurement. Therefore, there is some uncertainty in estimating the population parameter from sample data. The assumptions made about the population from which you drew sample data are the first set of premises in statistical inference. They lead to the selection of a statistical model that represents the data-collection process.
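This "parameter plus error term" view can be sketched in a few lines. Here the assumed model is xᵢ = μ + εᵢ with Gaussian noise εᵢ (all numbers are illustrative), and the standard error expresses the uncertainty in estimating μ from a finite, noisy sample:

```python
import random
import statistics

random.seed(7)

TRUE_MU = 50.0    # the population parameter we want to estimate
NOISE_SD = 10.0   # spread of the error term epsilon
N = 25            # finite sample size

# Model of the data-collection process: x_i = mu + epsilon_i,
# where epsilon_i ~ Normal(0, NOISE_SD).
observations = [TRUE_MU + random.gauss(0, NOISE_SD) for _ in range(N)]

estimate = statistics.mean(observations)
# The standard error quantifies the uncertainty of the estimate;
# it shrinks with sqrt(N) as we collect more data.
std_error = statistics.stdev(observations) / N ** 0.5

print(f"estimate of μ = {estimate:.1f} ± {std_error:.1f}")
```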

[Figure: ErrorsIandII]
This is a visualization of how the Type I and Type II errors change depending on the μ of the Null Hypothesis, the μ of the Alternative Hypothesis, the sample size, the standard deviation, and the Type I error rate α (here α is 0.05). The red shaded area is the Type I error, and the blue shaded area is the Type II error. In the Errors and Power app you can change these variables to see how they affect the Type I and Type II errors (courtesy of Art of Stats).
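A Monte Carlo version of what the app shows can be sketched with a one-sided z-test. All the concrete values below (the two hypothesized means, 𝛔 = 10, n = 30, and the number of trials) are arbitrary illustrative choices:

```python
import random

random.seed(0)

MU0, MU1 = 50, 54     # Null and Alternative Hypothesis means
SIGMA, N = 10, 30     # known population sd, sample size
Z_CRIT = 1.645        # one-sided critical value for alpha = 0.05
TRIALS = 5000

def reject(mu):
    """Draw one sample with mean mu; one-sided z-test of H0: mu = MU0."""
    x_bar = sum(random.gauss(mu, SIGMA) for _ in range(N)) / N
    z = (x_bar - MU0) / (SIGMA / N ** 0.5)
    return z > Z_CRIT

# Type I error: rejecting H0 when H0 is true (should be near alpha).
type_i = sum(reject(MU0) for _ in range(TRIALS)) / TRIALS
# Type II error: failing to reject H0 when the alternative is true.
type_ii = sum(not reject(MU1) for _ in range(TRIALS)) / TRIALS

print(f"Type I ≈ {type_i:.3f}, Type II ≈ {type_ii:.3f}")
```

As in the app, moving MU1 further from MU0, increasing N, or shrinking SIGMA all reduce the Type II error while the Type I error stays pinned near α.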

Another set of premises are the propositions you make based on the model, i.e. hypotheses about the value of a population parameter, such as those that give rise to the Null Hypothesis, which states that there is no relationship between two groups (or no association among groups). After proposing a hypothesis, you deduce the properties of the underlying distribution given your sample data. The conclusion of a statistical inference is a statistical proposition, such as a point estimate that best approximates some parameter of interest, an interval estimate (such as a confidence interval), the rejection of a hypothesis, or the classification of data points into groups.
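As a minimal illustration of these kinds of statistical propositions, the sketch below computes a point estimate, a normal-approximation 95% confidence interval, and a z statistic for a made-up sample and a made-up Null Hypothesis value:

```python
import math
import statistics

# Hypothetical sample (any numeric measurements would do).
sample = [52.1, 48.3, 55.0, 49.7, 51.2, 53.8, 47.9, 50.5, 54.1, 49.0]
mu_0 = 48.0   # Null Hypothesis: the population mean equals 48

n = len(sample)
x_bar = statistics.mean(sample)              # point estimate of mu
se = statistics.stdev(sample) / math.sqrt(n) # standard error of the mean

# Interval estimate: 95% confidence interval, using z = 1.96.
ci = (x_bar - 1.96 * se, x_bar + 1.96 * se)

# Hypothesis test: reject H0 at alpha = 0.05 if |z| > 1.96.
z = (x_bar - mu_0) / se
print(f"x̄ = {x_bar:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), z = {z:.2f}")
```

For a sample this small, a t-based interval and test (e.g. `scipy.stats.ttest_1samp`) would be the more careful choice; the z approximation is used here only to keep the sketch dependency-free.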

Nowadays, the assumptions made for statistical inference can be roughly split into two kinds: those of Frequentist and those of Bayesian inference. This divergence came about during the previous century, as computers grew in computing power alongside growing knowledge about the probability distributions of populations. Although these two views are sometimes described as opposing, they share some overlap, and different research hypotheses lend themselves better to either Frequentist or Bayesian inference. In the next post, we will continue discussing statistical inference and reproducibility by zooming in on strategies for Frequentist inference and Bayesian inference. Knowing these two approaches is essential to understand what each strategy tells you about your data and how you can use these skills to make your studies more reproducible.

November 2017
 
