DATA ANALYSIS AND VISUALIZATION IN R, WINTER 2025 EDITION



Statistical hypotheses

A statistical hypothesis is an assumption regarding the unknown distribution of the examined feature, the truth of which we want to test on the basis of a sample.

A parametric hypothesis is a hypothesis that concerns only the parameters of an unknown distribution. The remaining hypotheses are called nonparametric.

A simple hypothesis specifies an exact value for the parameter. A composite hypothesis specifies a range of values.

In practice, we never consider only one hypothesis; we always test one hypothesis against another.

Always two, there are. No more. No less. - Master Yoda

The hypothesis of our interest is called the alternative hypothesis and is denoted \(H_1\). We test \(H_1\) against the relevant null hypothesis, denoted \(H_0\).

Examples:

  1. \(H_0\!: \mu = 0\) against \(H_1\!: \mu \neq 0\) (parametric; a simple null against a composite alternative);
  2. \(H_0\!:\) the sample comes from a normal distribution, against \(H_1\!:\) it does not (nonparametric).


Hypothesis testing

Hypothesis testing involves five steps:

  1. Formulating the null and alternative hypotheses. If we reject \(H_0\), we accept \(H_1\).

  2. Calculating the boundaries of the rejection region(s) \(W\) for the given test and an assumed significance level \(\alpha\).

  3. Collecting an appropriate sample for the test \(X_1,X_2,X_3,\dots,X_n\).

  4. Calculating the test statistic, i.e. the sample function \(\delta(X_1,X_2,X_3,\dots,X_n)\), and/or the p-value.

  5. Making the decision: we reject \(H_0\) if the value of the test statistic \(\delta\) lies in the rejection region \(W\), or if the p-value is less than or equal to the assumed significance level \(\alpha\). Otherwise, we say that there are no grounds to reject the null hypothesis.
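As an illustration of the five steps (a sketch, not part of the original examples), here is a one-sample t-test of \(H_0\!: \mu = 0\) against \(H_1\!: \mu \neq 0\) carried out by hand:

```r
# Step 1: H0: mu = 0 against H1: mu != 0 (two-sided)
alpha <- 0.05

# Step 2: rejection region W = (-Inf, -t*) union (t*, +Inf), where t* is
# the quantile of order 1 - alpha/2 of the t distribution with n-1 df
n <- 30
t_star <- qt(1 - alpha / 2, df = n - 1)

# Step 3: collect a sample (simulated here)
set.seed(42)
x <- rnorm(n, mean = 0.8, sd = 1)

# Step 4: test statistic delta(X1, ..., Xn) and the p-value
delta <- mean(x) / (sd(x) / sqrt(n))
p_value <- 2 * pt(-abs(delta), df = n - 1)

# Step 5: reject H0 if delta lies in W, equivalently if p_value <= alpha
abs(delta) > t_star
p_value <= alpha

# Cross-check against the built-in implementation
all.equal(p_value, t.test(x)$p.value)
```

The hand-computed p-value agrees exactly with the one reported by t.test().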

Why do we need statistical tests at all? Can’t we just compare the relevant numbers and accept or reject the alternative hypothesis on that basis? We could if we could measure the entire population. Most often, this is impossible and the only thing we have is a finite sample from the population.


Types of errors in hypothesis testing

When drawing conclusions about the entire population based on a sample, we can make one of two mistakes:

                        Reject \(H_0\)     Do not reject \(H_0\)
\(H_0\) is true         Type I error       No error
\(H_0\) is false        No error           Type II error

Unfortunately, for a given test statistic, reducing the probability of a type I error (i.e. rejecting \(H_0\) only on very strong evidence against it) increases the probability of a type II error.

In practice, we fix in advance the maximum acceptable probability of a type I error. This value is called the significance level of the test and is denoted by \(\alpha\) (usually \(\alpha=0.05\), \(\alpha=0.005\) or \(\alpha=0.001\), sometimes \(\alpha=0.1\)).

The power of a test, \(1-\beta\), is the probability of rejecting the null hypothesis when it is false (\(\beta\) denotes the probability of a type II error). The higher the power of the test, the better. Factors affecting power include:

  1. the sample size (larger samples give higher power);
  2. the true effect size (larger deviations from \(H_0\) are easier to detect);
  3. the significance level (a stricter, i.e. smaller, \(\alpha\) lowers the power);
  4. the variability of the data (more noise means lower power).
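Power can be computed directly with the built-in power.t.test() function from the stats package, which makes the influence of these factors visible (the parameter values below are illustrative):

```r
# Power of a two-sided one-sample t-test for a true shift of 0.5 sd
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")$power

# A larger sample increases the power...
power.t.test(n = 80, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")$power

# ...while a stricter significance level decreases it
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.005,
             type = "one.sample")$power
```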


Test statistic, rejection region(s) and p-value

The way we calculate the test statistic \(\delta\) and determine the rejection region \(W\) depends on the specific statistical test. It is worth remembering, however, that the value of the test statistic is a function of the sample and does not depend on the significance level \(\alpha\). The rejection region \(W\) is the opposite: its boundaries are determined as quantiles (typically of order \(1-\alpha\) or \(1-\alpha/2\)) of the appropriate probability distribution, so it depends on \(\alpha\) but not on the sample data.
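A minimal sketch of this asymmetry for a two-sided z-test: the boundaries of \(W\) follow from \(\alpha\) alone, before any data are seen.

```r
# Boundary of the two-sided rejection region |Z| > z*,
# computed purely from alpha as a quantile of N(0, 1)
qnorm(1 - 0.05 / 2)   # alpha = 0.05: z* is approximately 1.96
qnorm(1 - 0.01 / 2)   # alpha = 0.01: z* is approximately 2.58
```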

The most difficult concept to understand is the p-value. By definition, it is the smallest significance level at which, for the given value of the test statistic, we would reject the null hypothesis \(H_0\). It follows that the p-value depends on the value of the test statistic, and hence on the sample.

The p-value is the probability of obtaining the observed results, or results more extreme, under the assumption that the null hypothesis is true. Informally, it measures how easily the effect observed in the data could have arisen by pure chance if no such effect existed in the population.
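A short sketch of both readings for a z statistic: the two-sided p-value is the tail probability beyond the observed value, and it is also the smallest \(\alpha\) at which that value leads to rejection.

```r
z_obs <- 2.1                               # observed test statistic
p_value <- 2 * (1 - pnorm(abs(z_obs)))     # P(|Z| >= |z_obs|) under H0
p_value                                    # approximately 0.036

# p_value is the smallest significance level at which we reject:
abs(z_obs) > qnorm(1 - 0.05 / 2)           # alpha = 0.05 > p_value: reject
abs(z_obs) > qnorm(1 - 0.01 / 2)           # alpha = 0.01 < p_value: do not reject
```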


Further reading

The following books provide more information about tests of statistical hypotheses and can be found on the Internet:

  1. Probability and Statistics for Engineering and the Sciences, J. L. Devore
  2. Probability and Statistical Inference, R. V. Hogg, E. A. Tanis, D. L. Zimmerman
  3. Statistics Using R with Biological Examples, K. Seefeld, E. Linder


\(\chi^2\) test

One of the most frequently used tests for checking the agreement of experimental data with theoretical expectations. In R it is implemented by the function chisq.test(x, p), which takes a vector of observed counts (a histogram) as its first argument and the theoretical probabilities as p. The null hypothesis \(H_0\) in this case is that the observed proportions (counts divided by the total number of counts) are equal to the theoretical probabilities.

x <- c (100,50,20,10,5)
p1 <- c (0.5, 0.2, 0.2, 0.05, 0.05)
chisq.test (x, p = p1)
## 
##  Chi-squared test for given probabilities
## 
## data:  x
## X-squared = 15, df = 4, p-value = 0.004701
p2 <- c (0.5, 0.3, 0.1, 0.05, 0.05)
chisq.test (x, p = p2)
## 
##  Chi-squared test for given probabilities
## 
## data:  x
## X-squared = 3.2883, df = 4, p-value = 0.5108
p <- x/sum(x)
chisq.test (x, p = p)
## 
##  Chi-squared test for given probabilities
## 
## data:  x
## X-squared = 0, df = 4, p-value = 1
plot (p)
lines (p1, col="red")
lines (p2, col="blue")

NOTE 1: This test applies only to discrete (categorical or binned) data.

NOTE 2: The probabilities must sum to 1; otherwise we get an error. They can, however, be rescaled automatically by passing rescale.p = TRUE.

h <- hist (rpois (200, lam = 3), breaks = (0:15)-0.5)

chisq.test (h$counts, p = dpois (h$mids, lam = 3))
## Error in chisq.test(h$counts, p = dpois(h$mids, lam = 3)): probabilities must sum to 1.
chisq.test (h$counts, p = dpois (h$mids, lam = 3), rescale.p = TRUE)
## Warning in chisq.test(h$counts, p = dpois(h$mids, lam = 3), rescale.p = TRUE):
## Chi-squared approximation may be incorrect
## 
##  Chi-squared test for given probabilities
## 
## data:  h$counts
## X-squared = 12.115, df = 14, p-value = 0.5971
plot (h$mids, h$density, ylim = range(h$density, dpois (h$mids, lam = 3)))
lines (h$mids, dpois (h$mids, lam = 3), col="red", lwd=2)
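The "Chi-squared approximation may be incorrect" warning above appears because the expected counts in the tail bins are very small. A common remedy (sketched here, with the pooling threshold chosen purely for illustration) is to pool sparse bins so that every expected count is reasonably large:

```r
set.seed(123)
x <- rpois(200, lambda = 3)

# Pool all values >= 7 into a single bin so no expected count is tiny
obs <- table(factor(pmin(x, 7), levels = 0:7))
p   <- c(dpois(0:6, lambda = 3), 1 - ppois(6, lambda = 3))  # sums to 1 exactly

res <- chisq.test(obs, p = p)
res$expected   # all expected counts are now >= 5
res            # no approximation warning
```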


Kolmogorov-Smirnov test

The Kolmogorov–Smirnov statistic quantifies the distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution (which must be continuous), or between the empirical distribution functions of two samples (which may be discrete, continuous, or mixed). The null hypothesis is that the sample is drawn from the reference distribution (in the one-sample case) or that both samples are drawn from the same distribution (in the two-sample case).

x <- rnorm (100)
y <- rnorm (100)
ks.test (x, y)
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.13, p-value = 0.3667
## alternative hypothesis: two-sided
y <- rnorm (100, 1, 2)
ks.test (x, y)
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.37, p-value = 2.267e-06
## alternative hypothesis: two-sided
ks.test (x, "pnorm", 0, 1)
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.11545, p-value = 0.139
## alternative hypothesis: two-sided
x <- rpois (100, lam = 5)
y <- rpois (100, lam = 5)
ks.test (x, y)
## Warning in ks.test.default(x, y): p-value will be approximate in the presence
## of ties
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.08, p-value = 0.9062
## alternative hypothesis: two-sided
y <- rpois (100, lam = 2)
ks.test (x, y)
## Warning in ks.test.default(x, y): p-value will be approximate in the presence
## of ties
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  x and y
## D = 0.54, p-value = 4.335e-13
## alternative hypothesis: two-sided
# The result of the following test will be incorrect because the Poisson distribution is discrete
ks.test (x, "ppois", lam = 5)
## Warning in ks.test.default(x, "ppois", lam = 5): ties should not be present for
## the one-sample Kolmogorov-Smirnov test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.17596, p-value = 0.00409
## alternative hypothesis: two-sided
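This failure is systematic rather than accidental: for discrete data the empirical and reference distribution functions jump at the same points, which inflates the statistic \(D\) to roughly the largest probability mass (here \(P(X=5)\approx 0.175\), matching the \(D\) reported above). A small simulation sketch shows that under a true \(H_0\) the test rejects far too often:

```r
set.seed(1)
# 200 samples genuinely drawn from Pois(5), each tested against Pois(5)
pvals <- replicate(200, suppressWarnings(
  ks.test(rpois(100, lambda = 5), "ppois", lambda = 5)$p.value
))
mean(pvals <= 0.05)   # rejection rate far above the nominal 0.05
```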


Test of normality and location

Apart from the Kolmogorov-Smirnov test, the Shapiro-Wilk test is another widely used and powerful statistical test for determining whether data come from a normal distribution.

shapiro.test (rnorm (100))
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100)
## W = 0.98967, p-value = 0.6376
shapiro.test (rnorm (100, 0, 5))
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100, 0, 5)
## W = 0.98798, p-value = 0.5064
shapiro.test (runif (100, 0, 1))
## 
##  Shapiro-Wilk normality test
## 
## data:  runif(100, 0, 1)
## W = 0.94475, p-value = 0.0003803

A location parameter is a parameter of a probability distribution whose change shifts the cumulative distribution function and the probability density function without changing their shape. To test whether two samples come from distributions with the same location parameter, or to test the location of the distribution of a single sample, the Wilcoxon test is recommended.

wilcox.test (rnorm (100))
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  rnorm(100)
## V = 2730, p-value = 0.482
## alternative hypothesis: true location is not equal to 0
wilcox.test (rnorm (100, mean = 10, sd = 2), mu = 10)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  rnorm(100, mean = 10, sd = 2)
## V = 2222, p-value = 0.2983
## alternative hypothesis: true location is not equal to 10
wilcox.test (rnorm (100, mean = 10, sd = 2), mu = 2)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  rnorm(100, mean = 10, sd = 2)
## V = 5050, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 2
wilcox.test (rnorm (100, mean = 1, sd = 0.5), rnorm (100, mean = 2, sd = 2))
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rnorm(100, mean = 1, sd = 0.5) and rnorm(100, mean = 2, sd = 2)
## W = 3426, p-value = 0.0001207
## alternative hypothesis: true location shift is not equal to 0
wilcox.test (rcauchy (100, location = 1))
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  rcauchy(100, location = 1)
## V = 3749, p-value = 2.59e-05
## alternative hypothesis: true location is not equal to 0
wilcox.test (rcauchy (100, location = 1, scale = 1), rcauchy (100, location = 1, scale = 2))
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  rcauchy(100, location = 1, scale = 1) and rcauchy(100, location = 1, scale = 2)
## W = 4666, p-value = 0.4151
## alternative hypothesis: true location shift is not equal to 0


Fitting univariate distributions

Using the fitdistr() function from the MASS library it is possible to estimate unknown parameters of distributions; the distribution can be either one of the built-in ones or our own probability density function. For some distributions, e.g. the normal or the exponential, it is not necessary to provide starting values for the estimated parameters.

library (MASS)

x <- rnorm (500, mean = 5, sd = 1.5)
fit1 <- fitdistr (x, "normal")
fit1
##       mean          sd    
##   4.97722553   1.62570730 
##  (0.07270384) (0.05140938)
y <- rexp (500, rate = 0.5)
fit2 <- fitdistr (y, "exponential")
fit2
##      rate   
##   0.4711439 
##  (0.0210702)
z <- rweibull (500, scale = 1.5, shape = 3)
fit3 <- fitdistr (z, "weibull", list (scale = 1, shape = 1))
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
fit3
##      scale        shape   
##   1.51401980   3.15008411 
##  (0.02262569) (0.11109156)
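As mentioned above, fitdistr() also accepts a user-supplied density. A sketch with a hand-written normal density (the names dcustom, mu and s are illustrative, not part of MASS):

```r
library(MASS)

set.seed(7)
# A custom density function: parameters are mu (location) and s (scale)
dcustom <- function(x, mu, s) exp(-(x - mu)^2 / (2 * s^2)) / (s * sqrt(2 * pi))

w <- rnorm(300, mean = 2, sd = 0.5)
fit4 <- fitdistr(w, dcustom, start = list(mu = 0, s = 1),
                 lower = c(-Inf, 1e-6))  # keep the scale positive
fit4$estimate   # close to the true values mu = 2, s = 0.5
```

Supplying lower makes fitdistr() use the L-BFGS-B optimizer, which keeps the scale parameter away from zero during the iterations.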

The object returned by the function has, among others, the fields estimate, sd and loglik. The first two contain the estimates of the distribution parameters and their standard errors, respectively; the last one is the logarithm of the likelihood function, which is maximized in an iterative adjustment of the distribution parameters.

fit1$estimate
##     mean       sd 
## 4.977226 1.625707
fit1$sd
##       mean         sd 
## 0.07270384 0.05140938
fit1$loglik
## [1] -952.4408

After fitting, one can test the hypothesis that the data actually come from the distribution with the estimated parameters. (Strictly speaking, reusing parameters estimated from the same sample makes the Kolmogorov-Smirnov p-values conservative, but the tests below still serve as a quick sanity check.)

# Normality tests
shapiro.test (x)
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.99886, p-value = 0.9893
ks.test (x, "pnorm", mean = fit1$estimate["mean"], sd = fit1$estimate["sd"])
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  x
## D = 0.017615, p-value = 0.9978
## alternative hypothesis: two-sided
# Exponential distribution test
ks.test (y, "pexp", rate = fit2$estimate["rate"])
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  y
## D = 0.030728, p-value = 0.7326
## alternative hypothesis: two-sided
# Weibull distribution test
ks.test (z, "pweibull", scale = fit3$estimate["scale"], shape = fit3$estimate["shape"])
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  z
## D = 0.019247, p-value = 0.9925
## alternative hypothesis: two-sided
# Plots
library (ggplot2)
g <- ggplot (data.frame (x,y,z))

g + 
  geom_histogram (aes(x = x, y = after_stat (density)), 
                  fill="dodgerblue", 
                  colour="black", 
                  bins = 25,
                  alpha=0.4) +
  stat_function (fun = dnorm,
                 args = as.list (fit1$estimate),
                 colour="red",
                 linewidth=1.5)

g + 
  geom_histogram(aes(x = y, y = after_stat (density)),
                 fill="limegreen",
                 colour="black",
                 bins = 25,
                 alpha=0.4) + 
  stat_function (fun = dexp,
                 args = as.list (fit2$estimate),
                 colour="red",
                 linewidth=1.5)

g + 
  geom_histogram (aes (x = z, y = after_stat (density)),
                  fill="gold",
                  colour="black",
                  bins = 25,
                  alpha=0.4) + 
  stat_function (fun = dweibull,
                 args = as.list (fit3$estimate),
                 colour="red",
                 linewidth=1.5)