::p_load(
pacman# File import
rio, # File locator
here, # get overview of data
skimr, # data management + ggplot2 graphics,
tidyverse, # summary statistics and tests
gtsummary, # statistics
rstatix, # correlation analayis for numeric variables
corrr, # adding totals and percents to tables
janitor, # converting tables to HTML
flextable )
18 Simple statistical tests
This page demonstrates how to conduct simple statistical tests using base R, rstatix, and gtsummary.
- T-test
- Shapiro-Wilk test
- Wilcoxon rank sum test
- Kruskal-Wallis test
- Chi-squared test
- Correlations between numeric variables
…many other tests can be performed, but we showcase just these common ones and link to further documentation.
Each of the above packages bring certain advantages and disadvantages:
- Use base R functions to print a statistical outputs to the R Console
- Use rstatix functions to return results in a data frame, or if you want tests to run by group
- Use gtsummary if you want to quickly print publication-ready tables
18.1 Preparation
Load packages
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load()
from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library()
from base R. See the page on R basics for more information on R packages.
Import data
We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import()
function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).
# import the linelist
<- import("linelist_cleaned.rds") linelist
The first 50 rows of the linelist are displayed below.
18.2 base R
You can use base R functions to conduct statistical tests. The commands are relatively simple and results will print to the R Console for simple viewing. However, the outputs are usually lists and so are harder to manipulate if you want to use the results in subsequent operations.
T-tests
A t-test, also called “Student’s t-Test”, is typically used to determine if there is a significant difference between the means of some numeric variable between two groups. Here we’ll show the syntax to do this test depending on whether the columns are in the same data frame.
Syntax 1: This is the syntax when your numeric and categorical columns are in the same data frame. Provide the numeric column on the left side of the equation and the categorical column on the right side. Specify the dataset to data =
. Optionally, set paired = TRUE
, and conf.level =
(0.95 default), and alternative =
(either “two.sided”, “less”, or “greater”). Enter ?t.test
for more details.
## compare mean age by outcome group with a t-test
t.test(age_years ~ gender, data = linelist)
Welch Two Sample t-test
data: age_years by gender
t = -21.344, df = 4902.3, p-value < 2.2e-16
alternative hypothesis: true difference in means between group f and group m is not equal to 0
95 percent confidence interval:
-7.571920 -6.297975
sample estimates:
mean in group f mean in group m
12.60207 19.53701
Syntax 2: You can compare two separate numeric vectors using this alternative syntax. For example, if the two columns are in different data sets.
t.test(df1$age_years, df2$age_years)
You can also use a t-test to determine whether a sample mean is significantly different from some specific value. Here we conduct a one-sample t-test with the known/hypothesized population mean as mu =
:
t.test(linelist$age_years, mu = 45)
Shapiro-Wilk test
The Shapiro-Wilk test can be used to determine whether a sample came from a normally-distributed population (an assumption of many other tests and analysis, such as the t-test). However, this can only be used on a sample between 3 and 5000 observations. For larger samples a quantile-quantile plot may be helpful.
shapiro.test(linelist$age_years)
Wilcoxon rank sum test
The Wilcoxon rank sum test, also called the Mann–Whitney U test, is often used to help determine if two numeric samples are from the same distribution when their populations are not normally distributed or have unequal variance.
## compare age distribution by outcome group with a wilcox test
wilcox.test(age_years ~ outcome, data = linelist)
Wilcoxon rank sum test with continuity correction
data: age_years by outcome
W = 2501868, p-value = 0.8308
alternative hypothesis: true location shift is not equal to 0
Kruskal-Wallis test
The Kruskal-Wallis test is an extension of the Wilcoxon rank sum test that can be used to test for differences in the distribution of more than two samples. When only two samples are used it gives identical results to the Wilcoxon rank sum test.
## compare age distribution by outcome group with a kruskal-wallis test
kruskal.test(age_years ~ outcome, linelist)
Kruskal-Wallis rank sum test
data: age_years by outcome
Kruskal-Wallis chi-squared = 0.045675, df = 1, p-value = 0.8308
Chi-squared test
Pearson’s Chi-squared test is used in testing for significant differences between categorical croups.
## compare the proportions in each group with a chi-squared test
chisq.test(linelist$gender, linelist$outcome)
Pearson's Chi-squared test with Yates' continuity correction
data: linelist$gender and linelist$outcome
X-squared = 0.0011841, df = 1, p-value = 0.9725
18.3 rstatix package
The rstatix package offers the ability to run statistical tests and retrieve results in a “pipe-friendly” framework. The results are automatically in a data frame so that you can perform subsequent operations on the results. It is also easy to group the data being passed into the functions, so that the statistics are run for each group.
Summary statistics
The function get_summary_stats()
is a quick way to return summary statistics. Simply pipe your dataset to this function and provide the columns to analyse. If no columns are specified, the statistics are calculated for all columns.
By default, a full range of summary statistics are returned: n, max, min, median, 25%ile, 75%ile, IQR, median absolute deviation (mad), mean, standard deviation, standard error, and a confidence interval of the mean.
%>%
linelist ::get_summary_stats(age, temp) rstatix
# A tibble: 2 × 13
variable n min max median q1 q3 iqr mad mean sd se
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age 5802 0 84 13 6 23 17 11.9 16.1 12.6 0.166
2 temp 5739 35.2 40.8 38.8 38.2 39.2 1 0.741 38.6 0.977 0.013
# ℹ 1 more variable: ci <dbl>
You can specify a subset of summary statistics to return by providing one of the following values to type =
: “full”, “common”, “robust”, “five_number”, “mean_sd”, “mean_se”, “mean_ci”, “median_iqr”, “median_mad”, “quantile”, “mean”, “median”, “min”, “max”.
It can be used with grouped data as well, such that a row is returned for each grouping-variable:
%>%
linelist group_by(hospital) %>%
::get_summary_stats(age, temp, type = "common") rstatix
# A tibble: 12 × 11
hospital variable n min max median iqr mean sd se ci
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Central Hos… age 445 0 58 12 15 15.7 12.5 0.591 1.16
2 Central Hos… temp 450 35.2 40.4 38.8 1 38.5 0.964 0.045 0.089
3 Military Ho… age 884 0 72 14 18 16.1 12.4 0.417 0.818
4 Military Ho… temp 873 35.3 40.5 38.8 1 38.6 0.952 0.032 0.063
5 Missing age 1441 0 76 13 17 16.0 12.9 0.339 0.665
6 Missing temp 1431 35.8 40.6 38.9 1 38.6 0.97 0.026 0.05
7 Other age 873 0 69 13 17 16.0 12.5 0.422 0.828
8 Other temp 862 35.7 40.8 38.8 1.1 38.5 1.01 0.034 0.067
9 Port Hospit… age 1739 0 68 14 18 16.3 12.7 0.305 0.598
10 Port Hospit… temp 1713 35.5 40.6 38.8 1.1 38.6 0.981 0.024 0.046
11 St. Mark's … age 420 0 84 12 15 15.7 12.4 0.606 1.19
12 St. Mark's … temp 410 35.9 40.6 38.8 1.1 38.5 0.983 0.049 0.095
You can also use rstatix to conduct statistical tests:
T-test
Use a formula syntax to specify the numeric and categorical columns:
%>%
linelist t_test(age_years ~ gender)
# A tibble: 1 × 10
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 age_… f m 2807 2803 -21.3 4902. 9.89e-97 9.89e-97 ****
Or use ~ 1
and specify mu =
for a one-sample T-test. This can also be done by group.
%>%
linelist t_test(age_years ~ 1, mu = 30)
# A tibble: 1 × 7
.y. group1 group2 n statistic df p
* <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 age_years 1 null model 5802 -84.2 5801 0
If applicable, the statistical tests can be done by group, as shown below:
%>%
linelist group_by(gender) %>%
t_test(age_years ~ 1, mu = 18)
# A tibble: 3 × 8
gender .y. group1 group2 n statistic df p
* <chr> <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 f age_years 1 null model 2807 -29.8 2806 7.52e-170
2 m age_years 1 null model 2803 5.70 2802 1.34e- 8
3 <NA> age_years 1 null model 192 -3.80 191 1.96e- 4
Shapiro-Wilk test
As stated above, sample size must be between 3 and 5000.
%>%
linelist head(500) %>% # first 500 rows of case linelist, for example only
shapiro_test(age_years)
# A tibble: 1 × 3
variable statistic p
<chr> <dbl> <dbl>
1 age_years 0.917 6.67e-16
Wilcoxon rank sum test
%>%
linelist wilcox_test(age_years ~ gender)
# A tibble: 1 × 9
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
1 age_years f m 2807 2803 2829274 3.47e-74 3.47e-74 ****
Kruskal-Wallis test
Also known as the Mann-Whitney U test.
%>%
linelist kruskal_test(age_years ~ outcome)
# A tibble: 1 × 6
.y. n statistic df p method
* <chr> <int> <dbl> <int> <dbl> <chr>
1 age_years 5888 0.0457 1 0.831 Kruskal-Wallis
Chi-squared test
The chi-square test function accepts a table, so first we create a cross-tabulation. There are many ways to create a cross-tabulation (see Descriptive tables) but here we use tabyl()
from janitor and remove the left-most column of value labels before passing to chisq_test()
.
%>%
linelist tabyl(gender, outcome) %>%
select(-1) %>%
chisq_test()
# A tibble: 1 × 6
n statistic p df method p.signif
* <dbl> <dbl> <dbl> <int> <chr> <chr>
1 5888 3.53 0.473 4 Chi-square test ns
Many many more functions and statistical tests can be run with rstatix functions. See the documentation for rstatix online here or by entering ?rstatix.
18.4 gtsummary
package
Use gtsummary if you are looking to add the results of a statistical test to a pretty table that was created with this package (as described in the gtsummary section of the Descriptive tables page).
Performing statistical tests of comparison with tbl_summary
is done by adding the add_p
function to a table and specifying which test to use. It is possible to get p-values corrected for multiple testing by using the add_q
function. Run ?tbl_summary
for details.
Chi-squared test
Compare the proportions of a categorical variable in two groups. The default statistical test for add_p()
when applied to a categorical variable is to perform a chi-squared test of independence with continuity correction, but if any expected call count is below 5 then a Fisher’s exact test is used.
%>%
linelist select(gender, outcome) %>% # keep variables of interest
tbl_summary(by = outcome) %>% # produce summary table and specify grouping variable
add_p() # specify what test to perform
1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_na_value_to_level()` on `outcome` column before passing to `tbl_summary()`.
Characteristic |
Death, N = 2,582 1 |
Recover, N = 1,983 1 |
p-value 2 |
---|---|---|---|
gender | >0.9 | ||
f | 1,227 (50%) | 953 (50%) | |
m | 1,228 (50%) | 950 (50%) | |
Unknown | 127 | 80 | |
1
n (%) |
|||
2
Pearson’s Chi-squared test |
T-tests
Compare the difference in means for a continuous variable in two groups. For example, compare the mean age by patient outcome.
%>%
linelist select(age_years, outcome) %>% # keep variables of interest
tbl_summary( # produce summary table
statistic = age_years ~ "{mean} ({sd})", # specify what statistics to show
by = outcome) %>% # specify the grouping variable
add_p(age_years ~ "t.test") # specify what tests to perform
1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_na_value_to_level()` on `outcome` column before passing to `tbl_summary()`.
Characteristic |
Death, N = 2,582 1 |
Recover, N = 1,983 1 |
p-value 2 |
---|---|---|---|
age_years | 16 (12) | 16 (13) | 0.6 |
Unknown | 32 | 28 | |
1
Mean (SD) |
|||
2
Welch Two Sample t-test |
Wilcoxon rank sum test
Compare the distribution of a continuous variable in two groups. The default is to use the Wilcoxon rank sum test and the median (IQR) when comparing two groups. However for non-normally distributed data or comparing multiple groups, the Kruskal-wallis test is more appropriate.
%>%
linelist select(age_years, outcome) %>% # keep variables of interest
tbl_summary( # produce summary table
statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (this is default so could remove)
by = outcome) %>% # specify the grouping variable
add_p(age_years ~ "wilcox.test") # specify what test to perform (default so could leave brackets empty)
1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_na_value_to_level()` on `outcome` column before passing to `tbl_summary()`.
Characteristic |
Death, N = 2,582 1 |
Recover, N = 1,983 1 |
p-value 2 |
---|---|---|---|
age_years | 13 (6, 23) | 13 (6, 23) | 0.8 |
Unknown | 32 | 28 | |
1
Median (IQR) |
|||
2
Wilcoxon rank sum test |
Kruskal-wallis test
Compare the distribution of a continuous variable in two or more groups, regardless of whether the data is normally distributed.
%>%
linelist select(age_years, outcome) %>% # keep variables of interest
tbl_summary( # produce summary table
statistic = age_years ~ "{median} ({p25}, {p75})", # specify what statistic to show (default, so could remove)
by = outcome) %>% # specify the grouping variable
add_p(age_years ~ "kruskal.test") # specify what test to perform
1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_na_value_to_level()` on `outcome` column before passing to `tbl_summary()`.
Characteristic |
Death, N = 2,582 1 |
Recover, N = 1,983 1 |
p-value 2 |
---|---|---|---|
age_years | 13 (6, 23) | 13 (6, 23) | 0.8 |
Unknown | 32 | 28 | |
1
Median (IQR) |
|||
2
Kruskal-Wallis rank sum test |
18.5 Correlations
Correlation between numeric variables can be investigated using the tidyverse
corrr package. It allows you to compute correlations using Pearson, Kendall tau or Spearman rho. The package creates a table and also has a function to automatically plot the values.
<- linelist %>%
correlation_tab select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>% # keep numeric variables of interest
correlate() # create correlation table (using default pearson)
# print correlation_tab
# A tibble: 6 × 7
term generation age ct_blood days_onset_hosp wt_kg ht_cm
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 generation NA -2.22e-2 0.179 -0.288 -0.0302 -0.00942
2 age -0.0222 NA 0.00849 -0.000635 0.833 0.877
3 ct_blood 0.179 8.49e-3 NA -0.600 -0.00636 0.0181
4 days_onset_hosp -0.288 -6.35e-4 -0.600 NA 0.0153 -0.00953
5 wt_kg -0.0302 8.33e-1 -0.00636 0.0153 NA 0.884
6 ht_cm -0.00942 8.77e-1 0.0181 -0.00953 0.884 NA
## remove duplicate entries (the table above is mirrored)
<- correlation_tab %>%
correlation_tab shave()
## view correlation table
correlation_tab
# A tibble: 6 × 7
term generation age ct_blood days_onset_hosp wt_kg ht_cm
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 generation NA NA NA NA NA NA
2 age -0.0222 NA NA NA NA NA
3 ct_blood 0.179 0.00849 NA NA NA NA
4 days_onset_hosp -0.288 -0.000635 -0.600 NA NA NA
5 wt_kg -0.0302 0.833 -0.00636 0.0153 NA NA
6 ht_cm -0.00942 0.877 0.0181 -0.00953 0.884 NA
## plot correlations
rplot(correlation_tab)
18.6 Resources
Much of the information in this page is adapted from these resources and vignettes online: