Appendix: More statistics and visualizations

Correlations

Correlation is a measure of how two variables change together. There are many different variants, the most popular being the Pearson and the Spearman correlation. Both can be calculated with the cor() function in base R and tested for significance with cor.test(). Here they are in action on the iris dataset:

cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
cor.test(iris$Sepal.Length, iris$Petal.Length, method="spearman")
Warning in cor.test.default(iris$Sepal.Length, iris$Petal.Length, method =
"spearman"): Cannot compute exact p-value with ties

    Spearman's rank correlation rho

data:  iris$Sepal.Length and iris$Petal.Length
S = 66429, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8818981 

The warning message above indicates that there are tied (duplicated) values in the data; with ties, the exact p-value of the Spearman test cannot be computed, and an approximation is used instead.
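
If the warning bothers you, you can request the approximate test explicitly; here is a minimal sketch using the exact argument of cor.test():

# exact = FALSE asks for the approximate test and silences the warning
cor.test(iris$Sepal.Length, iris$Petal.Length,
         method = "spearman", exact = FALSE)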

The cor() function can also calculate the correlation matrix – that is, the correlation of every variable with every other variable. This is useful for visualizing the relationships between variables. Here is how you can do it:

cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

The diagonal of the matrix is always 1, because a variable is always perfectly correlated with itself. The matrix is symmetric, because the correlation between \(x\) and \(y\) is the same as the correlation between \(y\) and \(x\). We can visualize this matrix using pheatmap:

library(pheatmap)
M <- cor(iris[,1:4])
pheatmap(M, scale="none")

The scale="none" argument prevents pheatmap from scaling the data by rows or by columns – scaling makes no sense for a symmetric correlation matrix.

As you can see, the one oddball in the iris dataset is the Sepal.Width variable, which is only weakly correlated with the other three.

There is another package to visualize correlation matrices, called corrplot. It is more flexible and can be used to visualize the correlation matrix in many different ways. Here is an example:

library(corrplot)
corrplot 0.94 loaded
corrplot(M, method="color")

There are many cool ways in which corrplot can visualize the correlation matrix. You can check them out in the package documentation.

corrplot(M, method="ellipse", 
         type="upper", tl.pos="d")
corrplot(M, add = TRUE, type = 'lower', 
         method = 'number',
         col="black", diag = FALSE, 
         tl.pos = 'n', cl.pos = 'n')

Correcting for multiple testing

When you run multiple tests, you increase the chance of finding a false positive. If two data sets do not differ and you run a test 100 times, on average 5 of those tests will show a significant difference at the 0.05 level. This is called the multiple testing problem.
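
A quick simulation illustrates this; the sketch below draws both samples from the very same distribution, so every "significant" result is a false positive (the exact count will vary from run to run):

set.seed(1)
# 100 t-tests comparing two samples from the same distribution
pvals_sim <- replicate(100, t.test(rnorm(20), rnorm(20))$p.value)
sum(pvals_sim < 0.05)   # on average, about 5 false positives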

Therefore, in order to trust the results of your tests, you need to correct for multiple testing. There are basically two main approaches to this:

  • Family-wise error rate (FWER) correction, which controls the chance of making at least one false positive. Basically, we want the corrected p-values to mean just what the regular ones do – the probability of making a false positive. The most popular method for this is the Bonferroni correction, which divides the significance threshold by the number of tests.
  • False discovery rate (FDR) correction, which controls the proportion of false positives among all significant results. The most popular method for this is the Benjamini-Hochberg (BH) correction.

The former is very conservative: while it indeed ensures that the corrected p-values mean what you think they mean, it also introduces a huge number of type II errors – false negatives.

The BH correction is more relaxed and is often used in high-throughput experiments in biology. Since the corrected value is not really a p-value, it is better to refer to it as a q-value or FDR value. An FDR of, say, 0.05 means that among the results with an FDR of 0.05 or less, at most 5% are expected to be false positives.

Both of these corrections can be done with the p.adjust() function in R. Say, we make a number of comparisons using the iris dataset:

library(tidyverse)
sv <- iris |> filter(Species != "setosa") |>
  mutate(Species=factor(Species))
pvals <- sapply(1:4, function(i) {
  t.test(sv[,i] ~ sv$Species)$p.value
})
names(pvals) <- colnames(sv)[1:4]
pvals
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
1.866144e-07 1.819483e-03 4.900288e-22 2.111534e-25 

OK, so a lot has happened above that you have not seen before. First, why do we convert Species to a factor? Species is already a factor in the original data frame, but it has three levels: “setosa”, “versicolor”, and “virginica”. When we filter out the “setosa” species, the levels remain unchanged, and t.test() will complain that the grouping factor has more than two levels. Therefore, we call factor() again to create a factor with only the two remaining levels.
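
By the way, the dedicated droplevels() function does the same job and states the intent more explicitly; this is just an equivalent alternative to the mutate() call above:

# drops the unused "setosa" level from all factor columns
sv <- iris |> filter(Species != "setosa") |> droplevels()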

The sapply() function applies a function to each value of a vector or list. Here, we apply an anonymous function – that is, a function defined without giving it a name – to every value in the vector 1:4. The anonymous function takes a single value from the vector as its parameter and returns the p-value of the t-test between the corresponding column of the sv data frame and the Species variable.

Another new thing that you have not seen previously is the ~ sign. Rather than running a t-test on two vectors, we run it on a formula. We will cover formulas in a moment; here it tells the t-test that the Species vector defines the groups, while the column sv[,i] is the variable to be tested.
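
For comparison, here is the same test for the first column written with two explicit vectors instead of a formula – it returns exactly the same p-value:

t.test(sv$Sepal.Length[sv$Species == "versicolor"],
       sv$Sepal.Length[sv$Species == "virginica"])$p.value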

We run 4 comparisons, and assuming that there were no differences between the species, we would expect 5% of the tests to be significant at the 0.05 level – which means that the probability of having at least one false positive in 4 tests is \(1 - (1 - 0.05)^4 = 0.185\). Let’s see if we can do something about it:

p.adjust(pvals, method="bonferroni")
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
7.464578e-07 7.277934e-03 1.960115e-21 8.446138e-25 
p.adjust(pvals, method="BH")
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
2.488193e-07 1.819483e-03 9.800575e-22 8.446138e-25 

As you can see, the corrected p-values are higher than the original ones, and the BH correction is less conservative (i.e., the adjusted p-values are smaller) than the Bonferroni correction.
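
You can reproduce the Bonferroni correction by hand – it simply multiplies each p-value by the number of tests and caps the result at 1:

# same result as p.adjust(pvals, method="bonferroni")
pmin(pvals * length(pvals), 1)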

Linear models with lm()

Simple linear models

We end with an example of linear regression; the most important reason for including it is to introduce formulas in R.

A formula is a weird little construct. It contains variables linked by the ~ (tilde) sign. On the left side of the ~ is the dependent variable (the “y”); on the right side are the independent variables (the “x”, or covariates).

Depending on the particular function, the formula can mean different things and have a slightly different syntax. For example, in a package like DESeq2 from Bioconductor, there is nothing on the left side – because DESeq2 understands that the formula applies to every single gene in the input matrix.

Here we will use the lm() function, the basic linear regression function included in base R. Somewhat similarly to tidyverse functions, you can use column names of a data frame in the formula and specify the data frame with the data parameter. We will use it to fit a regression of the mathematical form

\[ y = a + b \cdot x + \epsilon \]

where \(a\) is the intercept, \(b\) is the slope, and \(\epsilon\) is the error.

As an example, we will use the mtcars dataset, which contains information about, you guessed it, cars (it is quite old – it comes from 1974). We will use two of its columns: mpg (miles per gallon, so fuel usage given in the American way) and hp (horsepower). We will try to predict fuel usage from horsepower. However, rather than modeling mpg directly, we will use its inverse – gallons per mile, gpm – multiplied by 100 (so, effectively, gallons per 100 miles).

mtcars$gpm <- 100/mtcars$mpg
model <- lm(gpm ~ hp, data=mtcars)

The lm() function returns a model object, which contains a lot of information of no immediate use to us. To get at the coefficients and p-values, it is best to use either the summary() function or the tidy() function from the broom package.

summary(model)

Call:
lm(formula = gpm ~ hp, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.19776 -0.56724 -0.07017  0.24239  3.12691 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.741786   0.456521   6.006 1.37e-06 ***
hp          0.018277   0.002827   6.464 3.84e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.079 on 30 degrees of freedom
Multiple R-squared:  0.5821,    Adjusted R-squared:  0.5682 
F-statistic: 41.79 on 1 and 30 DF,  p-value: 3.839e-07
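
The tidy() function from broom, mentioned above, returns the same coefficient table as a data frame, which is convenient for further processing; a minimal sketch:

library(broom)
tidy(model)    # coefficients, standard errors, t statistics and p-values as a tibble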

We have two rows in the “Coefficients” table because we had two coefficients in our model: \(a\), the intercept, and \(b\), the slope. The \(b\) coefficient answers the question: by how much does the fuel usage (in gallons per 100 miles) increase when we increase the horsepower by 1?
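
You can verify this interpretation with predict(): two hypothetical cars that differ by 1 hp differ in predicted fuel usage by exactly \(b\):

# predicted fuel usage for two hypothetical cars with 100 and 101 hp
pred <- predict(model, newdata = data.frame(hp = c(100, 101)))
diff(pred)    # equals the slope b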

We can plot how the model fits our data with ggplot2:

library(ggplot2)
a <- coefficients(model)[1]
b <- coefficients(model)[2]
ggplot(mtcars, aes(x = hp, y = gpm)) +
  geom_point() +
  geom_abline(intercept = a, slope = b)

The coefficients() function extracts the vector with the coefficients from the model.

But wait. The intercept, \(a\), is the fuel usage when the car’s horsepower is 0. Logically, the fuel usage of a car with 0 horsepower should be precisely 0, and not almost 3. Any value other than 0 simply doesn’t make sense. We can tell lm() that the intercept should be 0 quite easily:

model_0 <- lm(gpm ~ 0 + hp, data=mtcars)
summary(model_0)

Call:
lm(formula = gpm ~ 0 + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6238 -0.5268  0.9824  1.4068  2.7063 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
hp 0.033703   0.001725   19.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.576 on 31 degrees of freedom
Multiple R-squared:  0.9249,    Adjusted R-squared:  0.9225 
F-statistic: 381.7 on 1 and 31 DF,  p-value: < 2.2e-16

As you can see, we now have only one coefficient – because we forced the other one to be 0 with the 0 + syntax.
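
An equivalent way to remove the intercept is to subtract 1 in the formula; the following sketch fits exactly the same model:

# "- 1" removes the intercept, just like "0 +"
model_0b <- lm(gpm ~ hp - 1, data = mtcars)
coefficients(model_0b)    # identical to coefficients(model_0)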

Additional covariates

The nice thing about this type of approach is that it can be easily extended to model much more complex situations. For example, what else does the fuel usage depend on? One of the columns in the mtcars dataset is the weight of the car. We can add it to the model like this:

model_2 <- lm(gpm ~ 0 + hp + wt, data=mtcars)
summary(model_2)

Call:
lm(formula = gpm ~ 0 + hp + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0183 -0.4441  0.1447  0.5905  1.1068 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
hp 0.007443   0.002364   3.148   0.0037 ** 
wt 1.330059   0.113669  11.701 1.05e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6791 on 30 degrees of freedom
Multiple R-squared:  0.9865,    Adjusted R-squared:  0.9856 
F-statistic:  1096 on 2 and 30 DF,  p-value: < 2.2e-16

Again we are using the 0 + syntax to force the intercept to be 0 – which makes sense, since a car with no horsepower and no weight should use no fuel. The summary() function shows us that the weight of the car is also significant in predicting fuel usage, although the \(p\)-value for the hp coefficient is now much higher – not surprisingly, since horsepower and weight are correlated.

But which model is better? If you look at the summaries above, you will find the R-squared value. It is a measure of how well the model fits the data: the closer it is to 1, the better the fit. For model_0, the R-squared is 0.92, and for model_2 it is 0.99.

However, adding more variables to a model always increases the fit, leading to the situation we call overfitting: while increasing the fit to this particular dataset, we decrease the model's predictive power.

model_huge <- lm(gpm ~ 0 + hp + wt + qsec + drat + disp + cyl, data=mtcars)
summary(model_huge)

Call:
lm(formula = gpm ~ 0 + hp + wt + qsec + drat + disp + cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6976 -0.4062  0.1508  0.3457  1.3327 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)   
hp    0.004814   0.004045   1.190  0.24479   
wt    1.031644   0.329317   3.133  0.00425 **
qsec -0.011301   0.072827  -0.155  0.87788   
drat  0.201028   0.272197   0.739  0.46680   
disp  0.002103   0.003253   0.646  0.52370   
cyl   0.062671   0.164361   0.381  0.70608   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6913 on 26 degrees of freedom
Multiple R-squared:  0.9879,    Adjusted R-squared:  0.9851 
F-statistic:   353 on 6 and 26 DF,  p-value: < 2.2e-16

One way we can avoid overfitting is by using another measure of model fit, the AIC (Akaike Information Criterion). The AIC() function calculates the AIC of a model; it rewards goodness of fit but penalizes the number of parameters. The lower the AIC, the better the model.

AIC(model_0)
[1] 122.8965
AIC(model_2)
[1] 69.97504
AIC(model_huge)
[1] 74.53647
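
Since model_0 is nested in model_2 (it is the same model with the wt coefficient fixed at 0), the two can also be compared directly with an F-test; a minimal sketch:

# does adding wt significantly improve the fit?
anova(model_0, model_2)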

In the above examples we have been using continuous variables, but linear models can handle almost anything. For example, we can ask how the Sepal.Length in the iris dataset depends on the Species:

model_iris <- lm(Sepal.Length ~ Species, data=iris)
summary(model_iris)

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

In fact, the above model is equivalent to an ANOVA test. The individual \(p\)-values above are actually not of immediate interest, since in ANOVA we want to first test if there is any difference between the groups, and only then test which groups differ. This can be done with the anova() function:

anova(model_iris)
Analysis of Variance Table

Response: Sepal.Length
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals 147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
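
As for testing which groups differ once the overall ANOVA is significant, a common follow-up is Tukey's HSD post-hoc test; here is a minimal sketch using aov(), which fits the same model as lm() above:

# pairwise comparisons between the species, with adjusted p-values
TukeyHSD(aov(Sepal.Length ~ Species, data = iris))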

There is much more to ANOVA, the lm() function, and to linear models in general. If you are interested, I would recommend reading the R Book by Michael J. Crawley, which is a great resource for learning statistics in R.

Interactions

Consider the following example. We have two strains, knockout (KO) and wild type (WT), and we measure something (let’s say, the expression of a gene) in two conditions, control and treatment.

We want to assess the impact of the strain on how an animal reacts to the treatment. Clearly, we can’t just compare the treatment in WT with the treatment in KO – maybe the gene expression in KO is generally higher, independent of the treatment.

It would also be incorrect to compare the control with the treatment separately in the WT and KO strains. Chances are, we will get a significant difference in one case but not the other – but that does not mean that there is a difference between the strains. Maybe the change is really present in both strains, and we simply failed to detect it in one of them. Here is what I mean:

As you can see, the treatment has a similar effect on both WT and KO (actually, I made sure of that when generating the data set), but it is not significant in the WT (p = 0.053) and significant in the KO (p = 0.00021).

Instead, what we need to figure out is whether the difference between the groups in KO is different from the difference between the groups in WT. This “difference of differences” is called an interaction. We can write it like this:

\[\textrm{Interaction} = (\textrm{treated}_{\textrm{KO}} - \textrm{control}_{\textrm{KO}}) - (\textrm{treated}_{\textrm{WT}} - \textrm{control}_{\textrm{WT}})\]

The interaction allows us to find strain-specific differences even when there is both an effect of the strain and an effect of the treatment – for example, when the average expression is generally higher in one strain.

Let us now generate a data frame in which such an interaction is present:

library(ggbeeswarm)   # provides geom_beeswarm()
n <- 10               # samples per group (consistent with the residual df in the ANOVA below)
dfi <- data.frame(nn = 1:(n*4),
  strain = rep(c("WT", "KO"), each=2*n),
  group = rep(c("control", "treatment"), each = n)) |>
  mutate(strain = factor(strain, levels=c("WT", "KO"))) |>
  mutate(value = 
    c(rnorm(n, mean=10, sd=1),
      rnorm(n, mean=10, sd=1),
      rnorm(n, mean=12, sd=1),
      rnorm(n, mean=14, sd=1)))
ggplot(dfi, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_beeswarm() +
  facet_wrap(~ strain)

As you can see, there is an effect of the strain (the KO has a higher expression on average), and an effect of the treatment – but only in the KO strain.

We can easily run an ANOVA with an interaction term in R. As before, we use a formula, but now, instead of adding the two terms strain and group (with a +), we multiply them (with a *):

model1 <- lm(value ~ strain * group, data=dfi)
anova(model1)
Analysis of Variance Table

Response: value
             Df Sum Sq Mean Sq F value    Pr(>F)    
strain        1 64.368  64.368 43.3669 1.156e-07 ***
group         1  3.044   3.044  2.0507  0.160758    
strain:group  1 15.938  15.938 10.7381  0.002329 ** 
Residuals    36 53.434   1.484                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the ANOVA of the model above, we see the strong effect of the strain. Although we do not see a significant overall effect of the group (because there is no effect in the WT strain), we do see a significant interaction – that is the strain:group term.
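
You can convince yourself that the strain:group term really captures the “difference of differences” by computing it by hand from the four group means and comparing it with the interaction coefficient of the model (a sketch; since the data are random, the exact value will vary):

# the four group means, computed directly
m_wt_c <- mean(dfi$value[dfi$strain == "WT" & dfi$group == "control"])
m_wt_t <- mean(dfi$value[dfi$strain == "WT" & dfi$group == "treatment"])
m_ko_c <- mean(dfi$value[dfi$strain == "KO" & dfi$group == "control"])
m_ko_t <- mean(dfi$value[dfi$strain == "KO" & dfi$group == "treatment"])
# difference of differences...
(m_ko_t - m_ko_c) - (m_wt_t - m_wt_c)
# ...equals the interaction coefficient of the fitted model
coef(model1)["strainKO:grouptreatment"]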

Let us now take a look at a negative example. In the data frame below, there is an effect of the strain, but also an effect of the group – in both strains the expression goes up, and it goes up by the same amount. In other words, there is no interaction between strain and group.

dfni <- data.frame(nn = 1:(n*4),
  strain = rep(c("WT", "KO"), each=2*n),
  group = rep(c("control", "treatment"), each = n)) |>
  mutate(strain = factor(strain, levels=c("WT", "KO"))) |>
  mutate(value = 
    c(rnorm(n, mean=10, sd=1),
      rnorm(n, mean=12, sd=1),
      rnorm(n, mean=13, sd=1),
      rnorm(n, mean=15, sd=1)))
ggplot(dfni, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_beeswarm() +
  facet_wrap(~ strain)

model2 <- lm(value ~ strain * group, data=dfni)
anova(model2)
Analysis of Variance Table

Response: value
             Df Sum Sq Mean Sq  F value    Pr(>F)    
strain        1 99.952  99.952 121.5962 4.279e-13 ***
group         1 26.923  26.923  32.7534 1.630e-06 ***
strain:group  1  1.582   1.582   1.9249    0.1738    
Residuals    36 29.592   0.822                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA shows now a significant effect of the group (p = 1.6e-06), and a strong effect of the strain (p = 4.3e-13), but no effect of the interaction (p = 0.17).

Image sizes in R Markdown

Unlike in other programs, you cannot simply drag images to change their size or aspect ratio in R, and you cannot click on a piece of text and choose the font size. So what can you do when you want to change the size of an image, make the fonts larger relative to the image size, and so on?

As for font size, it is possible to change individual font sizes (and font types) in both base R and ggplot2; however, these operations tend to be annoying (see here for a description of how to do it in ggplot2).

However, quite often what you want to achieve is simply to make the overall fonts larger in an image. If you are using vector graphics output (as you should), it does not really matter whether the font size is actually 12, 22 or 32 points – you can always scale the image to the desired size without losing quality. What matters is how large the fonts are relative to the nominal image size in inches.

The good news is that you can change the nominal size of an image in R very easily. You have already seen it done in section 5.3.6, Output formats, when generating images with svg() or pdf(). However, you can also change this size in your R Markdown document by providing the chunk options fig.width and fig.height:

```{r fig.width=4,fig.height=3}
ggplot(iris, 
  aes(x = Sepal.Length, 
      y = Sepal.Width, 
      color = Species)) +
  geom_point(size=3) +
  scale_color_viridis_d() +
  labs(x = "Sepal length", 
       y = "Sepal width", 
       title = "The iris data set") +
  theme_minimal()
```

```{r fig.width=8,fig.height=8}
ggplot(iris, 
  aes(x = Sepal.Length, 
      y = Sepal.Width, 
      color = Species)) +
  geom_point(size=3) +
  scale_color_viridis_d() +
  labs(x = "Sepal length", 
       y = "Sepal width", 
       title = "The iris data set") +
  theme_minimal()
```

As you can see, the only difference between these two chunks is the values of fig.width and fig.height. Since the first figure is nominally smaller, the fonts – which in reality have the same size in points – appear much larger relative to the plot.

Combining several plots into one

Quite often you will want to combine several plots into one. This can be done in R in several ways, different for base R and ggplot2.

Base R

For base R, the key function is par(). This function has many uses, but the one we focus on here is the mfrow parameter, which specifies how many rows and columns of plots you wish to have in the output. You then create the subsequent plots as you normally would, and they are placed in the grid. Here is an example with one row and two columns:

par(mfrow=c(1, 2))
hist(iris$Sepal.Length)
boxplot(iris$Sepal.Length ~ iris$Species)

There are ways and packages to fine-tune this, for example by setting different widths and heights for the columns and rows of the grid above – look up the layout() function if you need it.
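
For the record, here is a minimal layout() sketch that puts one wide plot on top and two plots side by side below it:

# a 2x2 grid in which plot 1 spans the whole first row
layout(matrix(c(1, 1,
                2, 3), nrow = 2, byrow = TRUE))
hist(iris$Sepal.Length)
boxplot(iris$Sepal.Length ~ iris$Species)
plot(iris$Sepal.Length, iris$Sepal.Width)
layout(1)   # reset to one plot per figure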

ggplot2

First, quite often in ggplot2 you can simply use facet_wrap() or facet_grid() to combine several panels into one plot. However, these panels are then all of the same type.

If you want to combine different types of plots, there are two packages that might interest you. The cowplot package allows you to combine multiple plots with the plot_grid() function.

library(cowplot)

# we save the plots into a variable 
# instead of printing them to the screen
p1 <- ggplot(iris, aes(x=Sepal.Length)) +
  geom_histogram()

p2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# plot both plots with labels in one row
plot_grid(p1, p2, 
  ncol = 2,
  labels = c('A', 'B'), 
  rel_widths = c(0.6, 0.4),
  label_size = 12)

There are no limits – you can overlay plots, you can change the relative sizes of the plots, and you can even add base R plots (for that, however, you need to install the gridGraphics package):

# use the formula notation to create a base R subplot
p3 <- ~hist(iris$Sepal.Width)

# create a one column, two-row subplot
p13 <- plot_grid(p1, p3, 
  ncol = 1,
  labels = c('A', 'B'), 
  label_size = 12)

# put the subplots together
# no need for a label for p13, 
# it already has labels
plot_grid(p13, p2, 
  labels = c("", "C"),
  ncol=2)

Another very popular package is patchwork. It is more elegant and allows a lot of extras, but this comes at the cost of slightly more complex syntax.

library(patchwork)
p3 <- ggplot(iris, aes(x=Sepal.Width)) +
  geom_histogram()
(p1 / p3) | p2
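
As an example of those extras, patchwork can add panel tags for you; a small sketch using plot_annotation(), where tag_levels = "A" labels the panels A, B, C:

# the same layout, with automatic panel tags
((p1 / p3) | p2) + plot_annotation(tag_levels = "A")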

Boxplots and violin plots vs bar plots

Bar plots are quite common in the scientific literature, despite the fact that they should actually be avoided in most scenarios (“Kick the Bar Chart Habit” 2014). Bar plots should only be used for showing counts or proportions, and the \(y\) axis should in that case always start at zero. In all other applications, box plots or, better, violin plots are preferred.

The advantages of the box plots and violin plots over bar plots are evident when the data is not normally distributed. Let us construct a small example.

n <- 250
x <- rnorm(n, mean=20, sd=1)
y <- rbeta(n, 2, 22) * 20  + 18.65

The vectors x and y are normally distributed and beta distributed, respectively, and the beta-distributed vector has been deliberately shifted and scaled so that the two have very similar standard deviations (as calculated with sd()) and almost identical medians. Here is ggplot2 code that produces a bar chart. Don't worry too much about the syntax below: it is here for demonstration purposes only, and, hopefully, you will rarely need bar plots in practice. The geom_bar() function is responsible for the bars; geom_errorbar() adds the error bars; coord_cartesian() limits the y-axis to a certain range.

df <- data.frame(value=c(mean(x), mean(y)), 
                 group=c("x", "y"),
                 sd=c(sd(x)/sqrt(n), sd(y)/sqrt(n)))
ggplot(df, aes(x=group, y=value)) + 
  geom_errorbar(aes(ymin = value - sd, ymax = value + sd), width = 0.2) +
  geom_bar(stat="identity", width=0.5, fill = "skyblue") +
  labs(x="Group", y="Value") +
  coord_cartesian(ylim = c(19.5, 20.5))

Wow, that looks really different! Unfortunately, the above figure is a lie: it suggests something that is not true.

Firstly, instead of standard deviations, which show the true spread of the data, I used the standard error of the mean (SEM), equal to \(\frac{\sigma}{\sqrt{n}}\), which shows the precision of the mean estimate and becomes very small for large data sets even when the spread (standard deviation) is large – but it looks good in a figure. Secondly, the y-axis does not start at zero, which is not acceptable for bar plots.
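
You can see how much smaller the SEM is than the actual spread by computing both for the x vector defined above:

sd(x)             # the spread of the data, roughly 1
sd(x) / sqrt(n)   # the standard error of the mean, ~16 times smaller for n = 250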

And last but not least, one should not use a bar plot for continuous data, because it does not show their real distribution. Let us now produce a box plot and a violin plot.

Before that, however, let us again make use of the cowplot package introduced above. With it, we can create composite figures consisting of several individual plots: first you create the ggplot plots and save them to variables, then you put them together with the plot_grid() function.

library(cowplot)
df <- data.frame(values = c(x, y), 
                 group = c(rep("x", n), rep("y", n)))
p1 <- ggplot(df, aes(x=group, y=values)) +
  geom_boxplot() +
  labs(x="Group", y="Value")

# violin plot
p2 <- ggplot(df, aes(x=group, y=values)) +
  geom_violin() +
  geom_boxplot(width=0.1) +
  labs(x="Group", y="Value")

# plot_grid puts the different plots together 
# ncol=2 -> two columns
# labels=... -> labels for the plots
plot_grid(p1, p2, ncol=2, labels=c("A", "B"))

As you can see, the violin plots tell a completely different story. The group y only looks larger because its mean is driven up by the long upper tail of the distribution. The medians are practically identical (the median of x is 19.97, of y 20.01), and the distributions largely overlap.

Similarly, if you were to run a t-test, which assumes normally distributed data, the p-value would be 0.00044, while a Wilcoxon test, which does not make this assumption, gives a p-value of 0.02.
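
These are the corresponding calls (a sketch; the exact p-values depend on the randomly generated data above):

t.test(x, y)$p.value        # parametric test, assumes normality
wilcox.test(x, y)$p.value   # rank-based test, no normality assumption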