Solutions to Exercises

Day 1

Water lilies (Exercise 1.12)

# plots with a blue color
plot(days, area, type="b", col="blue")

# plots with filled circles
plot(days, area, type="b", col="blue", pch=19)

# plots with labels
plot(days, area, type="b",
  xlab="Day", ylab="Area", main="Area of water lilies")

# limits on the y axis and x axis
plot(days, area, type="b",
  xlim=c(0, 8), ylim=c(0, 1))

Formula for area when the growth is by 75% per day:

\[a(n) = 0.01 \times 1.75^{n-1}\]

area_slow <- 0.01 * 1.75^(days - 1)
plot(days, area, type="b", col="blue")

# add another data series to the plot
lines(days, area_slow, type="b", col="red")

Day 2

Matrices (Exercise 2.3)

# generate random readouts
readouts <- runif(48)

# create the matrix
res_mtx <- matrix(readouts, nrow=6)

# always check your results!
dim(res_mtx)
[1] 6 8

To change the “borders” to NA:

res_mtx[1,] <- NA
res_mtx[,1] <- NA
res_mtx[nrow(res_mtx), ] <- NA
res_mtx[, ncol(res_mtx)] <- NA

Above, we could have used res_mtx[6, ] <- NA instead of res_mtx[nrow(res_mtx), ] <- NA, but the latter is more general and will keep working if you decide to use a matrix with a different layout.

Row and column names:

row_names <- c("control", "low", "medium", "high")
row_names <- paste0("Inh1_", row_names)
row_names <- c(NA, row_names, NA)
rownames(res_mtx) <- row_names

col_names <- c("control", "low", "high")
col_names <- rep(col_names, 2)

inh <- rep(c("Inh2_", "Inh3_"), each=3)
col_names <- paste0(inh, col_names)
col_names <- c(NA, col_names, NA)
colnames(res_mtx) <- col_names

Above, we use NA for row names and column names of the border row / column. We could have used another value as well, it doesn’t matter.

Selecting from the matrix:

# wells with inhibitor 3
res_mtx[ 2:5, 5:6 ]
             Inh3_control  Inh3_low
Inh1_control   0.71422992 0.6657274
Inh1_low       0.52711822 0.6521810
Inh1_medium    0.40945933 0.8049797
Inh1_high      0.07877701 0.4460096
# inhibitor 1 as well as 2
inh1 <- c("Inh1_low", "Inh1_medium", "Inh1_high")
inh2 <- c("Inh2_low", "Inh2_high")
res_mtx[ inh1, inh2 ]
             Inh2_low Inh2_high
Inh1_low    0.9786829 0.6754574
Inh1_medium 0.7269673 0.4473528
Inh1_high   0.1366912 0.1345950

Data frames (Exercise 2.5)

mtx <- matrix(rnorm(15), nrow=5)
df <- as.data.frame(mtx)

colnames(df) <- c("X", "Y", "Z")
rownames(df) <- c("a", "b", "c", "d", "e")

df$A <- rep("A", 5)

df$sequence <- seq(0, 1, length.out=5)

df
            X            Y          Z A sequence
a -1.90635742 -0.596422448  0.7761556 A     0.00
b  0.75975917  0.161859283 -0.9482953 A     0.25
c  0.06380527  0.000362048 -1.1097728 A     0.50
d -1.67180221 -1.010560011 -2.3550369 A     0.75
e  0.53106772  0.684821178 -0.5697360 A     1.00

Day 3

Deaths.xlsx (Exercise 3.5)

If you look at the file, you will find that there are 4 lines with gibberish (which you can skip with skip=4), then the header row, then 10 lines with actual data, and finally a few lines with gibberish again. Thus, we need to read 1 + 10 = 11 lines, starting with line 5, and finishing with line 16.

fn <- readxl_example("deaths.xls")

# one way: use the n_max argument
deaths <- read_excel(fn, skip=4, n_max=11)

# another way: use the excel range specification
deaths <- read_excel(fn, range="A5:F16")

Diagnosing meta_data_botched.xlsx (Exercise 3.7)

botched <- read_excel("Datasets/meta_data_botched.xlsx")
colnames(botched)
[1] "SUBJ"       "time point" "AGE"        "SEX"        "PLACEBO"   
[6] "ARM"       

First thing that we see is that one of the columns is called time point with a space in between. This is neither conforming to the other column names (which are ALL CAPS) nor it is a good idea to have spaces in column names.

summary(botched)
      SUBJ       time point            AGE                SEX           
 Min.   :4023   Length:122         Length:122         Length:122        
 1st Qu.:4232   Class :character   Class :character   Class :character  
 Median :4429   Mode  :character   Mode  :character   Mode  :character  
 Mean   :4473                                                           
 3rd Qu.:4701                                                           
 Max.   :4973                                                           
   PLACEBO              ARM           
 Length:122         Length:122        
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
library(skimr)
skim(botched)
Data summary
Name botched
Number of rows 122
Number of columns 6
_______________________
Column type frequency:
character 5
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
time point 0 1 2 5 0 8 0
AGE 0 1 2 6 0 34 0
SEX 0 1 1 8 0 12 0
PLACEBO 0 1 1 5 0 10 0
ARM 0 1 1 16 0 20 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SUBJ 0 1 4472.51 287.12 4023 4232 4429 4701 4973 ▇▇▇▅▆

OK, so we see several problems. First, the column “Age” is not numeric, but we would expect it to be. Let’s take a closer look:

age_n <- as.numeric(botched$AGE)
Warning: NAs introduced by coercion
botched$AGE[ is.na(age_n) ]
 [1] "32 ,"   "23 yrs" "28 ,"   "30 ,"   "3 7"    "23 yrs" "32 yrs" "40 yrs"
 [9] "35 yrs" "22 ,"   "30 ,"  

Right, so someone added extra text or spaces to the the actual values. And what does “3 0” mean? It could be 30, but it also could mean “3 years 0 months” – who knows?

Next, the SEX column. skim() shows that it has 12 unique values, but this is not the number we expect. What are these unique values?

unique(botched$SEX)
 [1] "Mann"     "M"        "männlich" "F"        "female"   "Female"  
 [7] "Herr"     "Male"     "male"     "Frau"     "Femal"    "Weiblich"

Maybe the data was entered by hand by different people, and they did not agree beforehand on how to enter the data. We see a similar problem also in the columns PLACEBO, ARM and time point:

unique(botched$PLACEBO)
 [1] "TRUE"  "yes"   "T"     "1"     "true"  "FALSE" "false" "0"     "no"   
[10] "F"    
unique(botched$ARM)
 [1] "Placebo"          "placebo"          "P"                "Plazebo"         
 [5] "control"          "Placibo"          "F"                "fluad vaccine"   
 [9] "A"                "Fl."              "agrippal vaccine" "agrippal"        
[13] "Agrippal vaccine" "grippal"          "AGRP."            "Fuad"            
[17] "Fluad"            "Agrippal"         "FLUAD"            "Fluad vaccine"   
unique(botched[["time point"]])
[1] "D0"    "D1"    "Day1"  "D 0"   "day 1" "day 0" "D 1"   "Day0" 

Correcting meta_data_botched.xlsx (Exercise 3.13)

Let’s start with the easiest one, the time point. First, we should fix the name of the column:

# either works
botched <- rename(botched, TIMEPOINT=`time point`)
# or
colnames(botched)[2] <- "TIMEPOINT"
colnames(botched)
[1] "SUBJ"      "TIMEPOINT" "AGE"       "SEX"       "PLACEBO"   "ARM"      

Now, fix the values in the TIMEPOINT column. We can use the toupper().

# change to upper case
tp <- toupper(botched$TIMEPOINT)

# replace day with "D"
tp <- str_replace_all(tp, "DAY", "D")

# remove all spaces
tp <- str_replace_all(tp, " *", "")

table(tp, botched$TIMEPOINT)
    
tp   D 0 D 1 D0 D1 day 0 day 1 Day0 Day1
  D0   5   0 52  0     3     0    1    0
  D1   0   5  0 50     0     2    0    4
# looks good!
botched$TIMEPOINT <- tp

The ARM column should be easy, too. We only have to pay attention to the first letter. There are two exceptions, though – sometimes control is used instead of placebo, and in some cases grippal has been used instead of agrippal.

arm <- toupper(botched$ARM)
arm <- str_replace_all(arm, "^P.*", "PLACEBO")
arm <- str_replace_all(arm, "^F.*", "FLUAD")
arm <- str_replace_all(arm, "^A.*", "AGRIPPAL")
arm <- str_replace_all(arm, "^C.*", "PLACEBO")
arm <- str_replace_all(arm, "^G.*", "AGRIPPAL")
unique(arm)
[1] "PLACEBO"  "FLUAD"    "AGRIPPAL"
botched$ARM <- arm

Proceed with the columns SEX and PLACEBO in the same way.

Finally, AGE. The simplest way would be to replace everything that is not a digit:

age <- botched$AGE
age <- str_replace_all(age, "[^0-9]", "")
age <- as.numeric(age)

# check for NAs
any(is.na(age))
[1] FALSE
# replace the original column
botched$AGE <- age

Day 4

Selecting columns (Exercise 4.1)

Solution to Exercise 4.1.

# Load the data
library(tidyverse)

results <- read_csv("Datasets/transcriptomics_results.csv")

# what columns are there?
colnames(results)
 [1] "...1"           "ProbeName"      "GeneName"       "SystematicName"
 [5] "Description"    "logFC.F.D0"     "qval.F.D0"      "logFC.F.D1"    
 [9] "qval.F.D1"      "logFC.F.D2"     "qval.F.D2"      "logFC.F.D3"    
[13] "qval.F.D3"     
# Select the columns that we need
results <- select(results, GeneName, Description, logFC.F.D1, qval.F.D1)
colnames(results) <- c("Gene", "Description", "logFC", "qval")

Alternatively, we could have specified the new column names directly in the select() function. This is useful if you want to rename only some of the columns:

results <- select(results, Gene=GeneName, 
  Description, 
  LFC=logFC.F.D1, FDR=qval.F.D1)
colnames(results)
[1] "Gene"        "Description" "LFC"         "FDR"        

Sorting by last name (Exercise 4.3)

persons <- c("Henry Fonda", "Bob Marley", "Robert F. Kennedy", 
  "Bob Dylan", "Alan Rickman")

# first, create a vector with last names only. 
# basically, remove everything before the last space
# and the space itself
lastnames <- str_replace_all(persons, ".* ", "")
lastnames
[1] "Fonda"   "Marley"  "Kennedy" "Dylan"   "Rickman"
# now get the order for the last names
ord <- order(lastnames)
ord
[1] 4 1 3 2 5
# and use it to sort the original vector
persons[ord]
[1] "Bob Dylan"         "Henry Fonda"       "Robert F. Kennedy"
[4] "Bob Marley"        "Alan Rickman"     

Logical vectors (Exercise 4.6)

significant <- tr_res$FDR < 0.05
interferons <- str_detect(tr_res$Description, "interferon")
both <- significant & interferons

# how many are there?
sum(both)
[1] 23
# how many are significant, but not interferons?
sum(significant & !interferons)
[1] 1455
gbps <- str_detect(tr_res$Gene, "^GBP")
sum(gbps & interferons)
[1] 4

The merging of two data frames (Exercise 4.9)

Solution to Exercise 4.9.

First, we load the data. Nothing fancy here. You should examine the resulting data frames with View (or by clicking on the data frame in the Environment pane in RStudio), but in the code below we simply list the available columns.

# The libraries needed
library(tidyverse)
library(readxl)

# Load the data
labresults <- read_csv("Datasets/labresults_full.csv")
targets <- read_excel("Datasets/expression_data_vaccination_example.xlsx", 
  sheet="targets")

# what columns are there?
colnames(labresults)
 [1] "USUBJID"   "SUBJ"      "LBTEST"    "LBTESTCD"  "LBCAT"     "LBORRES"  
 [7] "LBORRESU"  "LBSPEC"    "VISIT"     "Timepoint"
colnames(targets)
 [1] "USUBJID"   "SUBJ"      "Batch"     "ID"        "Timepoint" "ARMCD"    
 [7] "ARM"       "AGE"       "SEX"       "PLACEBO"  
intersect(colnames(labresults), colnames(targets))
[1] "USUBJID"   "SUBJ"      "Timepoint"

intersect

The intersect() function returns the common elements between two vectors. In this case, it returns the common column names between the two data frames. As you can see, there are USUBJID and SUBJ columns in both data frames, chances are that there are common elements between the data frame here. We will focus at the SUBJ column.

OK, do we see any common subjects between the two data frames?

# how many unique subjects are there in each data frame?
length(unique(labresults$SUBJ))
[1] 123
length(unique(targets$SUBJ))
[1] 61
# how many are common?
length(intersect(labresults$SUBJ, targets$SUBJ))
[1] 61

Right, so there are more unique subjects in the labresults data frame than in the targets data frame.

The other column that is common between the two data frames is Timepoint. That already indicates that each row – in both data frames – corresponds to a sample rather than subject. That is, we need to identify the rows not only by subject, but also by timepoint.

If you use the unique() function as above, you will find that there are only 2 unique timepoints in the targets data frame, but 23 unique timepoints in the labresults data frame.

Now first let us select only the column that we need, as mentioned in the exercise. For the targets data frame, we need the columns SUBJ and Timepoint, as well as ARM, Timepoint, AGE and SEX.

targets <- select(targets, SUBJ, Timepoint, ARM, AGE, SEX)

If you inspect the labresults data frame, you will see that it has only one column that looks like a measurement – LBORRES. Also, the column LBTEST ostensibly contains the name of the test that was performed, and LBTESTCD contains the code for the test. However, we also find the Timepoint column here as well.

labresults <- select(labresults, SUBJ, Timepoint, LBTEST, LBTESTCD, LBORRES)

Depending on what one wants to do, we can try to get one of the four types of join (inner, left, right, full) between the two data frames. However, assuming that the goal is correlate expression data with lab data, we will need an inner join, so either the tidyverse function inner_join() or the base R function merge() with default parameter values will do the job.

joined <- merge(targets, labresults, by=c("SUBJ", "Timepoint"))
dim(joined)
[1] 3782    8
colnames(joined)
[1] "SUBJ"      "Timepoint" "ARM"       "AGE"       "SEX"       "LBTEST"   
[7] "LBTESTCD"  "LBORRES"  

Yep, seems to have worked well. The resulting data frame has 3782 rows – since we have many different measurements for each person.

Day 5