Above, we could have used res_mtx[6, ] <- NA instead of res_mtx[nrow(res_mtx), ] <- NA, but the latter is more general and will keep working if you decide to use a matrix with a different layout.
X Y Z A sequence
a -1.0461064 0.6138361 0.07144751 A 0.00
b -0.5439397 -0.6508163 -0.09571267 A 0.25
c -1.8100360 0.9896839 2.06337270 A 0.50
d 0.9436682 -2.6995960 0.36789883 A 0.75
e 0.2015024 -0.5347336 -0.19902897 A 1.00
If you look at the file, you will find that there are 4 lines with gibberish (which you can skip with skip=4), then the header row, then 10 lines with actual data, and finally a few lines with gibberish again. Thus, we need to read 1 + 10 = 11 lines, starting with line 5, and finishing with line 16.
fn <-readxl_example("deaths.xls")# one way: use the n_max argumentdeaths <-read_excel(fn, skip=4, n_max=11)# another way: use the excel range specificationdeaths <-read_excel(fn, range="A5:F16")
First thing that we see is that one of the columns is called time point with a space in between. This is neither conforming to the other column names (which are ALL CAPS) nor it is a good idea to have spaces in column names.
summary(botched)
SUBJ time point AGE SEX
Min. :4023 Length:122 Length:122 Length:122
1st Qu.:4232 Class :character Class :character Class :character
Median :4429 Mode :character Mode :character Mode :character
Mean :4473
3rd Qu.:4701
Max. :4973
PLACEBO ARM
Length:122 Length:122
Class :character Class :character
Mode :character Mode :character
library(skimr)skim(botched)
Data summary
Name
botched
Number of rows
122
Number of columns
6
_______________________
Column type frequency:
character
5
numeric
1
________________________
Group variables
None
Variable type: character
skim_variable
n_missing
complete_rate
min
max
empty
n_unique
whitespace
time point
0
1
2
5
0
8
0
AGE
0
1
2
6
0
34
0
SEX
0
1
1
8
0
12
0
PLACEBO
0
1
1
5
0
10
0
ARM
0
1
1
16
0
20
0
Variable type: numeric
skim_variable
n_missing
complete_rate
mean
sd
p0
p25
p50
p75
p100
hist
SUBJ
0
1
4472.51
287.12
4023
4232
4429
4701
4973
▇▇▇▅▆
OK, so we see several problems. First, the column “Age” is not numeric, but we would expect it to be. Let’s take a closer look:
Right, so someone added extra text or spaces to the the actual values. And what does “3 0” mean? It could be 30, but it also could mean “3 years 0 months” – who knows?
Next, the SEX column. skim() shows that it has 12 unique values, but this is not the number we expect. What are these unique values?
Maybe the data was entered by hand by different people, and they did not agree beforehand on how to enter the data. We see a similar problem also in the columns PLACEBO, ARM and time point:
Now, fix the values in the TIMEPOINT column. We can use the toupper().
# change to upper casetp <-toupper(botched$TIMEPOINT)# replace day with "D"tp <-str_replace_all(tp, "DAY", "D")# remove all spacestp <-str_replace_all(tp, " *", "")table(tp, botched$TIMEPOINT)
tp D 0 D 1 D0 D1 day 0 day 1 Day0 Day1
D0 5 0 52 0 3 0 1 0
D1 0 5 0 50 0 2 0 4
# looks good!botched$TIMEPOINT <- tp
The ARM column should be easy, too. We only have to pay attention to the first letter. There are two exceptions, though – sometimes control is used instead of placebo, and in some cases grippal has been used instead of agrippal.
# Select the columns that we needresults <-select(results, GeneName, Description, logFC.F.D1, qval.F.D1)colnames(results) <-c("Gene", "Description", "logFC", "qval")
Alternatively, we could have specified the new column names directly in the select() function. This is useful if you want to rename only some of the columns:
persons <-c("Henry Fonda", "Bob Marley", "Robert F. Kennedy", "Bob Dylan", "Alan Rickman")# first, create a vector with last names only. # basically, remove everything before the last space# and the space itselflastnames <-str_replace_all(persons, ".* ", "")lastnames
[1] "Fonda" "Marley" "Kennedy" "Dylan" "Rickman"
# now get the order for the last namesord <-order(lastnames)ord
[1] 4 1 3 2 5
# and use it to sort the original vectorpersons[ord]
significant <- tr_res$FDR <0.05interferons <-str_detect(tr_res$Description, "interferon")both <- significant & interferons# how many are there?sum(both)
[1] 23
# how many are significant, but not interferons?sum(significant &!interferons)
First, we load the data. Nothing fancy here. You should examine the resulting data frames with View (or by clicking on the data frame in the Environment pane in RStudio), but in the code below we simply list the available columns.
# The libraries neededlibrary(tidyverse)library(readxl)# Load the datalabresults <-read_csv("Datasets/labresults_full.csv")targets <-read_excel("Datasets/expression_data_vaccination_example.xlsx", sheet="targets")# what columns are there?colnames(labresults)
The intersect() function returns the common elements between two vectors. In this case, it returns the common column names between the two data frames. As you can see, there are USUBJID and SUBJ columns in both data frames, chances are that there are common elements between the data frame here. We will focus at the SUBJ column.
OK, do we see any common subjects between the two data frames?
# how many unique subjects are there in each data frame?length(unique(labresults$SUBJ))
[1] 123
length(unique(targets$SUBJ))
[1] 61
# how many are common?length(intersect(labresults$SUBJ, targets$SUBJ))
[1] 61
Right, so there are more unique subjects in the labresults data frame than in the targets data frame.
The other column that is common between the two data frames is Timepoint. That already indicates that each row – in both data frames – corresponds to a sample rather than subject. That is, we need to identify the rows not only by subject, but also by timepoint.
If you use the unique() function as above, you will find that there are only 2 unique timepoints in the targets data frame, but 23 unique timepoints in the labresults data frame.
Now first let us select only the column that we need, as mentioned in the exercise. For the targets data frame, we need the columns SUBJ and Timepoint, as well as ARM, Timepoint, AGE and SEX.
If you inspect the labresults data frame, you will see that it has only one column that looks like a measurement – LBORRES. Also, the column LBTEST ostensibly contains the name of the test that was performed, and LBTESTCD contains the code for the test. However, we also find the Timepoint column here as well.
Depending on what one wants to do, we can try to get one of the four types of join (inner, left, right, full) between the two data frames. However, assuming that the goal is correlate expression data with lab data, we will need an inner join, so either the tidyverse function inner_join() or the base R function merge() with default parameter values will do the job.
joined <-merge(targets, labresults, by=c("SUBJ", "Timepoint"))dim(joined)