2 Complex data structures
2.1 Goals for today
- Complex data structures (matrices, lists and data frames)
- Logical vectors and subsetting
2.2 Logical vectors
2.2.1 Truth and falsehood
Yesterday, we briefly mentioned two special values in R: TRUE and FALSE. These are logical constants, and they are used to represent truth and falsehood, respectively. Using these values, it is possible to create logical vectors – vectors that contain only these two values.
logical_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
The FALSE value is equivalent to 0, and TRUE is equivalent to 1. That means that we can use logical vectors in arithmetic operations. One very common application of this is using sum() to count the number of TRUE values in a logical vector.
sum(logical_vector)
[1] 3
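Because TRUE counts as 1 and FALSE as 0, other arithmetic functions work on logical vectors as well. The following is a small sketch; the vector is defined again so that the snippet runs on its own, and mean() is a base R function that is not otherwise used in this section:

```r
logical_vector <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# TRUE is treated as 1 in arithmetic, so this gives 2
TRUE + TRUE

# number of TRUE values
n_true <- sum(logical_vector)

# proportion of TRUE values (3 out of 5)
prop_true <- mean(logical_vector)
prop_true
```

Taking the mean of a logical vector is a handy way of answering questions like "which fraction of my values fulfills a condition?".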
Logical vectors are incredibly useful, because they can be used to subset other objects. For example, if you have a vector of numbers, you can use a logical vector to select some of the elements.
numbers <- 1:5
numbers[ logical_vector ]
[1] 1 3 4
The real usefulness of logical vectors comes when you create them using different operators that check for equality, inequality, etc. Let’s look at that in more detail.
In R, TRUE and FALSE are the only two logical constants. However, there are also two predefined variables, T and F, which are initially set to TRUE and FALSE, respectively. Do not use them. Unlike TRUE and FALSE, they can be overwritten, which can lead to chaos and mayhem.
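To see why this matters, here is a minimal sketch of what can go wrong. The variable names keep_all and result are made up for this illustration:

```r
# T starts out as TRUE, but it is an ordinary variable...
T <- 0   # ...so nothing stops us (or someone else's code) from overwriting it

numbers <- 1:5
keep_all <- c(T, T, T, T, T)   # looks like "all TRUE", but is now five zeros
result <- numbers[keep_all]    # the index 0 selects nothing
length(result)

# TRUE itself is protected; the following line would be an error:
# TRUE <- 0

rm(T)   # remove our own T, so that T means TRUE again
```

No error or warning is shown here; the selection just silently returns nothing, which is exactly what makes such bugs hard to find.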
2.2.2 Comparison operators
There are six comparison operators in R:
- == – equality
- != – inequality
- > – greater than
- < – less than
- >= – greater than or equal to
- <= – less than or equal to
They are vectorized just like the arithmetic operators, but their result is not a number but a logical value. Take a look:
numbers <- c(42, 3, -17, 0, -2, 1)
numbers > 0
[1] TRUE TRUE FALSE FALSE FALSE TRUE
For each element of the vector, R checks if it is greater than zero. If it is, it returns TRUE, otherwise FALSE, producing, in the end, a vector containing as many elements as there were in the vector numbers. This logical vector can be used to subset the original vector.
# prints only numbers greater than 0
numbers[ numbers > 0 ]
[1] 42 3 1
# prints only numbers different from 0
numbers[ numbers != 0 ]
[1] 42 3 -17 -2 1
But wait, there is more. There are a number of functions that check the elements of a vector and return logical vectors. One of the most commonly used and most useful of these is the is.na() function, which checks if an element is “not available”. This will come in handy later, when we start reading data from files – data which is full of NAs!
numbers <- c(42, 3, NA, 0, -2, 1)
is.na(numbers)
[1] FALSE FALSE TRUE FALSE FALSE FALSE
But hey, this does not tell us which elements are NAs. It is easy to see in the example above, but what if we have a vector with a million elements? Actually, to answer which elements are NAs you can simply use the which() function:
which(is.na(numbers))
[1] 3
Fine, but what about the numbers which are not NA? What if we want to find all “good” numbers and store them for future use? In this case, we can use the ! operator, which negates the logical vector. That is, each TRUE becomes FALSE and each FALSE becomes TRUE.
nas <- is.na(numbers)
nas
[1] FALSE FALSE TRUE FALSE FALSE FALSE
# change TRUE to FALSE and FALSE to TRUE
!nas
[1] TRUE TRUE FALSE TRUE TRUE TRUE
useful_numbers <- numbers[!nas]
useful_numbers
[1] 42 3 0 -2 1
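The two ideas from this section, counting with sum() and testing with is.na(), combine nicely. A short sketch (the names n_missing, n_present and total are chosen for this example only):

```r
numbers <- c(42, 3, NA, 0, -2, 1)

# how many values are missing?
n_missing <- sum(is.na(numbers))

# how many values are present?
n_present <- sum(!is.na(numbers))

# arithmetic involving NA gives NA, so sum(numbers) would be NA;
# remove the NAs first to get the sum of the known values
total <- sum(numbers[!is.na(numbers)])
total
```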
There is one more thing to mention here. The comparison operators == and != can be used to compare strings as well¹.
¹ Actually, they can be used to compare any two objects in R. Also, the > and < operators can be used to compare strings, but the result is not always what you might expect. Can you guess what it does? Hint: pick up a dictionary… or any other alphabetically sorted list.
patient_measurements <- c(1, 16, 7, 42, 3)
patient_gender <- c("male", "female", "female", "male", "female")
patient_measurements[ patient_gender == "male" ]
[1] 1 42
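The same pattern works with != and can be combined with sum() for counting. A small sketch using the vectors from above (is_female, n_female and not_male are names made up for this example):

```r
patient_measurements <- c(1, 16, 7, 42, 3)
patient_gender <- c("male", "female", "female", "male", "female")

# which patients are female?
is_female <- patient_gender == "female"
is_female

# how many female patients are there?
n_female <- sum(is_female)

# measurements of all patients who are not male
not_male <- patient_measurements[patient_gender != "male"]
not_male
```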
2.3 Matrices
2.3.1 Creating matrices
Matrices are just what it says on the box: 2-dimensional structures, just like in mathematics. They behave in a very similar way to vectors in R; for example, they always hold elements of a single type (numeric, character, etc., but never a mixture of types). If you have your data stored in an Excel spreadsheet, chances are that different columns have different types of data. You can hardly use matrices in such a case. In fact, you will probably rarely use matrices in R, at least at the beginning – however, they are very useful for storing large data sets, for example from a transcriptomic analysis. For storing “Excel-like” data you will be using lists and data frames, which we will discuss later today. Nonetheless, we will spend some time on matrices today – because 90% of what you will learn today about matrices you will be able to use with data frames as well.
Creating a matrix is very simple. You can use the matrix() function, which takes a vector as input and reshapes it into a matrix:
mtx <- matrix(1:12, nrow = 3, ncol = 4)
print(mtx)
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
As you remember from yesterday's course, 1:12 is a vector of numbers from 1 to 12. The nrow and ncol parameters specify the number of rows and columns, respectively. You don’t really need to specify both of them – if you specify only one, R will calculate the other one for you.
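A quick sketch of leaving out one of the two parameters. The names mtx_rows and mtx_cols are made up for this example; dim() returns the numbers of rows and columns, and identical() is a base R function that checks whether two objects are exactly the same:

```r
# only the number of rows is given; R works out that 12 / 3 = 4 columns are needed
mtx_rows <- matrix(1:12, nrow = 3)
dim(mtx_rows)

# only the number of columns is given; R works out that 12 / 4 = 3 rows are needed
mtx_cols <- matrix(1:12, ncol = 4)

# both calls produce exactly the same matrix
identical(mtx_rows, mtx_cols)
```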
As you have noticed above, by default the matrix() function fills the matrix by columns. If you want to fill it by rows, you can use the byrow parameter:
mtx <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
print(mtx)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
Here we use a special value – TRUE – which is a logical constant. It is equivalent to 1, but it is more readable. You can also use FALSE (or 0) to specify that you want to fill the matrix by columns (this is the default).
It is also possible to create a large matrix by passing only one value (typically 0 or NA) and specifying the number of rows and columns. In that case, you have to specify both nrow and ncol.
mtx <- matrix(0, nrow = 3, ncol = 4)
You can also create a matrix by binding vectors together, either by rows or by columns. The rbind() function binds vectors by rows, and the cbind() function binds vectors by columns. Always make sure that the vectors you bind have the same length and the same element type.
a <- 1:3
b <- 4:6
mtx <- rbind(a, b)
print(mtx)
[,1] [,2] [,3]
a 1 2 3
b 4 5 6
mtx <- cbind(a, b)
print(mtx)
a b
[1,] 1 4
[2,] 2 5
[3,] 3 6
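What happens if the element types differ? The sketch below binds a numeric and a character vector (s and mixed are names made up for this example). Since a matrix can hold only one type, R quietly converts the numbers to strings:

```r
a <- 1:3
s <- c("x", "y", "z")

# numbers and strings cannot share a matrix, so the numbers become strings
mixed <- cbind(a, s)
mixed

# the whole matrix is now of type character
typeof(mixed)
```

No warning is issued, which is why it pays off to check the types of the vectors before binding them.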
R matrices are very powerful for linear algebra operations. If you ever learned linear algebra, you will find that R matrices can do pretty much everything you learned in class. For example, you can multiply matrices, transpose them, invert them, calculate determinants, etc. We will not cover these operations in this course.
Just like vectors have a length which you can check with the length() function, matrices have dimensions – the number of rows and the number of columns. You can access them using the dim() function, which returns a vector of length 2 (row and column number), and with the functions nrow() and ncol(), which return the number of rows and columns, respectively.
dim(mtx)
[1] 3 2
nrow(mtx)
[1] 3
ncol(mtx)
[1] 2
2.3.2 Accessing matrix elements
For vectors, we have used the square brackets ([]) to access elements. Same with matrices, really; however, we have two dimensions now. Think about that: with vectors we could only select one or more elements. With matrices, it should be possible to select an element, a number of elements from a row (or the whole row), a number of elements from a column (or the whole column), or even a submatrix. All this is possible using the square brackets.
mtx <- matrix(1:12, nrow = 3, ncol = 4)
print(mtx)
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
# Accessing the element in the third row and the second column
mtx[3, 2]
[1] 6
# Accessing first three numbers in the second column
mtx[1:3, 2]
[1] 4 5 6
# Accessing second to fourth numbers in the first row
mtx[1, 2:4]
[1] 4 7 10
When we talk about plotting, the first dimension, “x”, is usually the horizontal one, and the second dimension, “y”, is the vertical one. However, in R matrices, just like in real algebra, the first dimension corresponds to the rows, and the second dimension corresponds to the columns. You need to get used to it – it’s the same for data frames which you will be using extensively.
If you select more than one row and more than one column, you will get a new matrix – albeit smaller than the original one.
# Selecting the first two rows and the last two columns
# This will create a 2 x 2 matrix
mtx[1:2, 3:4]
[,1] [,2]
[1,] 7 10
[2,] 8 11
If you want to access whole rows or columns, you do not need to specify anything – simply leave an empty space before (for selecting whole columns) or after (for selecting whole rows) the comma.
# Selecting the whole second row
mtx[2, ]
[1] 2 5 8 11
# Selecting the whole third column
mtx[, 3]
[1] 7 8 9
# Rows 1 and 3 - returns a matrix
sel <- c(1, 3)
mtx[sel, ]
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 3 6 9 12
This last example shows that just like with vectors, we can use a variable to make our selection.
- Rows first, then columns
- If you select a single column or a single row, you will get a vector
- If you select more than one row and more than one column, you will get a (smaller) matrix
- If you select rows or columns that are not present, you will get a “subscript out of bounds” error
- Vectors and matrices always have only one data type (string, numerical, logical etc.)
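The rules above can be checked directly. This is only a sketch: is.matrix() tests whether an object is a matrix, and tryCatch() with conditionMessage() are base R functions, not covered in this chapter, that are used here only to capture the error message instead of stopping the script. The names row2, sub and msg are made up for this example.

```r
mtx <- matrix(1:12, nrow = 3, ncol = 4)

# a single row is returned as a vector, not as a matrix
row2 <- mtx[2, ]
is.matrix(row2)

# two rows and two columns give a smaller matrix
sub <- mtx[1:2, 3:4]
is.matrix(sub)

# there is no fifth column, so this selection fails;
# we store the error message in a variable
msg <- tryCatch(mtx[1, 5], error = function(e) conditionMessage(e))
msg
```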
2.3.3 Row and column names
Just like with named vectors, we can name rows and columns of a matrix. However, for this we need two different functions: rownames() and colnames() (for row names and column names, duh).
rownames(mtx) <- c("first", "second", "third")
colnames(mtx) <- LETTERS[1:4]
mtx["first", ]
A B C D
1 4 7 10
mtx[ , "A"]
first second third
1 2 3
In the code above, we use the LETTERS constant, which is a vector containing all the letters of the English alphabet. Just like the constant pi, LETTERS is available in R by default, along with its lower-case counterpart, letters. It is useful for labeling.
Unfortunately, we only have 26 letters in the alphabet, so what can we do with a matrix that has more columns than that? Well, we can use the same trick that Excel uses: after Z, we have AA, AB, etc.
To create such a long vector, we will use two functions: rep() and paste0().
2.3.4 Using rep() to generate column names
The rep() function is a small but very useful utility function that repeats the elements of a vector a given number of times. It either repeats the whole vector several times, or, using the each parameter, repeats each element a given number of times:
abc <- LETTERS[1:3] # A, B, C
abc3 <- rep(abc, 3)
abc3
[1] "A" "B" "C" "A" "B" "C" "A" "B" "C"
a3b3c3 <- rep(abc, each = 3)
a3b3c3
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
In the code above, we created two vectors, each of length \(3 \times 3 = 9\); the first one goes “A, B, C, A, B, C, …”, and the second one goes “A, A, A, B, B, …”. To reach our goal, we have to paste together the first element of a3b3c3 with the first element of abc3, the second with the second, and so on:
a3b3c3 | abc3 | result |
---|---|---|
A | A | AA |
A | B | AB |
A | C | AC |
B | A | BA |
B | B | BB |
B | C | BC |
To do this, we will use the paste0() function, which concatenates strings and is vectorized, so it does exactly what we need.
col_names <- paste0(a3b3c3, abc3)
Of course, we need all the letters (we used only three in the example above for demonstration purposes).
n <- length(LETTERS)
abc3 <- rep(LETTERS, n)
a3b3c3 <- rep(LETTERS, each = n)
col_names <- paste0(a3b3c3, abc3)
length(col_names)
[1] 676
head(col_names)
[1] "AA" "AB" "AC" "AD" "AE" "AF"
tail(col_names)
[1] "ZU" "ZV" "ZW" "ZX" "ZY" "ZZ"
The head() and tail() functions are very useful for inspecting the beginning and the end of a very large object such as a vector, matrix, or data frame. They are also a quick way of checking whether the operation you just performed did what you expected it to do.
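To close the loop, here is a sketch of how such names could be attached to a matrix with more than 26 columns. The names wide, two_letters and excel_names are made up for this example, and the matrix is just filled with zeros:

```r
# a matrix with 2 rows and 30 columns, more columns than there are letters
wide <- matrix(0, nrow = 2, ncol = 30)

# two-letter combinations in Excel order: AA, AB, AC, ...
two_letters <- paste0(rep(LETTERS, each = 26), rep(LETTERS, 26))

# single letters first, then the two-letter combinations
excel_names <- c(LETTERS, two_letters)

# use only as many names as there are columns
colnames(wide) <- excel_names[1:ncol(wide)]
tail(colnames(wide))
```

With 26 single letters and 676 two-letter combinations, this scheme covers matrices with up to 702 columns.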
2.4 Lists
2.4.1 Creating lists
The objects that we have discussed so far – vectors and matrices – can only hold one type of data; you cannot mix strings with numbers, for example. This is obviously a problem – quite frequently you need to store both numbers and strings in one object. This is where lists come in.
Lists are created using the list() function. Lists have elements, just like vectors; but unlike vectors, every element can be of any possible type. It can be a vector, of course, but can also be a matrix, a data frame, even a function – or another list. Actually, it is quite common to have a list of lists (or even list of lists of lists) in R.
lst <- list(numbers=1:3, strings=c("a", "bu"),
  matrix = matrix(1:4, nrow = 2),
  logical = c(TRUE, FALSE))
lst
$numbers
[1] 1 2 3
$strings
[1] "a" "bu"
$matrix
[,1] [,2]
[1,] 1 3
[2,] 2 4
$logical
[1] TRUE FALSE
2.4.2 Accessing elements of a list
Like vectors, lists can be named. In fact, they very often are. However, accessing them is a bit different than with vectors.
You can use the [ and ] square brackets to access elements of a list, but this produces another list, containing only the selected elements. Thus, if you type lst["numbers"], you will get a list with one element, which is the vector of numbers:
lst["numbers"]
$numbers
[1] 1 2 3
You can see that it is a list because of this weird $ (dollar) sign, which we will discuss in a moment. However, you can also check its type directly:
typeof(lst["numbers"])
[1] "list"
If, however, you want to work with the actual vector that is stored in the “numbers” slot of the list, you need to use one of two approaches. The first approach is to use double square brackets:
lst[["numbers"]]
[1] 1 2 3
typeof(lst[["numbers"]])
[1] "integer"
That requires a lot of typing: four square brackets and then, in addition, the quote marks. But programmers are lazy, and therefore, we have a shortcut: the $ sign. It is used to access elements of a list by name:
lst$numbers
[1] 1 2 3
You will use this construct a lot in R.
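The three ways of accessing list elements can be compared side by side. This is a sketch with a shortened version of the list from above; first and both are names made up for this example:

```r
lst <- list(numbers = 1:3, strings = c("a", "bu"))

# single brackets: the result is a (shorter) list
typeof(lst["numbers"])

# double brackets and $ give the element itself, so these two are the same
identical(lst[["numbers"]], lst$numbers)

# elements can also be selected by position
first <- lst[[1]]
first

# single brackets can select several elements at once
both <- lst[c("numbers", "strings")]
length(both)
```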
The elements of a list behave exactly like regular variables. If an element is a vector, you can do with it all the things you can do with a vector; if it is a matrix, you can treat it as a matrix (because it is a matrix).
patient_data <- list(name = "John Doe", age = 42,
  measurements = runif(5))
patient_data
$name
[1] "John Doe"
$age
[1] 42
$measurements
[1] 0.07455539 0.55447963 0.11861809 0.91279924 0.47057615
patient_data$measurements[1]
[1] 0.07455539
patient_data$measurements[1] <- 42
patient_data$measurements * 3
[1] 126.0000000 1.6634389 0.3558543 2.7383977 1.4117284
Since the lists are named, there must be a way to access and modify these names. And, of course, there is: the names() function.
names(patient_data)
[1] "name" "age" "measurements"
names(patient_data) <- c("patient_name", "patient_age", "patient_measurements")
names(patient_data)[3] <- "crp_measurement"
If you type the name of a list or data frame variable in a script, followed by $, and press the TAB key, RStudio will show you all the elements of the list (or columns of the data frame) to choose from. No need for tedious typing!
2.4.3 Lists as return values
A common application of lists has something to do with functions. Remember that a function can return only one object? But what if a function would like to return several things at once? It can return a list!
We will now run our first statistical test. First, we need to generate two groups of measurements to compare. We will simulate them using the rnorm() function, which produces normally distributed random numbers. The function takes additional parameters, mean and sd, which specify the mean and the standard deviation of the distribution, respectively. That allows us to ensure that the groups differ:
group_a <- rnorm(10, mean = 10, sd = 2)
group_b <- rnorm(10, mean = 14, sd = 2)
Of course, before running any statistical test we usually want to have a look at the data, to see if the groups differ visually. We can do this by using the boxplot() function, which creates a boxplot of the data.
boxplot(group_a, group_b)
Boxplots are a great way to visualize the data. The whiskers show the minimum and maximum values (excluding outliers, which are shown as separate points), the box shows the interquartile range (25th to 75th percentile), and the thick line in the middle of the box shows the median. There are better ways, which we will discuss on Day 5, but still, boxplots are pretty cool.
OK, now we run a t-test. We will use the t.test() function for that, and store the result in a variable called t_test.
t_test <- t.test(group_a, group_b)
typeof(t_test)
[1] "list"
As you can see, the result is a list. However, when we print it to the console, it does not look like one:
t_test
Welch Two Sample t-test
data: group_a and group_b
t = -5.5608, df = 16.908, p-value = 3.519e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.368135 -3.313697
sample estimates:
mean of x mean of y
9.09666 14.43758
This is because R has a special function for formatting the results of a t-test², so that it is easier to read.
² OK, this is way beyond the scope of this course, but the result returned by t.test has the class “htest”. This is a list, but also it is something special, which is why R knows to print it in a different way. R is a functional language, but it also allows object oriented (OO) programming.
Nonetheless, we can access the elements of the list in the usual way:
t_test$p.value
[1] 3.518805e-05
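The p-value is only one of several elements stored in the result. The sketch below lists them with names() and extracts two more. The call to set.seed() with an arbitrary number is an addition made here so that the random numbers are the same each time the code is run; the actual values will differ from the ones printed above.

```r
set.seed(123)   # arbitrary seed, makes the simulated data reproducible
group_a <- rnorm(10, mean = 10, sd = 2)
group_b <- rnorm(10, mean = 14, sd = 2)
t_test <- t.test(group_a, group_b)

# which elements does the result contain?
names(t_test)

# confidence interval for the difference in means
t_test$conf.int

# the means of the two groups
t_test$estimate
```

Pulling single numbers out of a test result like this is what you will do whenever you want to use them further, for example in a table or a report.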
2.4.4 Replacing, adding and removing elements of a list
You can assign elements to a list using the $ sign. If the element does not exist yet in the list, it will be created; if it does, it will be replaced.
To remove an element, you need to use the special value NULL.
person <- list(name = "January", age = 117, pets=c("cat", "dog"))

# change the age
person$age <- 118

# add a new element
person$city <- "Hoppegarten"

# remove pets
person$pets <- NULL
person
$name
[1] "January"
$age
[1] 118
$city
[1] "Hoppegarten"
2.5 Data Frames
2.5.1 Creating data frames
Finally, we come to possibly the most important data structure in R – at least for us, biologists and medical researchers: data frames. Data frames are the closest thing to an Excel spreadsheet in R. They are used to store data in a tabular form, where each column can be of a different type. This makes them perfect for storing data from experiments, clinical trials, etc. You will be using them a lot.
In R, data frames are lists that were made to behave a lot like matrices. Thus, everything that you learned so far about lists can be applied to data frames, including accessing their elements (columns) using the $ operator. However, there are some differences between data frames and matrices.
The main feature of data frames that makes them a bit like matrices is the fact that each element of a data frame is a vector³ and that these vectors always have the same length. This means that one of the major differences between data frames and, say, Excel spreadsheets, is that a column of a data frame contains only elements of a single type. If a “cell” in a data frame is a character string, then the whole column is a character vector; if it is numerical, then the whole column is a numerical vector, etc.
³ Actually, it is a bit more complicated than that, but for now, let’s just say that each column of a data frame is a vector.
This may seem like a limitation, but it is, in fact, a good thing. It makes data more consistent and less prone to errors.
Like lists, data frames can be named or not, but typically they are. The names of a data frame are precisely the column names, and you can access (and modify) them using either the names() or the colnames() function.
names <- c("January", "William", "Bill")
lastn <- c("Weiner", "Shakespeare", "Gates")
age <- c(NA, 460, 65)

people <- data.frame(names=names, last_names=lastn, age=age)
people
names last_names age
1 January Weiner NA
2 William Shakespeare 460
3 Bill Gates 65
names(people)
[1] "names" "last_names" "age"
colnames(people)
[1] "names" "last_names" "age"
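Modifying the column names works just like modifying the names of a list. A short sketch, with the data frame created again so that the snippet runs on its own; the new column names are made up for this example:

```r
people <- data.frame(names = c("January", "William", "Bill"),
                     last_names = c("Weiner", "Shakespeare", "Gates"),
                     age = c(NA, 460, 65))

# rename the first column only
names(people)[1] <- "first_names"
names(people)

# rename all columns at once
colnames(people) <- c("first", "last", "age_years")
colnames(people)
```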
Like matrices, data frames have dimensions, which you can access using the dim(), nrow() and ncol() functions.
dim(people)
[1] 3 3
nrow(people)
[1] 3
ncol(people)
[1] 3
Also, like matrices, data frames can have row names; however, for reasons that will be clear later, they are not as important as column names and, in fact, we will not be using them⁴.
⁴ Row names in R data frames are very old school. Many people still use them, and many R functions produce them. However, we will be using the packages from the tidyverse family further down the line, which ignore the row names, and for good reasons.
2.5.2 Accessing and modifying columns of a data frame
Since data frames are lists, you can access their columns using the $ operator. This is the most common way of accessing columns of a data frame:
people$names
[1] "January" "William" "Bill"
You can also access columns using the square brackets, just like with matrices, using a comma to denote the columns and rows. However, there is a fine difference between matrices and data frames: if you select a single row, you will not get a vector, but a data frame with one row. If you think about that, it makes perfect sense: the different columns can have different data types, so they cannot be easily combined into a vector without losing some information.
people[1, ]
names last_names age
1 January Weiner NA
There is an issue when you try to access a single column of a data frame. We will discuss it at length in the following, but basically, this behavior is different for different flavors of data frames: base R data frames created with data.frame() return a vector, while others may return a data frame. Watch out for this!
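Because data frames are lists, columns can be modified, added and removed with the $ operator, just like the elements of a list. The following is a sketch under that assumption; the column name known_age is made up for this example:

```r
people <- data.frame(names = c("January", "William", "Bill"),
                     last_names = c("Weiner", "Shakespeare", "Gates"),
                     age = c(NA, 460, 65))

# modify a column: everybody gets one year older (NA stays NA)
people$age <- people$age + 1

# add a new column computed from an existing one
people$known_age <- !is.na(people$age)

# remove a column
people$last_names <- NULL
people

# in a base R data frame, a single column selected with square brackets is a vector
is.data.frame(people[ , "names"])
```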
2.5.3 Subsetting data frames with logical vectors
Just like with vectors, you can use logical vectors to subset data frames. This is extremely useful and very common in R. For example, we might want to select only rows that do not contain NA values in the age column.
people_with_age <- people[!is.na(people$age), ]
people_with_age
names last_names age
2 William Shakespeare 460
3 Bill Gates 65
That way we have filtered the data frame, leaving only persons with known age.
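One thing to watch out for: a comparison involving NA gives NA, and an NA in a logical index produces a row filled with NAs. The sketch below shows the problem and one way around it, using the which() function introduced earlier (older is a name made up for this example):

```r
people <- data.frame(names = c("January", "William", "Bill"),
                     last_names = c("Weiner", "Shakespeare", "Gates"),
                     age = c(NA, 460, 65))

# the comparison gives NA for the person whose age is unknown
older <- people$age > 100
older

# subsetting with the NA included returns two rows, one of them filled with NAs
nrow(people[older, ])

# which() keeps only the positions that are TRUE, dropping the NA
people[which(older), ]
```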
Actually, for data frames, you will commonly use the filter() function (which we will introduce tomorrow). However:
- the filter() function also uses logical vectors;
- you can subset many different data types using logical vectors (including matrices, lists, vectors, data frames), but filter() works only with data frames; and
- sometimes using logical vectors is just more convenient.
2.6 Libraries in R
2.6.1 Installing and loading packages
Starting tomorrow, we will cease to only use “base” R functions (that is, functions that are available in R “out of the box”) and start using additional packages. Packages in R are collections of functions (and often some other things, like data sets, documents and more).
Packages need to be installed before they can be used – but once you have installed a package, you don’t need to install it again (unless you upgrade R or want to update the package to a newer version).
However, to use a package, you also need to load it using the library() function. This you must do every time you start a new R session, because R “forgets” which packages were loaded the previous time. It is a bit like with the software for your operating system: you need to install your browser only once, but you have to start it each time after you start your computer.
Installing packages usually is straightforward, however at times it can be tricky.
At times you will get a warning that a package was “built” under a different version of R than the one that you are running. It is not an error (just a warning), and most of the time it can be ignored. It means what it says: that the installed package was built (e.g., compiled) with a different version of R. This can sometimes lead to problems, but most of the time it does not.
We will return to installing packages from different sources on Day 5.
Remember: you need to install a package only once with install.packages(), but you need to load it every time you start a new R session with library().
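A common pattern is to install a package only if it is missing and then load it. The following is a sketch of that idea, not a required recipe: the variable pkg is made up for this example, and the package "stats", which ships with R, stands in for any package name so that nothing is actually downloaded. requireNamespace() is a base R function that checks whether a package is installed.

```r
# "stats" ships with R and stands in here for any package name
pkg <- "stats"

# install the package only if it is not installed yet
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# load the package; character.only = TRUE is needed
# because the package name is stored in a variable
library(pkg, character.only = TRUE)
```

In everyday use you would simply type the name directly, as in library(tidyverse).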
2.6.2 Data frames, tibbles & co.: different flavors of R
It is now time to reveal some ugly truths about R. R is open source, and everyone can modify it, add new packages etc. This is great and resulted in the vibrant user community that R has. Also, there is hardly a statistical method or framework that is not represented in R. In addition, developing and publishing new packages for R is incredibly easy, at least compared to some other languages.
However, this has a downside. There are many different flavors of R, and, unfortunately, some of the most popular packages or groups of packages can clash. We will now spend some time on one particular example: data frames. Firstly, because you will be using data frames a lot; secondly, because it neatly illustrates the problem; and thirdly, to introduce a new type of data – tibbles.
Base R data frames are created using the data.frame() function. They are useful and we use them a lot, but they have one tiny inconsistency that can cause a lot of trouble. Say you define your data frame like this:
df <- data.frame(a = 1:3, b = c("a", "b", "c"))
When you access a single column of this data frame using the square brackets [ ], you will get a vector. We can check it with the is.data.frame() function:
is.data.frame(df[ , 1])
[1] FALSE
This is consistent with the behavior of matrices (where one column or one row becomes a vector), but not with the behavior of the data frames themselves: because when you access a single row, you are getting a data frame, not a vector:
is.data.frame(df[1, ])
[1] TRUE
This inconsistency can mess up your code. Imagine that you have somehow automatically selected some columns of a data frame – for example, by selecting columns that start with a certain letter (we will learn how to do that on Day 4). You store the selection in a variable called sel_cols. You can select the columns from the data frame using the df[ , sel_cols] syntax. However, depending on whether there was a single column selected or more, you will get either a vector or a data frame. This is annoying and, in the worst case scenario, it can break your program.
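The scenario can be reproduced in a few lines. In this sketch, the selection is simply typed in by hand; the last line uses the drop = FALSE argument of the square brackets, a base R feature not otherwise discussed in this chapter, which asks R not to simplify the result:

```r
df <- data.frame(a = 1:3, b = c("a", "b", "c"))

# two columns selected: the result is a data frame
sel_cols <- c("a", "b")
is.data.frame(df[ , sel_cols])

# one column selected: the very same code now returns a vector
sel_cols <- "a"
is.data.frame(df[ , sel_cols])

# with drop = FALSE, the result stays a data frame
is.data.frame(df[ , sel_cols, drop = FALSE])
```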
Many people noticed this, and proposed solutions. In R, it is possible to take a class of an object (like data.frame) and modify it. One of the most commonly used such modifications is called a tibble and has been implemented in the Tidyverse group of packages, which we will be using extensively in the days to come.
2.6.3 Data frames and tibbles
The Tidyverse data frame is called a tibble. It behaves almost exactly like a data frame, with a few crucial differences:
- when printing the tibble to the console, the output is nicer
- tibbles never have row names (and that is why we will not be using row names)
- when you select a single column of a tibble using the square brackets [ ], you always get a tibble, not a vector
To use tibbles, you need the tidyverse package⁵. You can install it using install.packages("tidyverse") and load it using library(tidyverse).
⁵ Actually, the tidyverse package is a meta-package: it just loads a collection of packages that are often used together. The tibble package, which defines tibbles, is one of them.
library(tidyverse)
tbl <- tibble(a = 1:3, b = c("a", "b", "c"))
tbl
# A tibble: 3 × 2
a b
<int> <chr>
1 1 a
2 2 b
3 3 c
is.data.frame(tbl)
[1] TRUE
is_tibble(tbl)
[1] TRUE
As you can see, a tibble is both a data frame and a tibble. You can use tibbles as a drop-in replacement for data frames. We will be seeing it a lot, because many useful functions from the Tidyverse family produce tibbles, not data frames. However, for you, as the user, the differences will be mostly cosmetic. So far, so good.
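The difference in how single columns are handled can be checked directly. This sketch assumes that the tibble package is installed (it comes with the tidyverse) and loads only that package:

```r
library(tibble)

tbl <- tibble(a = 1:3, b = c("a", "b", "c"))

# square brackets: the result is still a tibble, even for a single column
is_tibble(tbl[ , 1])

# $ and [[ ]] extract the column itself, just like with lists
is.vector(tbl$a)
identical(tbl$a, tbl[["a"]])
```

So if you need the column as a vector, use $ or double square brackets; with single square brackets, you always know that you will get a tibble back.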
However, if you are reading this book, chances are that you are a biologist or medical researcher, and that means that sooner or later you will be using packages from the Bioconductor project. Bioconductor is a collection of R packages for bioinformatics, genomics, and related fields. They are incredibly valuable; you can hardly do bioinformatics without them. However, Bioconductor defines its own alternative to data frames, called DataFrame. It is a bit different from data frames, and it is not compatible with tibbles. If you want to process DataFrames produced by Bioconductor packages with Tidyverse functions, you need to convert them to a regular data.frame using the as.data.frame() function.
The code below will not work until you install the Bioconductor package S4Vectors. Don’t worry if it does not work – we will not be using Bioconductor in this course, but I wanted to show you where the problem is.
library(S4Vectors)
DF <- DataFrame(a = 1:3, b = c("a", "b", "c"))
DF
DataFrame with 3 rows and 2 columns
a b
<integer> <character>
1 1 a
2 2 b
3 3 c
is.data.frame(DF)
[1] FALSE
As you can see, the DataFrame object is not a data.frame. And this means that the Tidyverse function filter() does not see it as one (you will learn about the filter() function on Day 4):
library(tidyverse)
filter(DF, a > 1)
Error in UseMethod("filter"): no applicable method for 'filter' applied to an object of class "c('DFrame', 'DataFrame', 'SimpleList', 'RectangularData', 'List', 'DataFrame_OR_NULL', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"
2.7 Review
Things you learned today:
- Logical vectors:
  - subsetting with logical vectors
  - using comparison operators like > to create logical vectors
  - using the is.na() function to check for NA values
  - using the which() function to find the positions of TRUE values
  - using the ! operator to negate logical vectors
- Matrices:
  - creating matrices using the matrix() function
  - measuring matrices with dim(), nrow() and ncol()
  - rows first, then columns
  - accessing elements, rows, columns and submatrices of a matrix
  - naming rows and columns of a matrix
- Lists:
  - creating lists using the list() function
  - accessing elements of a list using [, [[ and $
  - lists as return values from functions
  - replacing, adding and removing elements of a list
- Data frames:
  - creating data frames using the data.frame() function
  - accessing and modifying column names of a data frame
  - accessing and modifying elements of a data frame
  - subsetting data frames
  - adding and removing columns from a data frame
  - merging data frames
  - creating tibbles with Tidyverse and tibble()
- Other:
  - constant vectors LETTERS and letters
  - using the rep() function
  - generating random numbers with rnorm() and runif()
  - generating sequences with seq()
  - running a t-test using function t.test()
  - making a boxplot with boxplot()
  - the special value NULL
  - converting matrices to data frames with as.data.frame()
  - installing packages with install.packages()
  - loading packages with library()