install.packages("tidyverse")
install.packages("readxl")
3 Reading and Writing Files
3.1 Aims for today
- Reading data
- Cleaning data
- Regular expressions
- Data management
Today is a special day. If there is one thing I would like you to learn from this course, it is how to read and write data.
3.2 Reading data
3.2.1 Data types
In your work you will encounter many different types of data. Most frequently, you will work with tabular data, either as Excel files or as comma or tab separated values (CSV and TSV files, respectively). I am sure you have worked with such files before.
To read these files, we will use two packages: readr and readxl. The former, readr, is part of the tidyverse package, so when you load the tidyverse using library(tidyverse), readr is loaded as well. The latter, readxl, is a separate package that you need to install and load separately.
There are also “base R” functions: read.table(), read.csv() and read.delim() (there is no function for reading XLS[X] files in base R). These are always available when you start R, but don’t use them. The tidyverse functions are not only faster, but also much better behaved and, most importantly, safer – it is less likely that you will mess up your data with them.
tidyverse functions return tibbles, which, as you remember from yesterday, are a special flavor of data frames. Just to refresh your memory, here are key facts about tibbles:
- in most of the cases, they behave exactly like data frames
- when you print them, they are nicer
- tibbles have no row names
- when you select columns using [ , sel ], you always get a tibble, even if you selected only one column (see the sketch below)
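To see the last point in action, here is a minimal sketch using R’s built-in iris data frame:

library(tibble)
iris_tbl <- as_tibble(iris)

# a base data frame drops to a vector when you select a single column...
class(iris[, "Species"])
# [1] "factor"

# ...but a tibble stays a tibble
class(iris_tbl[, "Species"])
# [1] "tbl_df"     "tbl"        "data.frame"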
Functions: read_table(), read_tsv(), read_csv(), read_delim(), read_xls(), read_xlsx(), read_excel()
Data type | Function | Package | Notes |
---|---|---|---|
Columns separated by spaces | read_table() | readr / tidyverse | One or more spaces separate each column |
TSV / TAB separated values | read_tsv() | readr / tidyverse | Delimiter is the tab character (\t) |
CSV / comma separated values | read_csv() | readr / tidyverse | Delimiter is the comma |
Any delimiter | read_delim() | readr / tidyverse | Customizable delimiter |
XLS (old Excel) | read_xls(), read_excel() | readxl | Avoid using XLS files |
XLSX (new Excel) | read_xlsx(), read_excel() | readxl | You can provide the number of the sheet you wish to read. Note: returns a tibble, not a data frame! |
As you can see, the functions we recommend become available by loading the packages tidyverse and readxl. If you haven’t done so yet, please install these packages now:

install.packages("tidyverse")
install.packages("readxl")

However, before we start using these functions, we need to dive into a very important problem: where are your files?
- For reading text files (CSV, TSV etc.), use the readr package. This package is loaded automatically when you load the tidyverse package: library(tidyverse). Then, use the functions read_csv, read_tsv etc.
- For reading Excel files, use the readxl package: library(readxl). Then, use the function read_excel.
3.2.2 Where is my file? Relative and absolute paths
We are coming now to a crucial problem and a source of endless frustration for beginning R programmers: how to tell R where your file is. Fortunately, there are many ways to deal with that, both with R and RStudio. Still, this is a key problem and we would like you to spend some time on the following chapter.
In order to access a file, a program (any program) needs what is known as a path. A path is a character string that tells the program how to get to the file. It looks, for example, like this on a Windows computer: C:/Users/johndoe/Documents/RProjects/MyFirstRProject/data.csv, and like this on a Mac: /Users/johndoe/Documents/RProjects/MyFirstRProject/data.csv.
Most computer systems separate the directories and files in a path using a slash (/). However, Windows uses a backslash (\). This is quite annoying for R users, because in R character strings the backslash has a special meaning. To use backslashes, you need to “escape” them by putting another backslash in front of each one. So instead of C:\Users\johndoe\Documents, you need to write C:\\Users\\johndoe\\Documents. Alternatively, you can use the forward slash even on a Windows system: C:/Users/johndoe/Documents. We recommend the latter approach.
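Here is a quick sketch of the two equivalent ways to write the same (hypothetical) Windows path in R:

# escaped backslashes...
path1 <- "C:\\Users\\johndoe\\Documents"
# ...or forward slashes, which we recommend
path2 <- "C:/Users/johndoe/Documents"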
This is also why simply copying a path from the Windows Explorer to your R script will not work in R – because the copied text contains single backslashes.
If R cannot find your file, it will return an error message. At first, you will be seeing this error a lot:
library(tidyverse)
<- read_csv("data.csv") mydata
Error: 'data.csv' not found in the current working directory
('C:/Users/johndoe/Documents/RProjects/MyFirstRProject/').
Before we proceed, you need to understand one important thing. When you start your RStudio and R session, the R session runs in a specific directory. This is called the working directory. You can check what it is using the getwd() function1:
1 You can change it using the setwd() function, but avoid doing that. This path leads to madness, trust me on that.
getwd()
[1] "C:/Users/johndoe/Documents/RProjects/MyFirstRProject"
Of course, the result will be different on your computer. By the way, the above is called an absolute path. That means that it works no matter where you are in your system, because a program can always find the file or directory using this path.
The easiest way to read the data is this: copy your data files to the directory returned by getwd().
The following code should now work without an error:
library(tidyverse)
<- read_csv("iris.csv") iris_data
Now the contents of the file are stored in the iris_data object. There are many ways to have a look at it:
- type iris_data or print(iris_data) in the console
- type View(iris_data) in the console
- click on the little spreadsheet icon next to the iris_data object in the Environment tab in RStudio (upper right panel)
Please make sure that the above works for you. If it does not, read the instructions again. In RStudio, there is a “Files” tab in the lower right panel. You should see your working directory as well as the “iris.csv” file there.
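If you are unsure whether R can see the file, a quick check from the console helps (this is base R, no packages needed):

# returns TRUE if "iris.csv" exists in the current working directory
file.exists("iris.csv")
# [1] TRUE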
3.2.3 Reading with relative paths
OK, so far, so good. That was easy. Now comes a slightly harder part.
Saving your data files in the working directory works well if you have one or two. However, the more files you read and write, the more cluttered your project directory becomes. You will soon find yourself in a situation where you have no idea which file is which.
It is generally a good idea to keep your data files in a separate directory, or even multiple directories.
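For example, you could create a subdirectory called “Datasets” inside your project directory and move iris.csv there. Then the read becomes:

iris_data <- read_csv("Datasets/iris.csv")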
The path “Datasets/iris.csv” is called a relative path, because it is relative to the working directory and will not work from another location. So why should we use relative paths? Wouldn’t it be easier to use absolute paths all the time, for example read_csv("C:/Users/johndoe/Documents/RProjects/MyFirstRProject/Datasets/iris.csv")?
Actually, no. The problem is that if you move your R project to another location, or if you share it with someone else, the absolute path will no longer work. In other words, the absolute path is not portable.
Do not use absolute paths in your code. Always use relative paths.
3.2.4 More on relative paths
Sometimes the data files are not in your R project directory. For example, say you are writing your PhD thesis. You have created a directory called “PhD”, which contains the directories “manuscript”, “data”, “images”, “R_project” and so on. You use R for part of the calculations, but you want to keep the data files in the “data” directory. How do you read them?
When you type getwd(), you will get the path to the “R_project” directory, something like C:/Users/johndoe/Documents/PhD/R_project. The data files are in the directory C:/Users/johndoe/Documents/PhD/data. To get the relative path from the R project directory to the data directory, think about how you would navigate from one directory to the other in the Windows Explorer or Finder. You need to go up one level, into “PhD”, and then down again into “data”.
Going “up one level” in a path is done by using .. – so the relative path to the file “iris.csv” in your “data” directory is ../data/iris.csv.
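Putting this together, reading the file from the sibling “data” directory looks like this (assuming the thesis layout described above):

# go up one level (..), then down into "data"
iris_data <- read_csv("../data/iris.csv")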
But what about the case when your data directory is at a completely different location? For example, on a different drive, or maybe on your desktop?
First, I don’t recommend keeping your data files separately from the R project directory. In general, try to put everything in one place, as part of one structure. This structure can be complex, but it should be coherent. If necessary, copy the data files into your R project directory.
However, sometimes it simply isn’t possible. Maybe the files are huge and you need to read them from a special drive. In this case, there are three options.
Using absolute paths. Yes, I told you not to use absolute paths, but sometimes you have no choice.
Create shortcuts. On all systems it is possible to create shortcuts to your data directories (on Unix-like systems such as macOS they are called “symbolic links”). You can put these shortcuts in your R project directory – R will treat them as normal directories.
Create a complex relative path. Depending on how far “away” your data directory is, you can create a complex relative path. For example, if your R project directory is C:/Users/johndoe/Documents/PhD/R_project and your data directory is D:/Data, you can use the path ../../../../Data/iris.csv. Unfortunately, despite being a relative path, this is not very portable (and it is easy to lose count of the ..’s).
3.2.5 Using RStudio to import files
RStudio has a very nice feature that allows you to import files using a point-and-click interface. When you click on a data file in the “Files” tab (lower right panel), you will see two options: “View File” and “Import Dataset”. Choosing the latter opens a dialog window which shows a preview of the file and allows you to construct your read_csv, read_tsv or other command using a point-and-click interface. You also see a preview of the expression that will be executed.
This feature is particularly useful when you are not sure about the format of the data to import, e.g. what type of delimiter is used, how many lines should be skipped etc.
Then you can click on “Import” to actually run the command; however, I would recommend another approach. Clicking on “Import” runs the command directly in the console, bypassing your script – and you should always enter the code in your script before executing it.
Rather than clicking on “Import”, click on the little Copy icon next to the “Code preview” field, and then cancel the dialog. Paste the copied code into your script, and then run it.
There is one more thing to modify. The code generated by RStudio often uses absolute paths. Modify it to use relative paths to ensure the portability of your script.
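For example, the generated code might look like the first line below (the path is hypothetical); edit it to the relative form before running it:

# generated by RStudio – absolute, not portable:
# iris_data <- read_csv("C:/Users/johndoe/Documents/RProjects/MyFirstRProject/Datasets/iris.csv")

# edited to a relative path:
iris_data <- read_csv("Datasets/iris.csv")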
3.2.6 Reading Excel files
Reading files sometimes requires diligence. This is especially true for Excel files – they can contain multiple sheets, tables often do not start on the first row etc.
The readxl package (which you hopefully used successfully to read the XLSX file in the previous exercise) contains several example files. They were installed on your system together with the package. Manually finding these example files is annoying, which is why the readxl package provides a convenience function, readxl_example(), that returns the absolute path to an example file (yes, I know what I said about absolute paths; this is an exception).
library(readxl)
fn <- readxl_example("deaths.xls")
print(fn)
[1] "/home/january/R/x86_64-pc-linux-gnu-library/4.4/readxl/extdata/deaths.xls"
deaths <- read_excel(fn)
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
head(deaths)
# A tibble: 6 × 6
`Lots of people` ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 simply cannot resist writing <NA> <NA> <NA> <NA> some not…
2 at the top <NA> of their sp…
3 or merging <NA> <NA> <NA> cells
4 Name Profession Age Has kids Date of birth Date of …
5 David Bowie musician 69 TRUE 17175 42379
6 Carrie Fisher actor 60 TRUE 20749 42731
If you view this file using the head() function (or if you use the RStudio data import feature), you will notice that the actual data in this file starts on line 5; the first 4 lines contain the text “Lots of people simply cannot resist writing some notes at the top of their spreadsheets or merging cells”. To skip these lines, we need to use an argument to the read_excel() function. If you look up the help page for the function (e.g. using ?read_excel), you will find the following fragment:
skip Minimum number of rows to skip before reading anything,
be it column names or data. Leading empty rows are
automatically skipped, so this is a lower bound. Ignored
if range is given.
Therefore, we can modify the code above to skip the first four lines:
deaths <- read_excel(fn, skip=4)
head(deaths)
# A tibble: 6 × 6
Name Profession Age `Has kids` `Date of birth` `Date of death`
<chr> <chr> <chr> <chr> <dttm> <chr>
1 David Bow… musician 69 TRUE 1947-01-08 00:00:00.000 42379
2 Carrie Fi… actor 60 TRUE 1956-10-21 00:00:00.000 42731
3 Chuck Ber… musician 90 TRUE 1926-10-18 00:00:00.000 42812
4 Bill Paxt… actor 61 TRUE 1955-05-17 00:00:00.000 42791
5 Prince musician 57 TRUE 1958-06-07 00:00:00.000 42481
6 Alan Rick… actor 69 FALSE 1946-02-21 00:00:00.000 42383
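The help fragment above also mentions the range argument, which overrides skip. As a sketch (the cell rectangle here is our guess for this file), you could select the table area directly:

# read only the rectangle containing the table, header row included
deaths <- read_excel(fn, range = "A5:F15")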
3.3 Diagnosing and cleaning data
3.3.1 Diagnosing datasets
Congratulations! You are now proficient in reading data into R. Believe it or not, this alone is an important step forward – many who try their hand at R get stuck at this point.
However, reading data is only the first step. In most of the cases, the data requires some treatment: cleaning, transformation, filtering etc. In many projects that I have worked on as a bioinformatician, data import, diagnosis and cleaning took up the majority of time spent on the project. Unfortunately, this is necessary before any real fun with the data can start.
Let us examine the file iris.csv that you have just read. The dataset comes from a famous paper by Ronald Fisher, who used it to demonstrate his newly developed method called linear discriminant analysis – an early machine learning algorithm, if you will. The dataset contains measurements (in cm) of 150 flowers of three species of irises: Iris setosa, Iris versicolor and Iris virginica. The measurements are the sepal and petal length and width, four measurements in total. Each row consists of these four measurements, plus a column which contains the species name.
I have doctored the file to mimic typical problems with imported data, especially in clinical trial settings. Humans who enter data make errors; this is normal and expected, all of us do. Before we analyse the data, we need to correct the errors. Before we correct them, we need to find them.
First, note that when you read the data using read_csv, the function conveniently shows which types of data were assigned to each column:

iris_data <- read_csv("Datasets/iris.csv")
Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Sepal Length, Petal?Length, Species
dbl (2): Sepal Width, Petal.Width
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This should already ring some alarm bells. First, it looks like the columns are not consistently named: we have Sepal Length and Petal?Length, and also Petal.Width.
Secondly, we would expect the measurements to be imported as numbers. However, both sepal and petal lengths were imported as characters. We can confirm this using the class function:
class(iris_data[["Sepal Length"]])
[1] "character"
class(iris_data[["Sepal Width"]])
[1] "numeric"
class(iris_data[["Petal?Length"]])
[1] "character"
class(iris_data[["Petal.Width"]])
[1] "numeric"
class and typeof: The class function returns the class of an object, which is a higher-level classification of the object. An object can have multiple classes. The typeof function returns the internal storage type of the object, which is a lower-level classification. For example, both tibbles and data frames have the type list, but their classes are different. Another example: if mtx is a matrix of numbers, typeof(mtx) is double, and class(mtx) is matrix.
Note that instead of iris_data$Sepal Length (which will not work, because there is a space in the column name), we use the double bracket notation. An alternative would be to use quotes: iris_data$'Sepal Length'. This is one reason why we want to avoid spaces and special characters in column names (in a moment we will show you how to standardize column names). If you use tab completion in RStudio, the quotes will be inserted automatically.
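A quick sketch of the equivalent ways to pull out such a column (the column names are from our doctored file):

# double brackets work with any column name
iris_data[["Sepal Length"]]
# quoting the name after $ also works
iris_data$'Sepal Length'
# with a simple name, the usual $ notation would suffice, e.g.:
# iris_data$Species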
We can also use the summary function on the whole dataset:
summary(iris_data)
Sepal Length Sepal Width Petal?Length Petal.Width
Length:150 Min. : 2.000 Length:150 Min. :0.100
Class :character 1st Qu.: 2.800 Class :character 1st Qu.:0.300
Mode :character Median : 3.000 Mode :character Median :1.300
Mean : 3.411 Mean :1.199
3rd Qu.: 3.375 3rd Qu.:1.800
Max. :36.000 Max. :2.500
Species
Length:150
Class :character
Mode :character
Oops, another problem. Did you notice the maximum value for sepal widths? It is 36 cm. This cannot be right – way too large for these flowers. Also, both the mean and the median are around 3 cm, so this must be either an outlier or a clerical error.
summary() functions: Under the hood, there is no single summary() function. Instead, different classes can have different types of summaries. Whenever you produce a result, always try the summary function with it.
Alternatively, we can use the tidyverse glimpse function. Rather than providing a summary, this function uses a terse format to show the data type for each column and the first few values:
glimpse(iris_data)
Rows: 150
Columns: 5
$ `Sepal Length` <chr> "5.1", "4.9", "4.7", "4.6", "5", "5.4", "4.6", "5", "4.…
$ `Sepal Width` <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, …
$ `Petal?Length` <chr> "1.4", "1.4", "1.3", "1.5", "1.4", "1.7", "1.4", "1.5",…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, …
$ Species <chr> "setosa", "setosa", "setosa", "setosa", "setosa", "seto…
Another way to diagnose the dataset is to use the str function. This provides a more detailed, technical summary of each column:
str(iris_data)
spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Sepal Length: chr [1:150] "5.1" "4.9" "4.7" "4.6" ...
$ Sepal Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal?Length: chr [1:150] "1.4" "1.4" "1.3" "1.5" ...
$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
- attr(*, "spec")=
.. cols(
.. `Sepal Length` = col_character(),
.. `Sepal Width` = col_double(),
.. `Petal?Length` = col_character(),
.. Petal.Width = col_double(),
.. Species = col_character()
.. )
- attr(*, "problems")=<externalptr>
3.3.2 Using skim() to diagnose datasets
Another option for diagnosing the dataset is the skim function from the skimr package, which you (hopefully) installed yesterday. The skimr package provides a more detailed summary of the dataset, including the number of missing values, the number of unique values and summary statistics. To use it, you first need to load the package:
library(skimr)
The skim() function shows more than the str function. In addition to column types, for character and factor columns it shows, among other things, the number of unique values. For numerical columns, it shows non-parametric summary statistics (range, median, quartiles):
skim(iris_data)
Name | iris_data |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Sepal Length | 0 | 1 | 1 | 5 | 0 | 37 | 0 |
Petal?Length | 0 | 1 | 1 | 4 | 0 | 47 | 0 |
Species | 0 | 1 | 6 | 10 | 0 | 6 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sepal Width | 0 | 1 | 3.41 | 3.16 | 2.0 | 2.8 | 3.0 | 3.38 | 36.0 | ▇▁▁▁▁ |
Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.3 | 1.80 | 2.5 | ▇▁▇▅▃ |
Hm, when you look at the output you might notice one more thing: for the last column, “Species”, skim() shows that there are six unique values (take a look at the n_unique column). How can that be? We know that there are only three species in this dataset. We can check this using the unique function:
unique(iris_data[["Species"]])
[1] "setosa" "Setosa" "versicolor" "Versicolor" "virginica"
[6] "Virginica"
There it is. Looks like whoever typed the data2 sometimes used a lower-case species designation, and sometimes upper-case. However, for R (and for computers in general), “versicolor” and “Versicolor” are two different things.
2 I hate to say it, it was probably me.
More information can be gained using the table() function:
table(iris_data[["Species"]])
setosa Setosa versicolor Versicolor virginica Virginica
45 5 42 8 46 4
Here we not only see the unique values, but also their frequencies. The table() function is very useful for any categorical data, as you will see later.
For computers in general and R in particular, “lowercase” and “Uppercase” are two different things. The variables a and A are different, as are versicolor and Versicolor. This is called case sensitivity.
3.3.3 Checking individual values
We have seen before that some measurement columns were imported as character vectors, while others were correctly imported as numbers. To understand why, we need to have a closer look at the individual columns.
One of the columns that was imported as a character vector is “Sepal Length”. We will extract the column as a vector and then try to manually convert it to numbers:

sepal_length <- iris_data[["Sepal Length"]]
as.numeric(sepal_length)
Warning: NAs introduced by coercion
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 NA 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 NA 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
Two things to note: first, R issued a warning, “NAs introduced by coercion”. That means that some values could not be converted to numbers and were replaced by the special value NA3. Second, there was no error: R happily let you create a vector of numbers even though there were some problems.
3 See Day 1 for more information on NA values.
When you look at the output, you will easily spot the NA values. You could even figure out at which positions in the vector they occur. However, this is not the way to go – what if you had not 150 values, but 150,000? Instead, use the is.na function:
# convert to numbers
sepal_length_n <- as.numeric(sepal_length)
Warning: NAs introduced by coercion
# how many are NA?
sum(is.na(sepal_length_n))
[1] 2
# which are NA?
which(is.na(sepal_length_n))
[1] 70 119
# show the values
sepal_length[is.na(sepal_length_n)]
[1] "5,6" "> 7.7"
One more problem remains: the value “36” in the sepal width column. We can check which values are greater than 10 with a similar approach:
<- iris_data[["Sepal Width"]]
sepal_width
# how many are greater than 10?
sum(sepal_width > 10)
[1] 2
# which are greater than 10?
which(sepal_width > 10)
[1] 42 110
# show the values
sepal_width[sepal_width > 10]
[1] 23 36
3.3.4 Diagnosing datasets: a checklist
Whenever you read a data set, you should check the following things:
Column names. Are they consistent? Are they easy to type (no spaces, no special characters)? Are they in the correct order? Are they what you expect them to be? Do you understand what each column is?
Data types. Are the columns of the correct type (e.g. numerical columns imported as numbers)?
Categorical variables. Do you see what you expect? Is the number of categories correct? Do you understand what each category is?
Numerical variables. Do the summary statistics make sense? Are there any outliers? Missing values? Were the variables imported incorrectly as characters? Why?
Missing values. Are there any missing values? Why? Are they important? How to deal with them?
3.4 Mending the data
3.4.1 Correcting column names
Column names should contain only letters, numbers and underscores. While all other characters are also allowed (so a column name can be 豚肉 😀 "temp."? (°C)), it is much easier to work with simple, standardized names.
You already know one way of dealing with this – simply assign new names to the columns:
colnames(iris_data) <- c("sepal_length", "sepal_width", "petal_length",
"petal_width", "species")
colnames(iris_data)
[1] "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
However, there is a quicker way. The janitor package contains a function called clean_names() that does exactly that, automatically and for all columns. Install it with install.packages("janitor") if you haven’t done so yet.
library(janitor)
iris_data <- clean_names(iris_data)
colnames(iris_data)
[1] "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
My advice is to use clean_names on every dataset you import.
3.4.2 Correcting outliers
In the column with sepal widths, which now has the name sepal_width, we found two values that probably lack the decimal point: 23 and 36. They are at positions 42 and 110, respectively.
It is tempting to simply substitute the data in the original data frame:
# works, but don't do it
iris_data$sepal_width[42] <- 2.3
Unfortunately, this is not a good idea, because it is a bit like manual editing in a spreadsheet. There is a better way. First, generalize the issue: what seems to be the problem? The values that are larger than 10 are probably missing a decimal point. The simplest solution is to divide them by 10. Using logical vectors, we can do that for all values larger than 10.
# numbers that are missing a decimal point
too_large <- iris_data$sepal_width > 10
iris_data$sepal_width[too_large] <- iris_data$sepal_width[too_large] / 10
This is a bit hard to read, so let us make it more explicit:
# numbers that are missing a decimal point
sepal_width <- iris_data$sepal_width
too_large <- sepal_width > 10

sepal_width[too_large] <- sepal_width[too_large] / 10
sepal_width[too_large]
numeric(0)
Looks good. Finally, we assign the corrected values back to the data frame:

iris_data$sepal_width <- sepal_width
As the last step, we need to check whether our approach worked for all values in the column. For this, we will use the any() function, which returns TRUE if any of the values in a logical vector is TRUE:
any(iris_data$sepal_width > 10)
[1] FALSE
3.4.3 Correcting categorical data
In the column “species”, we found that there are two different values for the same species. There are many ways we can handle it. Obviously, we could manually assign the correct labels, but that would be a lot of work.
A better way would be to automatically convert all the values. In our case, the problem is quite simple: it is sufficient to convert all values to lower case, so that Versicolor becomes versicolor.
We can do that in one line of code thanks to the tolower function:

iris_data$species <- tolower(iris_data$species)
There is also a corresponding function, toupper, that converts all letters to upper case.
Remember to check that your changes were successful:
table(iris_data$species)
setosa versicolor virginica
50 50 50
This was a simple case, but what if we had more complex typos? For example, “i.versicolor”, “vers.”, “V” etc. In such cases, you would need to use a more sophisticated approach. This approach involves regular expressions, and will also help us to correct the sepal and petal lengths.
3.4.4 Incorrectly imported numeric data
We have two columns which were imported as characters, but should be numbers. Their brand-new names are sepal_length
and petal_length
. All in all, there are 2 problems in the sepal length column and 5 in the petal length column, as shown in the table below:
Column | Position | Original | Correct |
---|---|---|---|
sepal_length | 70 | 5,6 | 5.6 |
sepal_length | 119 | > 7.7 | 7.7 |
petal_length | 12 | 1.6! | 1.6 |
petal_length | 13 | 1 .4 | 1.4 |
petal_length | 14 | 1.1? | 1.1 |
petal_length | 115 | 5k.1 | 5.1 |
petal_length | 149 | 5. 4 | 5.4 |
The last column in the table above is our target – this is how we would like to correct the data. How can we do that?
Of course, without additional information we can only guess that 5,6 should be 5.6. Maybe there were two measurements, 5 and 6? Maybe 5. 4 should be 5.04, and not 5.4? In a real-world scenario, you would need to consult the author(s) of the data, or check your lab book.
It is tempting to simply substitute the data in the original data frame:
# works, but don't do it
iris_data$sepal_length[70] <- 5.6
As before, that is not a good idea. If you receive another version of the file with added lines before the 70th, the position will change and you will have to make the correction manually again. Also, if there are new problems, like yet another typo in the data, you will have to spot them and manually add lines to correct them. Again, we need to generalize the issue.
What is the problem on line 70? A comma instead of a decimal dot – maybe someone who typed it was used to entering numbers in a German version of Excel, where a comma is used instead of a dot. So why not replace all the commas by a dot? This should do the trick not only for line 70, but also for possible future problems.
For that we will use the function str_replace_all from the tidyverse4. Before we do that, however, we will make a copy of the column. That will allow us to easily see the changes we will make, check that everything is OK, and only then replace the column in the data frame. Later on, you will learn more efficient ways of doing this.
4 In base R, you can use the gsub function. However, it has a slightly different syntax, which makes it less convenient when you start using pipes tomorrow.
# make a copy
sepal_length <- iris_data$sepal_length

# record the problematic places
problems <- is.na(as.numeric(sepal_length))
Warning: NAs introduced by coercion
sepal_length[problems]
[1] "5,6" "> 7.7"
# replace the comma
sepal_length <- str_replace_all(sepal_length, ",", ".")

# what was the result?
sepal_length[problems]
[1] "5.6" "> 7.7"
It worked! str_replace_all is a general search-and-replace function. You can always use it with character vectors to replace things.
However, we still have the problem with > 7.7. You will find such values quite often in clinical settings, along with the counterpart < 7.7. These may indicate that the measurement was out of the range of an instrument. However, we usually cannot treat them as missing values, because they contain some information. Depending on whether or not we want to keep this information, we could either replace them with NAs, or decide to keep the value as is (in this case, change > 7.7 to 7.7).
In the latter case, we can use the str_replace_all function again:

sepal_length <- str_replace_all(sepal_length, "> ", "")
In any case, the last point that remains is to convert the vector to numbers and assign it to the column holding sepal lengths. We do it in one go and also check if everything is OK:
# finalize
iris_data$sepal_length <- as.numeric(sepal_length)
# check whether the column is numeric
is.numeric(iris_data$sepal_length)
[1] TRUE
# check whether our problems are gone
iris_data$sepal_length[problems]
[1] 5.6 7.7
# check whether there are any NA's
any(is.na(iris_data$sepal_length))
[1] FALSE
The new function, is.numeric(), checks whether the column sepal_length is indeed numeric. Finally, we make sure that no NA’s were produced in the conversion.
3.5 Regular expressions
3.5.1 Introduction to regular expressions
Regular expressions, regexps for short, are for finding certain patterns in the data and handling them automatically. They are quite old (going back at least to the 1960s) and have remained virtually the same for many decades. You will find them in almost all programming languages, and while details may differ, the principles are usually the same. The regexps in R are very similar to regular expressions in other programming languages such as Python. As you may guess, they have a different syntax than R – they are something else altogether.
We will not go into regular expressions in depth here; just a short primer.
Before we start, let us introduce a new function called str_detect()5. The function is available as soon as you load the tidyverse package. It is used to search for a pattern in a character vector. For every element that matches the pattern it returns TRUE, otherwise FALSE. Say we have a vector (or data frame column) containing sample names, and we want to find all the controls.
5 In base R, there is the grep() function. It is very similar, but has a different syntax. We will stick to str_detect() for now.
<- c("ko_1_ctrl", "ko_2_ctrl", "ko_1_treat", "ko_2_treat",
samples "wt_1_ctrl", "wt_2_ctrl", "wt_1_treat", "wt_2_treat")
str_detect(samples, "ctrl")
[1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
str_detect(samples, "ctrl")] samples[
[1] "ko_1_ctrl" "ko_2_ctrl" "wt_1_ctrl" "wt_2_ctrl"
In case someone did not pay attention to lower or upper case, we can tell str_detect to ignore case:
<- c("ko_1_ctrl", "ko_2_CTRL", "ko_1_treat", "ko_2_treat",
samples "wt_1_CTRL", "wt_2_ctrl", "wt_1_treat", "wt_2_treat")
# this does not work
str_detect(samples, "ctrl")
[1] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# but this does
str_detect(samples, regex("ctrl", ignore_case=TRUE))
[1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
Here we are looking for the literal string “ctrl”. However, we can use regular expressions to search for more complex patterns. What if the sample names are more haphazard, like ko_1_control, ko_2_kontrol, ko_1_ctrl?

samples <- c("ko_1_control", "ko_2_ctrl", "ko_1_trt", "ko_2_treatment",
             "wt_1_Control", "wt_2_kontrol", "wt_1_Trtmt", "wt_2_treated")
OK, so we got control, ctrl, Control and kontrol. A pattern emerges:
- the first letter is either k, c or C
- then we might have an on (or not)
- next always comes tr
- then we might have an o
- then we always have an l
- then the word ends.
All these “mights” and “ors” and “eithers” and “always” can be encoded in a regular expression:
Pattern | Explanation |
---|---|
[kcC] | must have one of the letters k, c or C |
o? | zero or one o (i.e., “there might be an o”) |
n? | zero or one n (i.e., “there might be an n”) |
tr | must have literally the string tr |
o? | zero or one o (i.e., “there might be an o”) |
l | must have literally the letter l |
$ | here must be the end of the string |
Taken together, this gives us the regular expression [kcC]o?n?tro?l$. This looks weird, but it has the magic power of matching all the control-like strings in our vector. Let us try it:
str_detect(samples, "[kcC]o?n?tro?l$")
[1] TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
str_detect(samples, "[kcC]o?n?tro?l$")] samples[
[1] "ko_1_control" "ko_2_ctrl" "wt_1_Control" "wt_2_kontrol"
This works, but… what does that all mean?
3.5.2 How regexp works
A regular expression is a sequence of characters – a string. However, some of these characters have a special meaning. Here is a short table; we will discuss these in more detail in a moment:
Character | Meaning |
---|---|
a | the literal character a |
. | any character |
[abc] | one of the characters a, b or c |
[^abc] | any character except a, b or c |
[a-z] | any character from a to z |
[0-9] | any digit |
These are the so-called “atoms”. They stand for something that we want to match. In the simplest case (like the a above), they stand for themselves. Also, did you notice the square brackets? They have nothing to do with how we use square brackets in R.
In addition, we can specify how many times we want to match the atom:

Quantifier | Meaning |
---|---|
? | zero or one |
* | zero or more |
+ | at least one |
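A minimal sketch of the three quantifiers in action (the vectors and patterns are our own examples, assuming the tidyverse is loaded):

# "?" – the u is optional, so both spellings match
str_detect(c("color", "colour"), "colou?r")
# [1] TRUE TRUE

# "*" – zero or more characters between a and b
str_detect(c("ab", "axxb"), "a.*b")
# [1] TRUE TRUE

# "+" – at least one digit required
str_detect(c("sample1", "sample"), "sample[0-9]+")
# [1]  TRUE FALSE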
Then, we can “anchor” the pattern in our string:

Anchor | Meaning |
---|---|
^ | start of the string |
$ | end of the string |
Let us put that together. Say we have the following strings:
<- c("a", "ab", "ac", "ad", "bc", "abc", "bac", "cab") strings
To match those which contain an a, we can use the regular expression a:

strings[str_detect(strings, "a")]
[1] "a" "ab" "ac" "ad" "abc" "bac" "cab"
To match those which start with an a, we can use the regular expression ^a:

strings[str_detect(strings, "^a")]
[1] "a" "ab" "ac" "ad" "abc"
To find those which start with an a followed by b or c, but not d, we can use the square brackets:

strings[str_detect(strings, "^a[bc]")]
[1] "ab" "ac" "abc"
OK, that was a lot. Take some time to digest it. Regular expressions are what you call in German gewöhnungsbedürftig – they require some getting used to, but they are worth it.
3.5.3 Usage of regular expressions in R
Most commonly you will use regular expressions in R for two things:
- Searching for something in a character vector (like we did above)
- Replacing something in a character vector, for example using str_replace_all

What we did not tell you when we introduced str_replace_all is that its second argument is actually a regular expression. Therefore, we can use it to find a pattern and replace it with something else. For example, we can unify the following messed-up vector denoting the gender of patients:
<- c("m", "f", "m", "w", "frau",
gender "female", "other", "male",
"männlich", "x", "weiblich")
German and English are mixed here, but we can see a pattern: if the string starts with m, it is male; if it starts with either f or w, it is female. All remaining cases are “other”. We can clean up this mess with just three lines of code:
gender <- str_replace(gender, "^m.*", "male")
gender <- str_replace(gender, "^[fw].*", "female")
gender <- str_replace(gender, "^[^mfw].*", "other")
Note that the ^ character has two different meanings on line 3 above. As the first character of a regular expression, it anchors the expression to the beginning of the string. However, inside the square brackets, it negates the selection. So, [^mfw] means “any character except m, f or w”.
The .* idiom is a common one in regular expressions. It means “zero or more of any character”. So, ^m.* matches both the string “m” and the string “male”. And because we used ^, it will only match strings where m is the first letter (so it does not match “female”).
But wait, if some characters have a special meaning, how can we replace them? For example, how can we replace a dot in a string? The following will not work as intended:
<- c("5.6", "5.7", "5.8")
vec str_replace_all(vec, ".", ",")
[1] ",,," ",,," ",,,"
Since the character . means “any character”, every character is replaced by a comma in the example above. In order to search for or replace special characters, we must escape them – which in R means putting two backslashes6 in front of them in the regular expression.
6 In R, it is two backslashes. In other programming languages, it is usually a single backslash.
<- c("5.6", "5.7", "5.8")
vec str_replace_all(vec, "\\.", ",")
[1] "5,6" "5,7" "5,8"
3.5.4 Correcting columns in iris_data with regular expressions
Let us now turn to another column in the data frame, the petal lengths. Using the approach we have just learned, we can find the problematic values:
petal_length <- iris_data$petal_length
problems <- is.na(as.numeric(petal_length))
Warning: NAs introduced by coercion
which(problems)
[1] 12 13 14 115 149
petal_length[problems]
[1] "1.6!" "1 .4" "1.1?" "5k.1" "5. 4"
We could use, for example, str_replace_all(petal_length, "k", "") to change 5k.1 to 5.1, but that would not be a general solution. What if it is a j in the new data? We should actually remove everything that is not a number and not a decimal dot. To this end, we will use regular expressions.
We need to remove anything but digits and decimal dots. We can use the square brackets for that:
petal_length <- str_replace_all(petal_length, "[^0-9.]", "")
The 0-9 means any digit between 0 and 9, and the . is a literal dot. As you can see, you don’t have to escape the dot if it is inside the square brackets. The ^ inside the square brackets negates the selection, so [^0-9.] means exactly what we wanted: anything that is not a digit and not a dot.
# check the problems
petal_length[problems]
[1] "1.6" "1.4" "1.1" "5.1" "5.4"
# convert to numbers
petal_length <- as.numeric(petal_length)
petal_length
# check for remaining NA's
any(is.na(petal_length))
[1] FALSE
# assign
iris_data$petal_length <- as.numeric(petal_length)
Done! The iris data set is now clean.
skim(iris_data)
Name | iris_data |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
species | 0 | 1 | 6 | 10 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
sepal_length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
sepal_width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |
petal_length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
petal_width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
3.5.5 Just a quick look
The newly cleaned-up data set begs for a quick look. We already created a simple plot using the plot() function; today I will show you how to use the ggplot2 package to create somewhat nicer plots. In this case, it is rather trivial.
library(ggplot2)
ggplot(iris_data, aes(x=sepal_length, y=sepal_width, color=species)) +
geom_point()
As you can see, the ggplot function takes two arguments: the data frame, and something called aesthetics, which is produced by the aes() function. The aes() function serves as a kind of link between the data and the plot, defining what should be plotted and how. In this case, we tell aes() that the x-axis should be the sepal length, the y-axis the sepal width, and that the color should be determined by the species.
Then, a magical thing happens. Whatever is produced by ggplot() is not a number, but it can nonetheless be added to other things with a + sign. In this case, we add the geom_point() function, which tells ggplot to plot points at the x, y coordinates (there are many other geoms, and you will learn about them the day after tomorrow).
There, all done. See how I. setosa is different from the other two?
3.6 Writing data
3.6.1 Keep your data organized
Writing data is way easier than reading it. However, you always have to come up with a name for the file, and I would like you to follow a few points.
- Never overwrite the original file (we will discuss this in more detail shortly).
- Create an output directory and save your files only there; do not mix the original and the output files.
- Use a meaningful name: not output, but iris_cleaned. No spaces or special characters (use an underscore _ instead).
- Always version your files: add a date, your initials and a version number, e.g. iris_cleaned_2021-09-01_JW_v1.csv.
- Do not use the old Excel XLS format (file extension .xls). Use the newer XLSX format (file extension .xlsx).
- When using Excel to enter data, read this first.
The point about XLS warrants an explanation. There are two main problems with that format. Firstly, it is less portable than XLSX (the new one) – many programs can read XLSX, but not XLS (or, which is worse, they can, but read it incorrectly). Secondly, the number of rows and columns in an XLS file is severely limited (65536 rows and 256 columns). These limits are easily reached in modern bioinformatic datasets.
3.6.2 Functions to use for writing data
Functions: write_tsv(), write_csv(), write_xlsx()

Data type | Function | Package | Notes |
---|---|---|---|
TSV / TAB separated values | write_tsv() | readr | No row names! |
CSV / comma separated values | write_csv() | readr | No row names! |
XLS (old Excel) | Just don’t use it. No, seriously, don’t. | | |
XLSX (new Excel) | write_xlsx() | writexl | No row names! |
Just as in the case of reading data, there are several functions to write data. Base R also has functions that can be used for writing (e.g. write.csv), but they are not recommended. We advise you to use only the readr package for writing TSV and CSV files, and the writexl package for writing XLSX files.
However, keep in mind that row names are not exported by these packages. That is why we do not use row names in this course.
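Putting the naming rules and the writing functions together, a minimal sketch might look like this (the “Output” directory and the file names are our own choices):

library(writexl)

# create the output directory if it does not exist yet
dir.create("Output", showWarnings = FALSE)

# versioned file names: meaningful name, date, initials, version
write_csv(iris_data, "Output/iris_cleaned_2021-09-01_JW_v1.csv")
write_xlsx(iris_data, "Output/iris_cleaned_2021-09-01_JW_v1.xlsx")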
3.7 Principles of data management
3.7.1 Keeping originals
As every day, we end the day with some food for thought. Today, we would like to introduce a few important principles that you should follow in your everyday work with data – not only in R, but in general.
The first rule is: always keep the originals. Whether you received the data from someone, extracted it from a database such as REDCap, or created it yourself, once you start working with R you should freeze your data, version it and not touch it anymore.
Let me reiterate: do not touch the original data. If you absolutely have to edit it by hand, make a copy and add a suffix like _edited or your initials to the file name. However, it is better to perform all editing operations in R.
Versioning. Always include a date, and possibly your initials and version in the file name. If you ever submitted a manuscript of a scientific paper, you know how quickly versions can accumulate. It is absolutely crucial to know which version of the data was used for which analysis.
Data editing. Every edit operation on your data should be somehow recorded. This is not different from a lab notebook, where you write down every step of your experiment. When working with a spreadsheet this is hard to do, therefore…
3.7.2 Use R for data editing
Optimally, your mode of operation should never involve editing data directly in a spreadsheet. You should read your data in R and perform all operations here, recording them in a script. This has two huge advantages:
- Reproducibility. You and others will be able to trace back all your operations. There will be no doubt whether and which data were edited and how.
- Automation. When you get updated data (and you will), for example when new samples arrived, you will be able to re-run your script (maybe with a few changes) and get the updated results with minimal effort. Otherwise, you will have to repeat every. Single. Edit. Again.
Consider the example of the iris.csv file. We have seen that one of the values of sepal width seems to be missing a decimal dot. You could edit that in the Excel or CSV file directly. You might even make a note of it somewhere. However, chances are that you were wrong – maybe that flower really had a sepal width of 36 after all. This change influences your data and your analysis results, and it must be documented if you are doing science. If you did it in R, you probably entered a comment in your script, something like this:
# There seems to be a missing decimal dot in some of the sepal widths
# Changing it to 1/10th of the value
sepal_width <- iris_data[["Sepal Width"]]
sel <- which(sepal_width > 10)
sepal_width[sel] <- sepal_width[sel] / 10
iris_data[["Sepal Width"]] <- sepal_width
This concisely informs the reader that a change was made, how and why.
And what happens when Ronald Fisher rises from his grave and collects another 150 flowers? You will get an updated data set with 300 flowers and import it again. In the best-case scenario, you will find the problem again (or remember it). Then you will have to go into Excel and make the same edit again. And all the other edits as well. If you did it in R, you simply run your script again.
3.7.3 Working with Excel and other spreadsheets
We all use Excel, LibreOffice Calc or Google Sheets. They are great tools, no doubt about that. However, by now you can see that certain operations make working with R harder, and some can even mess up your data.
If possible, you should actually avoid working with Excel when working with scientific data. Excel does certain operations automatically (depending on the settings), so it can change your data without leaving a trace of the changes. This is the precise opposite of what you want in science.
Excel can automatically replace strings with dates, like changing MARCH9 into 9th of March. However, MARCH9 is a valid gene name. In fact, it turns out that a substantial proportion (about a third!) of all Excel files containing gene names and published as supplementary materials online contain gene names transformed into dates. Not only that: even though this was discovered many years ago, and some genes were even officially renamed because of Excel, it is still a problem. And Excel is now able to recognize dates in many languages, exacerbating the problem (Abeysooriya et al. 2021).
However, sometimes it is unavoidable to work with Excel.
Don’ts:
- Using color and font to encode information, for example marking different treatments with different background colors or formatting controls in bold. While it is possible to read this information in R, it is not trivial. In addition, even outside of R, cell formatting can be lost when the data is passed around, and this would mean that the information is lost.
- Comments in the same cells as values, for example 50 (measured twice). This prevents the column from being interpreted as numbers and has to be dealt with in R. Create a separate column for comments, even if it is just a single comment in one row.
- Meta-data information in the header. Very often, column names contain additional information, for example units, dates, encoding (like “0 for male, 1 for female”). This makes it hard both to import the data and to actually use that information. Create a separate sheet in your spreadsheet for meta-data, with one row per column name.
- Adding header and tail information. Just like in the example with the deaths.xls file, additional lines in spreadsheets require you to carefully select what you wish to import. Avoid it if you can; use another sheet.
- More than one table on a spreadsheet. A spreadsheet should contain a single table. Otherwise, just like in the case of header and tail information, you will have to spend time cautiously selecting the areas that you want to import.
- Merging cells. Merged cells are a nightmare to work with in R. If you have to, unmerge them before importing the data.
Do’s:
- Use a single table per sheet. If you have more than one table, use more than one sheet.
- Use a single header row. If you have more than one header row, create another sheet with meta-data on the column names.
- Switch off automatic conversion. Excel can automatically change your data in many ways. For example, it can change gene names to dates. It can also change the decimal separator, which can mess up your data. Newer versions of Excel allow you to switch off this behaviour, so do it (here are the instructions).
- Control your import. When importing data from CSV or TSV files, Excel allows you to control which fields are imported as text, which as dates etc. Use this feature to make sure that your data is imported correctly.
3.8 Review
Things that you learned today:
- Reading and writing
  - read_csv, read_tsv, read_delim, read_xls, read_xlsx
  - reading a file from the working directory
  - reading a file from a different directory using relative paths
- Paths
  - getting the current working directory with getwd()
  - relative and absolute paths
- Diagnosing datasets
  - data types: str, class, typeof
  - summaries: summary, glimpse
  - categorical data: unique, table
  - skim() from the skimr package
  - diagnosis checklist
- Correcting datasets
  - converting character vectors to numbers with as.numeric()
  - checking a vector with is.numeric()
  - checking a logical vector with any()
  - replacing values with str_replace_all()
- Principles of data management
  - keeping originals
  - versioning
  - data editing
  - working with Excel
- Other
  - using help pages to find out about functions and their parameters