3  Reading and Writing Files

3.1 Aims for today

  • Reading data
  • Cleaning data
  • Regular expressions
  • Data management

Today is a special day. If there is one thing that I would like you to learn from this course, it is how to read and write data.

3.2 Reading data

3.2.1 Data types

In your work you will encounter many different types of data. Most frequently, you will work with tabular data, either as Excel files or as comma or tab separated values (CSV and TSV files, respectively). I am sure you have worked with such files before.

To read these files, we will use two packages: readr and readxl. The former, readr, is part of the tidyverse package, so when you load the tidyverse using library(tidyverse), readr is loaded as well. The latter, readxl, is a separate package that you need to install and load separately.

readr and readxl packages

There are also “base R” functions: read.table, read.csv and read.delim (there is no function for reading XLS[X] files in base R). These are always available when you start R, but don’t use them. The tidyverse functions are not only faster, but also much better behaved and, most importantly, safer – it is less likely that you will mess up your data with them.

read.table(), read.csv(), read.delim()

tidyverse functions return tibbles, which, as you remember from yesterday, are a special flavor of data frames. Just to refresh your memory, here are key facts about tibbles:

tibbles
  • in most of the cases, they behave exactly like data frames
  • when you print them, they are nicer
  • tibbles have no row names
  • when you select columns using [ , sel ], you always get a tibble, even if you selected only one column (see the demonstration below)
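To see the last point in action, here is a minimal sketch comparing a plain data frame with a tibble, using the built-in iris dataset:

library(tidyverse)

iris_df  <- iris               # a plain data frame
iris_tbl <- as_tibble(iris)    # the same data as a tibble

class(iris_df[ , "Sepal.Length"])   # drops to a plain numeric vector
class(iris_tbl[ , "Sepal.Length"])  # still a tibble (with one column)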

read_table(), read_tsv(), read_csv(), read_delim(), read_xls(), read_xlsx(), read_excel()
Data type                    | Function                  | Package         | Notes
Columns separated by spaces  | read_table()              | readr/tidyverse | One or more spaces separate each column.
TSV / tab separated values   | read_tsv()                | readr/tidyverse | Delimiter is tab (\t).
CSV / comma separated values | read_csv()                | readr/tidyverse | Delimiter is a comma.
Any delimiter                | read_delim()              | readr/tidyverse | Customizable delimiter.
XLS (old Excel)              | read_xls(), read_excel()  | readxl          | Avoid using XLS files.
XLSX (new Excel)             | read_xlsx(), read_excel() | readxl          | You can specify the sheet you wish to read (the first sheet is read by default). Note: returns a tibble, not a data frame!

As you can see, the functions we recommend become available by loading the packages tidyverse and readxl. If you haven’t done that yet, please install these packages now:

install.packages("tidyverse")
install.packages("readxl")

However, before we start using these functions, we need to dive into a very important problem: where are your files?

Remember!
  • For reading text files (csv, tsv etc.), use the readr package. This package is loaded automatically when you load the tidyverse package: library(tidyverse). Then, use the functions read_csv, read_tsv etc.
  • For reading Excel files, use the readxl package: library(readxl). Then, use the function read_excel.

3.2.2 Where is my file? Relative and absolute paths

We are coming now to a crucial problem and a source of endless frustration for beginning R programmers: how to tell R where your file is. Fortunately, there are many ways to deal with that, both with R and RStudio. Still, this is a key problem and we would like you to spend some time on the following chapter.

In order to access a file, a program (any program) needs what is known as a path. A path is a character string that tells the program how to get to the file and looks, for example, like this on a Windows computer: C:/Users/johndoe/Documents/RProjects/MyFirstRProject/data.csv, and like this on a Mac: /Users/johndoe/Documents/RProjects/MyFirstRProject/data.csv.

Path
Path separators on different systems

Most computer systems separate the directories and files in a path using a slash (/). However, Windows uses a backslash (\). This is quite annoying for R users, because in R character strings, the backslash has a special meaning. To use backslashes, you need to “escape” them by putting another backslash in front of each one. So instead of C:\Users\johndoe\Documents, you need to write C:\\Users\\johndoe\\Documents. Alternatively, you can use the forward slash even on a Windows system: C:/Users/johndoe/Documents. We recommend the latter approach.
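For example, both of the following spellings point to the same (hypothetical) file on a Windows system:

# escaped backslashes – works, but tedious
path1 <- "C:\\Users\\johndoe\\Documents\\data.csv"

# forward slashes – also works on Windows, and is what we recommend
path2 <- "C:/Users/johndoe/Documents/data.csv"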

This is also why simply copying a path from the Windows Explorer to your R script will not work in R – because the copied text contains single backslashes.

If R cannot find your file, it will return an error message. At first, you will be seeing this error a lot:

Error: file not found
library(tidyverse)
mydata <- read_csv("data.csv")
Error: 'data.csv' not found in the current working directory
('C:/Users/johndoe/Documents/RProjects/MyFirstRProject/').

Before we proceed, you need to understand one important thing. When you start your RStudio and R session, the R session runs in a specific directory. This is called the working directory. You can check what it is using the getwd() function1:

getwd()

1 You can change it using the setwd() function, but avoid doing that. This path leads to madness, trust me on that.

getwd()
[1] "C:/Users/johndoe/Documents/RProjects/MyFirstRProject"

Of course, the result will be different on your computer. By the way, the above is called an absolute path. That means that it works no matter where you are in your system, because a program can always find the file or directory using this path.

The easiest way to read the data is this: copy your data files to the directory returned by getwd().

Exercise 3.1 (Reading your first file)  

  • Check your working directory using getwd()
  • Load the tidyverse package using library(tidyverse)
  • Go to the URL https://github.com/bihealth/RCrashcourse-book/Datasets
  • Click on “iris.csv”
  • Click on the “Download raw file” button on the right side of the screen
  • Save the file in the directory returned by getwd()
  • Read the file using read_csv("iris.csv")

The following code should now work without an error:

library(tidyverse)
iris_data <- read_csv("iris.csv")

Now the contents of the file are stored in the iris_data object. There are many ways to have a look at it:

  • type iris_data or print(iris_data) in the console
  • type View(iris_data) in the console
  • click on the little spreadsheet icon next to the iris_data object in the Environment tab in RStudio (upper right panel)

Please make sure that the above works for you. If it does not, read the instructions again. In RStudio, there is a “Files” tab in the lower right panel. You should see your working directory as well as the “iris.csv” file there.

3.2.3 Reading with relative paths

OK, so far, so good. That was easy. Now comes a slightly harder part.

Saving your data files in the working directory works well if you only have one or two files. However, the more files you read and write, the more cluttered your project directory becomes. You will soon find yourself in a situation where you have no idea which file is which.

It is generally a good idea to keep your data files in a separate directory, or even multiple directories.

Exercise 3.2 (Reading your first file from a data directory)  

  • In the working directory, create a new directory called “Datasets”
  • Move the “iris.csv” file to the “Datasets” directory
  • Read the file using read_csv("Datasets/iris.csv")

The path “Datasets/iris.csv” is called a relative path, because it is relative to the working directory and will not work from another location. So why should we use relative paths? Wouldn’t it be easier to use absolute paths all the time, for example read_csv("C:/Users/johndoe/Documents/RProjects/MyFirstRProject/Datasets/iris.csv")?

Relative path

Actually, no. The problem is that if you move your R project to another location, or if you share it with someone else, the absolute path will no longer work. In other words, the absolute path is not portable.

Remember!

Do not use absolute paths in your code. Always use relative paths.

3.2.4 More on relative paths

Sometimes the data files are not in your R project directory. For example, suppose you are writing your PhD thesis. You have created a directory called “PhD”, which contains the directories “manuscript”, “data”, “images”, “R_project” and so on. You use R for a part of the calculations, but you want to keep the data files in the “data” directory. How do you read them?

When you type getwd(), you will get the path to the “R_project” directory, something like C:/Users/johndoe/Documents/PhD/R_project. The data files are in the directory C:/Users/johndoe/Documents/PhD/data. To get the relative path from the R project directory to the data directory, think about how you would navigate from one directory to another in the Windows Explorer or Finder. You need to go up one level, to get into “PhD”, and then down again to “data”.

Getting “up one level” in a path is done with .. (two dots). So the relative path to the file “iris.csv” in your “data” directory is ../data/iris.csv.

Up one level with ..
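As a minimal sketch (assuming the “PhD” directory layout described above), reading the file then looks like this. The base R helper file.path() glues path components together, which helps to avoid typos in longer paths:

library(tidyverse)
mydata <- read_csv("../data/iris.csv")

# equivalent, constructing the path with file.path()
mydata <- read_csv(file.path("..", "data", "iris.csv"))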

Exercise 3.3 (Reading a file from a data directory using relative paths)  

  • In the directory that contains the working directory, create a directory called “Data”. That is, if your working directory is C:/Users/johndoe/Documents/PhD/R_project, create the directory C:/Users/johndoe/Documents/PhD/Data
  • Move the “iris.csv” file to the new “Data” directory
  • Read the file using read_csv("../Data/iris.csv")

But what about the case when your data directory is at a completely different location? For example, on a different drive, or maybe on your desktop?

First, I don’t recommend keeping your data files separately from the R project directory. In general, try to put everything in one place, as part of one structure. This structure can be complex, but it should be coherent. If necessary, copy the data files into your R project directory.

However, sometimes it simply isn’t possible. Maybe the files are huge and you need to read them from a special drive. In this case, there are three options.

Using absolute paths. Yes, I told you not to use absolute paths, but sometimes you have no choice.

Create shortcuts. On all systems it is possible to create shortcuts to your data directories (on Unix-like systems such as macOS they are called “symbolic links”). You can put these shortcuts in your R project directory – R will treat them as normal directories.

Create a complex relative path. Depending on how far “away” your data directory is, you can chain several ..’s. For example, if your R project directory is C:/Users/johndoe/Documents/PhD/R_project and your data directory is C:/Users/Data, you can use the path ../../../../Data/iris.csv. Note that a relative path cannot cross drive letters on Windows, so if the data sits on another drive (say, D:), you need one of the other two options. Unfortunately, despite being a relative path, this is not very portable (and it is easy to lose count of the ..’s).

3.2.5 Using RStudio to import files

RStudio has a very nice feature that allows you to import files using a point-and-click interface. When you click on a data file in the “Files” tab (lower right panel), you will see two options: “View File” and “Import Dataset”. Choosing the latter opens a dialog window which shows a preview of the file and allows you to construct your read_csv, read_tsv or another command using a point-and-click interface. You also see a preview of the expression that will be executed.

This feature is particularly useful when you are not sure about the format of the data to import, e.g. what type of delimiter is used, how many lines should be skipped etc.

Then you can click on “Import” to actually run the command; however, I would recommend another approach. Clicking on “Import” runs the command directly in the console, bypassing your script – and you should always enter the code in your script before executing it.

Rather than clicking on “Import”, click on the little Copy icon next to the “Code preview” field, and then cancel the dialog. Paste the copied code into your script, and then run it.

There is one more thing to modify. The code generated by RStudio often uses absolute paths. Try to modify it to use relative paths to ensure portability of your script.

Exercise 3.4 (Reading data)  

  • Go to the URL https://github.com/bihealth/RCrashcourse-book/Datasets
  • Download the following files:
    • iris.csv
    • meta_data_botched.xlsx
    • transcriptomics_results.csv
  • Alternatively, you can download the whole repository as a ZIP file and unpack it.
  • Save the files in the Datasets/ directory in your working directory – or another location of your choice. From now on, I will assume that this is the location of your data files.
  • Read the files using the appropriate functions. Consult the table above for the correct function names, or use the RStudio data import feature. Make sure that you are using relative paths.

3.2.6 Reading Excel files

Reading files sometimes requires diligence. This is especially true for Excel files – they can contain multiple sheets, tables often do not start on the first row etc.

The package readxl (which you hopefully successfully used to read the XLSX file in the previous exercise) contains several example files. They have been installed on your system when you installed the package. Manually finding these example files is annoying, and that is why the readxl package provides a convenience function, readxl_example(), that returns the absolute path to the file (yes, I know what I said about absolute paths; this is an exception).

Reading excel files with readxl package

readxl_example(), read_excel()
library(readxl)
fn <- readxl_example("deaths.xls")
print(fn)
[1] "/home/january/R/x86_64-pc-linux-gnu-library/4.4/readxl/extdata/deaths.xls"
deaths <- read_excel(fn)
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
head(deaths)
# A tibble: 6 × 6
  `Lots of people`             ...2       ...3  ...4     ...5          ...6     
  <chr>                        <chr>      <chr> <chr>    <chr>         <chr>    
1 simply cannot resist writing <NA>       <NA>  <NA>     <NA>          some not…
2 at                           the        top   <NA>     of            their sp…
3 or                           merging    <NA>  <NA>     <NA>          cells    
4 Name                         Profession Age   Has kids Date of birth Date of …
5 David Bowie                  musician   69    TRUE     17175         42379    
6 Carrie Fisher                actor      60    TRUE     20749         42731    

If you look at the imported data using the head() function (or if you use the RStudio data import feature), you will notice that the actual data in this file starts on line 5; the first 4 lines contain the text “Lots of people simply cannot resist writing some notes at the top of their spreadsheets or merging cells”. To skip these lines, we need to use an argument to the read_excel function. If you look up the help file for the function (e.g. using ?read_excel), you will find the following fragment:

head()

skipping lines with read_excel(skip=...)
skip    Minimum number of rows to skip before reading anything,
        be it column names or data. Leading empty rows are 
        automatically skipped, so this is a lower bound. Ignored
        if range is given.

Therefore, we can modify the code above to skip the first four lines:

deaths <- read_excel(fn, skip=4)
head(deaths)
# A tibble: 6 × 6
  Name       Profession Age   `Has kids` `Date of birth`         `Date of death`
  <chr>      <chr>      <chr> <chr>      <dttm>                  <chr>          
1 David Bow… musician   69    TRUE       1947-01-08 00:00:00.000 42379          
2 Carrie Fi… actor      60    TRUE       1956-10-21 00:00:00.000 42731          
3 Chuck Ber… musician   90    TRUE       1926-10-18 00:00:00.000 42812          
4 Bill Paxt… actor      61    TRUE       1955-05-17 00:00:00.000 42791          
5 Prince     musician   57    TRUE       1958-06-07 00:00:00.000 42481          
6 Alan Rick… actor      69    FALSE      1946-02-21 00:00:00.000 42383          

Exercise 3.5 (Reading data with options) If you take a closer look at the file deaths.xls, you will notice that there is some extra text at the bottom of the file as well. How can you omit that part when reading? Hint: there are two ways to do that. If in doubt, look up the “examples” section of the readxl help file.

(Solution)

3.3 Diagnosing and cleaning data

3.3.1 Diagnosing datasets

Congratulations! You are now proficient in reading data into R. Believe it or not, this alone is an important step forward – many who try their hand at R get stuck at this point.

However, reading data is only the first step. In most of the cases, the data requires some treatment: cleaning, transformation, filtering etc. In many projects that I have worked on as a bioinformatician, data import, diagnosis and cleaning took up the majority of time spent on the project. Unfortunately, this is necessary before any real fun with the data can start.

Let us examine the file iris.csv that you have just read. The dataset comes from a famous paper by Ronald Fisher, who used it to demonstrate his newly developed method called linear discriminant analysis – an early machine learning algorithm, if you will. The dataset contains measurements (in cm) of 150 flowers of three species of irises: Iris setosa, Iris versicolor and Iris virginica. The measurements are the sepal and petal length and width, four measurements in total. Each row consists of these four measurements, plus a column which contains the species name.

The iris dataset

I have doctored the file to mimic typical problems with imported data, especially in clinical trial settings. Humans who enter data make errors; this is normal and expected, all of us do. Before we analyse the data, we need to correct the errors. Before we correct them, we need to find them.

First, note that when you read the data using read_csv, the function conveniently shows what types of data were assigned to each column:

iris_data <- read_csv("Datasets/iris.csv")
Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Sepal Length, Petal?Length, Species
dbl (2): Sepal Width, Petal.Width

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

This should already ring some alarm bells. First, it looks like the columns are not consistently named: we have Sepal Length and Petal?Length, and also Petal.Width.

Secondly, we would expect that the measurements are imported as numbers. However, both sepal and petal lengths are imported as characters. We can confirm this by using the class function:

class()
class(iris_data[["Sepal Length"]])
[1] "character"
class(iris_data[["Sepal Width"]])
[1] "numeric"
class(iris_data[["Petal?Length"]])
[1] "character"
class(iris_data[["Petal.Width"]])
[1] "numeric"
class and typeof

The class function returns the class of an object, which is a higher-level classification of the object. An object can have multiple classes. The typeof function returns the internal storage type of the object, which is a lower-level classification. For example, both tibbles and data frames have the type list, but their classes are different. Another example: if mtx is a matrix of numbers, typeof(mtx) is double, and class(mtx) is matrix (and, since R 4.0, also array).
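A minimal sketch of the difference:

mtx <- matrix(rnorm(4), nrow=2)
typeof(mtx)   # "double" – the internal storage type
class(mtx)    # "matrix" "array" – the higher-level classification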

Note that instead of using iris_data$Sepal Length (which will not work, because there is a space in the column name), we use the double bracket notation. An alternative would be to use backticks: iris_data$`Sepal Length`. This is one reason why we want to avoid spaces and special characters in column names (in a moment we will show you how to standardize column names). If you use tab-completion in RStudio, the backticks will be inserted automatically.
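To summarize, here are the ways of accessing a column whose name contains a space:

iris_data[["Sepal Length"]]    # double brackets – returns a vector
iris_data$`Sepal Length`       # dollar sign with backticks – returns a vector
iris_data[ , "Sepal Length"]   # single brackets – returns a one-column tibble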

We can also use the summary function on the whole dataset:

summary()
summary(iris_data)
 Sepal Length        Sepal Width     Petal?Length        Petal.Width   
 Length:150         Min.   : 2.000   Length:150         Min.   :0.100  
 Class :character   1st Qu.: 2.800   Class :character   1st Qu.:0.300  
 Mode  :character   Median : 3.000   Mode  :character   Median :1.300  
                    Mean   : 3.411                      Mean   :1.199  
                    3rd Qu.: 3.375                      3rd Qu.:1.800  
                    Max.   :36.000                      Max.   :2.500  
   Species         
 Length:150        
 Class :character  
 Mode  :character  
                   
                   
                   

Oops, another problem. Did you notice the maximum value for sepal widths? It is 36 cm. This cannot be; it is way too large for these flowers. Also, both the mean and median are around 3 cm, so this must be either an outlier or a clerical error.

The summary() functions

Under the hood, there is no single summary() function. Instead, different classes can have different types of summaries. Whenever you produce a result, always try the summary function with it.
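Here is a quick sketch of that behavior – the same summary() call produces very different output depending on the class of its argument:

summary(c(1.5, 2.0, NA))           # numeric vector: quartiles and NA count
summary(factor(c("a", "a", "b")))  # factor: counts per level
summary(lm(dist ~ speed, cars))    # linear model: residuals, coefficients etc.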

Alternatively, we can use the tidyverse glimpse function. Rather than providing a summary, this function uses a terse format to show the data type for each column and the first few values:

glimpse()
glimpse(iris_data)
Rows: 150
Columns: 5
$ `Sepal Length` <chr> "5.1", "4.9", "4.7", "4.6", "5", "5.4", "4.6", "5", "4.…
$ `Sepal Width`  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, …
$ `Petal?Length` <chr> "1.4", "1.4", "1.3", "1.5", "1.4", "1.7", "1.4", "1.5",…
$ Petal.Width    <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, …
$ Species        <chr> "setosa", "setosa", "setosa", "setosa", "setosa", "seto…

Another way to diagnose the dataset is to use the str function. This provides a more detailed summary for each column:

str()
str(iris_data)
spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Sepal Length: chr [1:150] "5.1" "4.9" "4.7" "4.6" ...
 $ Sepal Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal?Length: chr [1:150] "1.4" "1.4" "1.3" "1.5" ...
 $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Sepal Length` = col_character(),
  ..   `Sepal Width` = col_double(),
  ..   `Petal?Length` = col_character(),
  ..   Petal.Width = col_double(),
  ..   Species = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

3.3.2 Using skim() to diagnose datasets

Another option to diagnose the dataset is to use the skim function from the skimr package, which you (hopefully) installed yesterday. The skim function provides a more detailed summary of the dataset, including the number of missing values, the number of unique values and summary statistics. To use it, you first need to load the package:

skim()
library(skimr)

The skim() function shows more than the str function. In addition to the column types, it shows, for example, the number of missing and unique values for character columns, and for numerical columns summary statistics (mean, standard deviation, quantiles) along with a tiny histogram:

skim(iris_data)
Data summary
Name iris_data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
character 3
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Sepal Length 0 1 1 5 0 37 0
Petal?Length 0 1 1 4 0 47 0
Species 0 1 6 10 0 6 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sepal Width 0 1 3.41 3.16 2.0 2.8 3.0 3.38 36.0 ▇▁▁▁▁
Petal.Width 0 1 1.20 0.76 0.1 0.3 1.3 1.80 2.5 ▇▁▇▅▃

Hm, when you look at the output you might notice one more thing: for the last column, “Species”, skim() shows that there are six unique values (take a look at the n_unique column). How can that be? We know that there are only three species in this dataset. We can check this using the unique function:

unique()
unique(iris_data[["Species"]])
[1] "setosa"     "Setosa"     "versicolor" "Versicolor" "virginica" 
[6] "Virginica" 

There it is. It looks like whoever typed the data2 sometimes used a lower-case species designation, and sometimes upper-case. However, for R (and for computers in general), “versicolor” and “Versicolor” are two different things.

2 I hate to say it, it was probably me.

More information can be gained using the table() function:

table()
table(iris_data[["Species"]])

    setosa     Setosa versicolor Versicolor  virginica  Virginica 
        45          5         42          8         46          4 

Here we not only see the unique values, but also their frequency. The table() function is very useful for any categorical data, as you will see later.

Lower and upper case

For computers in general and R in particular, “uppercase” and “Uppercase” are two different things. Variables a and A are different, as are versicolor and Versicolor. This is called case sensitivity.

3.3.3 Checking individual values

We have seen before that some measurement columns were imported as character vectors, while others were correctly imported as numbers. To understand why, we need to have a closer look at the individual columns.

One of the columns that has been imported as a character vector is the “Sepal Length”. We will extract the column as a vector and then try to manually convert it to a number:

as.numeric()
sepal_length <- iris_data[["Sepal Length"]]
as.numeric(sepal_length)
Warning: NAs introduced by coercion
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2  NA 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7  NA 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

Two things to note: first, R issued a warning, “NAs introduced by coercion”. That means that some values could not be converted to numbers and were replaced by the special value NA3. Second, there was no error. R happily allowed you to create a vector of numbers from a character vector, even though there were some problems.

3 See Day 1 for more information on NA values.

When you look at the output, you will easily spot the NA values. You could even figure out at which positions in the vector they occur. However, this is not the way to go – what if you have not 150 values, but 150,000? Instead, use the is.na function:

is.na()
# convert to numbers
sepal_length_n <- as.numeric(sepal_length)
Warning: NAs introduced by coercion
# how many are NA?
sum(is.na(sepal_length_n))
[1] 2
# which are NA?
which(is.na(sepal_length_n))
[1]  70 119
# show the values
sepal_length[is.na(sepal_length_n)]
[1] "5,6"   "> 7.7"

One more problem remains: the suspiciously large values in the sepal width column. We can check which values are greater than 10 with a similar approach:

sepal_width <- iris_data[["Sepal Width"]]

# how many are greater than 10?
sum(sepal_width > 10)
[1] 2
# which are greater than 10?
which(sepal_width > 10)
[1]  42 110
# show the values
sepal_width[sepal_width > 10]
[1] 23 36

Exercise 3.6 (Petal lengths) Repeat the steps above for the Petal length column.

3.3.4 Diagnosing datasets: a checklist

Whenever you read a data set, you should check the following things:

Column names. Are they consistent? Are they easy to type (no spaces, no special characters)? Are they in the correct order? Are they what you expect them to be? Do you understand what each column is?

Data types. Are the columns of the correct type (e.g. numerical columns imported as numbers)?

Categorical variables. Do you see what you expect? Is the number of categories correct? Do you understand what each category is?

Numerical variables. Do the summary statistics make sense? Are there any outliers? Missing values? Were the variables imported incorrectly as characters? Why?

Missing values. Are there any missing values? Why? Are they important? How to deal with them?

Checklist for importing data
  • Column names
  • Data types
  • Categorical variables
  • Numerical variables
  • Missing values

Exercise 3.7 (Botched metadata) Load the file meta_data_botched.xlsx using the readxl package. Diagnose the problems.

(Solution)

3.4 Mending the data

3.4.1 Correcting column names

Column names should contain only letters, numbers and underscores. While all other characters are also allowed (so a column name can be 豚肉 😀 "temp."? (°C)), it is way easier to work with columns with simple, standardized names.

You already know one way of dealing with this – simply assign new names to the columns:

colnames(iris_data) <- c("sepal_length", "sepal_width", "petal_length",
                      "petal_width", "species")
colnames(iris_data)
[1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"     

However, there is a quicker way. The janitor package contains a function called clean_names that does exactly that, automatically and for all columns. Install it with install.packages("janitor") if you haven’t installed it yet.

janitor packageclean_names()
library(janitor)
iris_data <- clean_names(iris_data)
colnames(iris_data)
[1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"     

My advice is to use clean_names on every dataset you import.

3.4.2 Correcting outliers

In the column with sepal widths, which now has the name sepal_width, we found two values that probably lack the decimal point: 23 and 36. They are at positions 42 and 110, respectively.

It is tempting to simply substitute the data in the original data frame:

# works, but don't do it
iris_data$sepal_width[42] <- 2.3

Unfortunately, this is not a good idea, because this solution is a bit like manual editing in a spreadsheet. There is a better way. First, generalize the issue. What seems to be the problem? The values that are larger than 10 are probably missing a decimal point. The simplest solution is to divide them by 10. Using logical vectors, we can do that for all the values that are larger than 10.

# numbers that are missing a decimal point
too_large <- iris_data$sepal_width > 10

iris_data$sepal_width[too_large] <- iris_data$sepal_width[too_large] / 10

This is a bit hard to read, so let us make it more explicit:

# numbers that are missing a decimal point
sepal_width <- iris_data$sepal_width
too_large <- sepal_width > 10

sepal_width[too_large] <- sepal_width[too_large] / 10
sepal_width[sepal_width > 10]
numeric(0)

Looks good – no values greater than 10 remain. Finally, we assign the corrected values back to the data frame:

iris_data$sepal_width <- sepal_width

As the last step, we need to check whether our approach worked for all values in the column. For this, we will use the any() function, which returns TRUE if any of the values in a logical vector is TRUE:

any()
any(iris_data$sepal_width > 10)
[1] FALSE

Exercise 3.8 Can you spot what the potential problem with this approach is? Hint: what would happen if the original value was 1.10? Can you think of a better approach? Is it actually possible to make sure that our guess is correct?

3.4.3 Correcting categorical data

In the column “species”, we found two different spellings for each species. There are many ways to handle this. Obviously, we could manually assign the correct labels, but that would be a lot of work.

A better way would be to somehow automatically convert all the values. In our case, the problem is quite simple: it is sufficient to put all the values in uniform lower case, so that Versicolor becomes versicolor.

We can do that in one line of code thanks to the tolower function:

tolower()
iris_data$species <- tolower(iris_data$species)

There is also a corresponding function, toupper, that converts all the letters to upper case.

toupper()

Remember to check that your changes were successful:

table(iris_data$species)

    setosa versicolor  virginica 
        50         50         50 

This was a simple case, but what if we had more complex typos? For example, “i.versicolor”, “vers.”, “V” etc. In such cases, you would need to use a more sophisticated approach. This approach involves regular expressions, and will also help us to correct the sepal and petal lengths.

3.4.4 Incorrectly imported numeric data

We have two columns which were imported as characters, but should be numbers. Their brand-new names are sepal_length and petal_length. All in all, there are 2 problems in the sepal length column and 5 in the petal length column, as shown in the table below:

Column       | Position | Original | Correct
sepal_length | 70       | 5,6      | 5.6
sepal_length | 119      | > 7.7    | 7.7
petal_length | 12       | 1.6!     | 1.6
petal_length | 13       | 1 .4     | 1.4
petal_length | 14       | 1.1?     | 1.1
petal_length | 115      | 5k.1     | 5.1
petal_length | 149      | 5. 4     | 5.4

The last column in the table above is our target – this is how we would like to correct the data. How can we do that?

Correcting values

Of course, without additional information we can only guess that 5,6 should be 5.6. Maybe there were two measurements, 5 and 6? Maybe 5. 4 should be 5.04, and not 5.4? In a real-world scenario, you would need to consult the author(s) of the data, or check your lab book.

It is tempting to simply substitute the data in the original data frame:

# works, but don't do it
iris_data$sepal_length[70] <- 5.6

As before, that is not a good idea. If you receive another version of the file with added lines before the 70th, the position will change and you will have to correct it manually again. Also, if there are new problems, like yet another typo in the data, you will have to spot them and manually add a line for each correction. Again, we need to generalize the issue.

What is the problem on line 70? A comma instead of a decimal dot – maybe someone who typed it was used to entering numbers in a German version of Excel, where a comma is used instead of a dot. So why not replace all the commas by a dot? This should do the trick not only for line 70, but also for possible future problems.

For that, we will use the function str_replace_all from the tidyverse4. Before we do that, however, we will make a copy of the column. That will allow us to easily see the changes we make, check that everything is OK, and only then replace the column in the data frame. Later on, you will learn more efficient ways of doing this.

str_replace_all()

4 In base R, you can use the gsub function. However, it has a slightly different syntax, which makes it less convenient when you start using pipes tomorrow.

# make a copy
sepal_length <- iris_data$sepal_length

# record the problematic places
problems <- is.na(as.numeric(sepal_length))
Warning: NAs introduced by coercion
sepal_length[problems]
[1] "5,6"   "> 7.7"
# replace the comma 
sepal_length <- str_replace_all(sepal_length, ",", ".")

# what was the result?
sepal_length[problems]
[1] "5.6"   "> 7.7"

It worked! str_replace_all() is a general search-and-replace function. You can always use it with character vectors to replace stuff.

However, we still have the problem with > 7.7. You will find such values quite often in clinical settings, along with the counterpart < 7.7. These may indicate that the measurement was out of range of an instrument. However, we cannot usually treat them as missing values, because they contain some information. Depending on whether or not we want to keep this information, we could either replace them by NAs, or decide to keep the value as is (in this case, change > 7.7 to 7.7).

In the latter case, we can use the str_replace_all function again:

sepal_length <- str_replace_all(sepal_length, "> ", "")
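For completeness, here is a sketch of the other option – turning the out-of-range values into missing values instead of stripping the > sign (you would use this instead of the replacement above; startsWith() is a base R function):

# treat out-of-range values as missing
sepal_length[startsWith(sepal_length, ">")] <- NA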

In any case, the last point that remains is to convert the vector to numbers and assign it to the column holding sepal lengths. We do it in one go and also check if everything is OK:

# finalize
iris_data$sepal_length <- as.numeric(sepal_length)

# check whether the column is numeric
is.numeric(iris_data$sepal_length)
[1] TRUE
# check whether our problems are gone
iris_data$sepal_length[problems]
[1] 5.6 7.7
# check whether there are any NA's
any(is.na(iris_data$sepal_length))
[1] FALSE

The new function, is.numeric(), checks whether the column sepal_length is indeed numeric. Finally, we make sure that no NA’s were produced in the conversion.

as.numeric()

Exercise 3.9 Find the problems in the following vector and correct them:

vec <- c(" 5", "5,6", "5.7", "5. 4", "> 5.0", "6.O")

3.5 Regular expressions

3.5.1 Introduction to regular expressions

Regular expressions, regexps for short, are for finding certain patterns in the data and handling them automatically. They are quite old (going back at least to the 1960s) and have remained virtually the same for many decades. You will find them in almost all programming languages, and while details may differ, the principles are usually the same. The regexps in R are very similar to regular expressions in other programming languages such as Python. As you may guess, they have a different syntax than R – they are something else altogether.

Regular expressions

We will not go in depth into regular expressions here; what follows is just a short primer.

Before we start, let us introduce a new function called str_detect()5. The function is available as soon as you load the tidyverse package. It is used to search for a pattern in a character vector: for every element that matches the pattern it returns TRUE, otherwise FALSE. Say we have a vector (or data frame column) containing sample names, and we want to find all the controls.

str_detect()

5 In base R, there is the grep() function. It is very similar, but has a different syntax. We will stick to str_detect() for now.

samples <- c("ko_1_ctrl", "ko_2_ctrl", "ko_1_treat", "ko_2_treat",
            "wt_1_ctrl", "wt_2_ctrl", "wt_1_treat", "wt_2_treat")
str_detect(samples, "ctrl")
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
samples[str_detect(samples, "ctrl")]
[1] "ko_1_ctrl" "ko_2_ctrl" "wt_1_ctrl" "wt_2_ctrl"

In case someone did not pay attention to lower or upper case, we can tell str_detect to ignore the case:

samples <- c("ko_1_ctrl", "ko_2_CTRL", "ko_1_treat", "ko_2_treat",
            "wt_1_CTRL", "wt_2_ctrl", "wt_1_treat", "wt_2_treat")
# this does not work
str_detect(samples, "ctrl")
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
# but this does
str_detect(samples, regex("ctrl", ignore_case=TRUE))
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE

Here we are looking for the literal string “ctrl”. However, we can use regular expressions to search for more complex patterns. What if the sample names are more haphazard, like ko_1_control, ko_2_kontrol, ko_1_ctrl?

samples <- c("ko_1_control", "ko_2_ctrl", "ko_1_trt", "ko_2_treatment",
            "wt_1_Control", "wt_2_kontrol", "wt_1_Trtmt", "wt_2_treated")

OK, so we got control, ctrl, Control and kontrol. A pattern emerges:

  • first letter is either k or c or C
  • then we might have an o, an n, both (on) or neither
  • next always comes tr
  • then we might have an o
  • then we always have an l
  • then the word ends.

All these “mights” and “ors” and “eithers” and “always” can be encoded in a regular expression:

Pattern | Explanation
[kcC]   | must have one of the letters k, c or C
o?      | zero or one o (i.e., “there might be an o”)
n?      | zero or one n (i.e., “there might be an n”)
tr      | must have literally the string tr
o?      | zero or one o (i.e., “there might be an o”)
l       | must have literally the letter l
$       | here must be the end of the string

Taken together, this gives us the regular expression [kcC]o?n?tro?l$. This looks weird, but has the magic power of matching all the control-like strings in our vector. Let us try it:

str_detect(samples, "[kcC]o?n?tro?l$")
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
samples[str_detect(samples, "[kcC]o?n?tro?l$")]
[1] "ko_1_control" "ko_2_ctrl"    "wt_1_Control" "wt_2_kontrol"

This works, but… what does that all mean?

Exercise 3.10 (Quick question) Look at the table above. Can you think of other patterns that would match this regular expression? Would kotrl work? What about cotro?

3.5.2 How regexp works

A regular expression is a sequence of characters – a string. However, some of these characters have a special meaning. Here is a short table; we will discuss these in more detail in a moment:

regexp rules
Character | Meaning
a         | the literal character a
.         | any character
[abc]     | one of the characters a, b or c
[^abc]    | any character except a, b or c
[a-z]     | any character from a to z
[0-9]     | any digit

These are the so-called “atoms”. They stand for something that we want to match. In the simplest case (like the a above), they stand for themselves. Also, did you notice the square brackets? They have nothing to do with how we use square brackets in R.

In addition we can specify how many times we want to match the atom:

Quantifier | Meaning
?          | zero or one
*          | zero or more
+          | at least one
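For example (a quick sketch using str_detect(), which we introduced above):

strings <- c("ct", "cat", "caat", "caaat")
strings[str_detect(strings, "ca+t")]   # at least one a between c and t
[1] "cat"   "caat"  "caaat"
strings[str_detect(strings, "ca*t")]   # zero or more a's
[1] "ct"    "cat"   "caat"  "caaat"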

Then, we can “anchor” the pattern in our string:

Anchor | Meaning
^      | start of the string
$      | end of the string

Let us put that together. Say we have the following strings:

strings <- c("a", "ab", "ac", "ad", "bc", "abc", "bac", "cab")

To match those which contain an a, we can use the regular expression a:

strings[str_detect(strings, "a")]
[1] "a"   "ab"  "ac"  "ad"  "abc" "bac" "cab"

To match those which start with an a, we can use the regular expression ^a:

strings[str_detect(strings, "^a")]
[1] "a"   "ab"  "ac"  "ad"  "abc"

To find those which start with an a followed by b or c, but not d, we can use the square brackets:

strings[str_detect(strings, "^a[bc]")]
[1] "ab"  "ac"  "abc"

OK, that was a lot. Take some time to digest it. Regular expressions are what you call in German gewöhnungsbedürftig – they require some getting used to, but they are worth it.

3.5.3 Usage of regular expressions in R

Most commonly you will use regular expressions in R for two things:

  • Searching for something in a character vector (like we did above)
  • Replacing something in a character vector, for example using str_replace_all

What we did not tell you when we introduced str_replace_all is that its second argument is actually a regular expression. Therefore, we can use it to find a pattern and replace it with something else. For example, we can unify the following messed-up vector denoting the gender of patients:

gender <- c("m", "f", "m", "w", "frau",
            "female", "other", "male", 
            "männlich", "x", "weiblich")

German and English are mixed here, but we can see a pattern: if the string starts with m, it is male; if it starts with either f or w, it is female. In the remaining cases it is “other”. We can clean up this mess with just three lines of code:

gender <- str_replace(gender, "^m.*", "male")
gender <- str_replace(gender, "^[fw].*", "female")
gender <- str_replace(gender, "^[^mfw].*", "other")

Note that the ^ character has two different meanings on line 3 above. As the first character of a regular expression, it anchors it to the beginning of the string. However, inside the square brackets, it negates the selection. So, [^mfw] means “any character except m, f or w”.

The .* idiom is a common one in regular expressions. It means “zero or more of any character”. So, ^m.* matches both the string “m” and the string “male”. And because we used ^, it will only match the strings where m is the first letter (so it does not match “female”).
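A quick check of that last point:

str_detect(c("m", "male", "female"), "^m.*")
[1]  TRUE  TRUE FALSE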

Exercise 3.11 (Quick question) What would happen if we omitted the ^ at the beginning of the strings above? For example, if we used [^mfw].* instead of ^[^mfw].*? Think first, then try it out.

But wait, if some characters have a special meaning, how can we replace them? For example, how can we replace a dot in a string? The following will not work as intended:

vec <- c("5.6", "5.7", "5.8")
str_replace_all(vec, ".", ",")
[1] ",,," ",,," ",,,"

Since the character . means “any character”, every character will be replaced by a comma in the example above. In order to search or replace special characters, we must escape them – which in R means putting two backslashes6 in front of them in the regular expression.

escaping characters in a regexp

6 In R, it is two backslashes. In other programming languages, it is usually a single backslash.

vec <- c("5.6", "5.7", "5.8")
str_replace_all(vec, "\\.", ",")
[1] "5,6" "5,7" "5,8"

Exercise 3.12  

  • Use str_replace_all() to make the following uniform: c("male", "Male ", "M", "F", "female", " Female")
  • Using str_replace_all() and toupper(), clean up the gene names such that they conform to the HGNC (all capital letters, no spaces, no dashes): c("ankrd22", "ifng", "Nf-kb", " Cxcl 5", "CCL 6.", "ANK.r.d. 12")
  • What regular expression matches all of the ankyrin repeat genes (but not other genes) in the following vector: c("ANKRD22", "ank.rep.d. 12", "ANKRD-33", "ankrd23", "ANKEN", "MAPK", "ifng-1", "ANKA-REP-6")? Ankyrin repeat domain genes are the first 4 in the vector.

3.5.4 Correcting columns in iris_data with regular expressions

Let us now turn to another column in the data frame, the petal lengths. Using the approach we have just learned, we can find the problematic values:

petal_length <- iris_data$petal_length
problems <- is.na(as.numeric(petal_length))
Warning: NAs introduced by coercion
which(problems)
[1]  12  13  14 115 149
petal_length[problems]
[1] "1.6!" "1 .4" "1.1?" "5k.1" "5. 4"

We could use, for example, str_replace_all(petal_length, "k", "") to change 5k.1 to 5.1, but that would not be a general solution. What if it is a j in the new data? We should actually remove everything that is not a number and not a decimal dot. To this end, we will use regular expressions.

We need to remove anything but numbers and decimal dots. We can use the square brackets for that:

petal_length <- str_replace_all(petal_length, "[^0-9.]", "")

The 0-9 means anything between 0 and 9, and the . is a literal dot. As you can see, you don’t have to escape the dot inside the square brackets. The ^ inside the square brackets negates the selection, so [^0-9.] means exactly what we wanted: anything that is not a digit and not a dot.

[^0-9.]
# check the problems
petal_length[problems]
[1] "1.6" "1.4" "1.1" "5.1" "5.4"
# convert to numbers
petal_length <- as.numeric(petal_length)

# check for remaining NA's
any(is.na(petal_length))
[1] FALSE
# assign
iris_data$petal_length <- petal_length

Done! The iris data set is now clean.

skim(iris_data)
Data summary
Name iris_data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1 6 10 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sepal_length 0 1 5.84 0.83 4.3 5.1 5.80 6.4 7.9 ▆▇▇▅▂
sepal_width 0 1 3.06 0.44 2.0 2.8 3.00 3.3 4.4 ▁▆▇▂▁
petal_length 0 1 3.76 1.77 1.0 1.6 4.35 5.1 6.9 ▇▁▆▇▂
petal_width 0 1 1.20 0.76 0.1 0.3 1.30 1.8 2.5 ▇▁▇▅▃

Exercise 3.13 (Correcting metadata) In Exercise 3.7, you have diagnosed the file meta_data_botched.xlsx. Now go ahead and correct the problems you have found.

3.5.5 Just a quick look

The newly cleaned-up dataset begs for a quick look. We already created a simple plot before using the plot() function; today I will show you how to use the ggplot2 package to create somewhat nicer plots. In this case, it is rather trivial.

ggplot2
library(ggplot2)
ggplot(iris_data, aes(x=sepal_length, y=sepal_width, color=species)) +
  geom_point()

As you can see, the ggplot function takes two arguments: the data frame and something which is called the aesthetics, produced by the aes() function. The aes() function serves as a kind of link between the data and the plot, specifying what should be plotted and how. In this case, we tell aes() that the x-axis should be the sepal length, the y-axis the sepal width, and the color should be determined by the species.

Then, a magical thing happens. Whatever is produced by ggplot() is not a number, but nonetheless can be added with a + sign to other stuff. In this case, we add a geom_point() function, which tells ggplot to plot points at the x, y coordinates (there are many others, and you will learn about them the day after tomorrow).

There, all done. See how I. setosa is different from the other two?

3.6 Writing data

3.6.1 Keep your data organized

Writing data is way easier than reading it. However, you always have to come up with a name for the file, and I would like you to follow a few rules.

  • Never overwrite the original file (we will discuss it in more detail shortly).
  • Create an output directory and save your files only there, do not mix the original and the output files.
  • Use a meaningful name, so not output, but iris_cleaned. No spaces or special characters (use an underscore _ instead).
  • Always version your files: add a date, your initials and a version number, e.g. iris_cleaned_2021-09-01_JW_v1.csv (see the sketch below).
  • Do not use the old Excel XLS format (file extension .xls). Use the newer XLSX format (file extension .xlsx).
  • When using Excel to enter data, read this first.
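As a small sketch, you can even let R build such a versioned file name for you (the initials and version number here are, of course, just examples):

fname <- paste0("iris_cleaned_", Sys.Date(), "_JW_v1.csv")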

The point about XLS warrants an explanation. There are two main problems with that format. Firstly, it is less portable than XLSX (the new one) – many programs can read XLSX, but not XLS (or, which is worse, they can, but read it incorrectly). Secondly, the number of rows and columns in an XLS file is severely limited (65536 rows and 256 columns). These limits are easily reached in modern bioinformatic datasets.

3.6.2 Functions to use for writing data

write_tsv(), write_csv(), write_xlsx()
Data type                    | Function     | Package | Notes
TSV / tab separated values   | write_tsv()  | readr   | No row names!
CSV / comma separated values | write_csv()  | readr   | No row names!
XLS (old Excel)              | –            | –       | Just don’t use it. No, seriously, don’t.
XLSX (new Excel)             | write_xlsx() | writexl | No row names!

Just as in the case of reading data, there are several functions to write data. There are also functions in base R that can be used for writing (e.g. write.csv), but they are not recommended. We advise you to only use the readr package for writing TSV and CSV files, and the writexl package for writing XLSX files.

However, keep in mind that row names are not exported with these packages. That is why we do not use row names in this course.
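As a sketch, writing the cleaned iris data could look like this (the directory and file names are examples that follow the naming advice above):

library(writexl)
write_csv(iris_data, "Data_clean/iris_clean_2021-09-01_JW_v1.csv")
write_xlsx(iris_data, "Data_clean/iris_clean_2021-09-01_JW_v1.xlsx")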

Exercise 3.14 (Writing data)  

  • In your project directory, create the directory “Data_clean” (if you want, you can use the R function dir.create() for that).
  • Write the cleaned iris_data to a new file in the “Data_clean” directory. Use the write_csv function from the readr package. Use a meaningful name, version it and use a date.

3.7 Principles of data management

3.7.1 Keeping originals

As every day, we end the day with some food for thought. Today, we would like to introduce a few important principles that you should follow in your everyday work with data – not only in R, but in general.

The first rule is: always keep the originals. Whether you have received the data from someone, extracted it from a database such as REDCap or created it yourself, once you start working with R you should freeze your data, version it and not touch it anymore.

Let me reiterate: do not touch the original data. If you absolutely have to edit it by hand, make a copy and add a suffix like _edited or your initials to the file. However, it is better to perform all editing operations in R.

Versioning. Always include a date, and possibly your initials and a version number, in the file name. If you have ever submitted a manuscript of a scientific paper, you know how quickly versions can accumulate. It is absolutely crucial to know which version of the data was used for which analysis.

Data editing. Every edit operation on your data should be recorded somehow. This is no different from a lab notebook, where you write down every step of your experiment. When working with a spreadsheet this is hard to do, therefore…

3.7.2 Use R for data editing

Optimally, your mode of operation should never involve editing data directly in a spreadsheet. You should read your data into R and perform all operations there, recording them in a script. This has two huge advantages:

  • Reproducibility. You and others will be able to trace back all your operations. There will be no doubt whether and which data were edited and how.
  • Automation. When you get updated data (and you will), for example when new samples arrive, you will be able to re-run your script (maybe with a few changes) and get the updated results with minimal effort. Otherwise, you will have to repeat every. Single. Edit. Again.

Consider the example of the iris.csv file. We have seen that one of the values of sepal width seems to be missing a decimal dot. You could edit that in the Excel or CSV file directly. You might even make a note of that somewhere. However, chances are that you were wrong – maybe that was a flower with a 36 cm sepal width after all. This change influences your data and your analysis results and must be documented if you are doing science. If you did it in R, you probably have entered a comment in your script, something like this:

# There seems to be a missing decimal dot in some of the sepal widths
# Changing it to 1/10th of the value

sepal_width <- iris_data[["Sepal Width"]]
sel <- which(sepal_width > 10)
sepal_width[sel] <- sepal_width[sel] / 10

iris_data[["Sepal Width"]] <- sepal_width

This concisely informs the reader that a change was made, how and why.

And what happens when Ronald Fisher rises from his grave and collects another 150 flowers? You will get an updated dataset with 300 flowers and import it again. In the best-case scenario, you will find the problem again (or remember it). Then you will have to go into Excel and make the same edit again. And all the other edits as well. If you did it in R, you simply run your script again.

undead Ronald Fisher

3.7.3 Working with Excel and other spreadsheets

We all use Excel, LibreOffice Calc or Google Sheets. They are great tools, no doubt about that. However, by now you can see that certain operations make working with R harder, and some can even mess up your data.

How to work with Excel

If possible, you should actually avoid working with Excel when working with scientific data. Excel does certain operations automatically (depending on the settings), so it can change your data without leaving a trace of the changes. This is the precise opposite of what you want in science.

Avoid working with Excel

Excel can automatically replace strings with dates, like changing MARCH9 into 9th of March. However, MARCH9 is a valid gene name. In fact, it turns out that a substantial proportion (about a third!) of all Excel files containing gene names and published as supplementary materials online contain gene names transformed into dates. Not only that, but even though this was discovered many years ago and some genes were even officially renamed because of Excel, this is still a problem. And Excel is now able to recognize dates in many languages, exacerbating the problem (Abeysooriya et al. 2021).

However, sometimes it is unavoidable to work with Excel.

Don’ts:

  • Using color and font to encode information, for example marking different treatments with different background colors or formatting controls in bold. While it is possible to read this information in R, it is not trivial. In addition, even outside of R, the formatting of cells can be lost when the data is passed around, which means that the information is lost.
  • Comments in the same cells as values, for example 50 (measured twice). This prevents the column from being interpreted as numbers and has to be dealt with in R. Create a separate column for comments, even if it holds just a single comment.
  • Meta-data information in the header. Very often, column names contain additional information, for example units, dates or encoding (like “0 for male, 1 for female”). This makes it hard both to import the data and to actually use that information. Create a separate sheet in your spreadsheet for meta-data, with one row per column name.
  • Adding header and tail information. Just like in the example with the deaths.xls file, additional lines in spreadsheets require you to carefully select what you wish to import. Avoid it if you can; use another sheet.
  • More than one table on a spreadsheet. A spreadsheet should contain a single table. Otherwise, just as with header and tail information, you will have to spend time cautiously selecting the areas that you want to import.
  • Merging cells. Merged cells are a nightmare to work with in R. If you have to, unmerge them before importing the data.

Do’s:

  • Use a single table per sheet. If you have more than one table, use more than one sheet.
  • Use a single header row. If you have more than one header row, create another sheet with meta-data on the column names.
  • Switch off automatic conversion. Excel can automatically change your data in many ways. For example, it can change gene names to dates. It can also change the decimal separator, which can mess up your data. Newer versions of Excel allow you to switch off this behaviour, so do it (here are the instructions).
  • Control your import. When importing data from CSV or TSV files, Excel allows you to control which fields are imported as text, which as dates etc. Use this feature to make sure that your data is correctly imported.

3.8 Review

Things that you learned today:

  • Reading and writing
    • read_csv, read_tsv, read_delim, read_xls, read_xlsx
    • reading a file from working directory
    • reading a file from a different directory using relative paths
  • Paths
    • Get the current working directory with getwd()
    • relative and absolute paths
  • Diagnosing datasets
    • data types: str, class, typeof
    • summaries: summary, glimpse
    • categorical data: unique, table
    • skim() from the skimr package
    • diagnosis checklist
  • Correcting datasets
    • converting character vectors to numbers with as.numeric()
    • checking a vector with is.numeric()
    • checking a logical vector with any()
    • replacing values with str_replace_all()
  • Principles of data management
    • keeping originals
    • versioning
    • data editing
    • working with Excel
  • Other
    • using help pages to find out about functions and their parameters