2024-09-19

Aims for today

  • Reading data
  • Cleaning data
  • (Regular expressions)
  • Searching, sorting and selecting

Recap of Day 2

Reading and writing data

Reading data

Main data types you will encounter:

  • Columns separated by spaces: read_table() (readr/tidyverse). One or more spaces separate each column.
  • TSV / tab-separated values: read_tsv() (readr/tidyverse). The delimiter is a tab (\t).
  • CSV / comma-separated values: read_csv() (readr/tidyverse).
  • Any delimiter: read_delim() (readr/tidyverse). Customizable.
  • XLS (old Excel): read_xls() or read_excel() (readxl package). Just don’t use it.
  • XLSX (new Excel): read_xlsx() or read_excel() (readxl package). You need to provide the number of the sheet you wish to read. Note: returns a tibble, not a data frame!

Note: there are also “base R” functions read.table, read.csv and read.delim (there is no function for reading XLS[X] files in base R). The tidyverse functions above are preferable.
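A minimal sketch of how these functions are typically called (the file names below are placeholders):

library(tidyverse)   # provides read_csv(), read_tsv(), read_table(), read_delim()
library(readxl)      # provides read_excel(), read_xls(), read_xlsx()

d1 <- read_csv("measurements.csv")                 # comma-separated values
d2 <- read_tsv("measurements.tsv")                 # tab-separated values
d3 <- read_delim("measurements.txt", delim = ";")  # any delimiter, here a semicolon
d4 <- read_excel("measurements.xlsx", sheet = 1)   # first sheet of an Excel file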

Reading data

  • For reading text files (csv, tsv etc.), use the readr package. This package is loaded automatically when you load the tidyverse package: library(tidyverse). Then, use the functions read_csv, read_tsv etc.
  • For reading Excel files, use the readxl package: library(readxl). Then, use the function read_excel.

Where are your files - absolute vs relative paths

  • absolute paths start at the root directory, e.g.
    C:\Users\mbenary_c\CUBI\teaching\202403_RCrashCourse\lecture01.R

  • relative paths start at the working directory, e.g. lecture01.R

  • Where am I? - getwd()

  • Downloading files from our git-repository
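A small sketch of absolute vs. relative paths (the absolute path below is a placeholder; note that in R you can use forward slashes even on Windows):

library(tidyverse)

getwd()   # where am I? (prints the current working directory)

# absolute path: starts at the root of the file system, works from anywhere
myiris <- read_csv("C:/Users/me/CUBI/teaching/202403_RCrashCourse/Datasets/iris.csv")

# relative path: interpreted relative to the current working directory
myiris <- read_csv("../Datasets/iris.csv")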

Exercise 3.1

Read and inspect the following files:

  • TB_ORD_Gambia_Sutherland_biochemicals.csv
  • iris.csv
  • meta_data_botched.xlsx

Which functions would you use?

The function readxl_example("deaths.xls") returns a file name. Read this file:

library(readxl)
fn <- readxl_example("deaths.xls")   # full path to an example file shipped with readxl
data <- read_excel(fn)

How can you omit the lines at the top and at the bottom of the file? (hint: ?read_excel). How can you force the date columns to be interpreted as dates and not numbers?

Tibbles / readxl

Tibbles belong to the tidyverse. They are nice to work with and very useful. Also, they are mostly identical to data frames.

One crucial difference between a tibble and a data frame is that tibble[ , 1 ] returns a tibble, while dataframe[ , 1 ] returns a vector. The second crucial difference is that tibbles do not support row names (on purpose!).
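A minimal illustration of the first difference:

library(tidyverse)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

df[ , 1]    # a data frame drops to a plain vector: 1 2 3
tbl[ , 1]   # a tibble stays a (one-column) tibble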

Video: tibbles vs data frames, 10 minutes -> for later

Cleaning data

A short note on organizing your data

  • If possible, use a specialized tool and not Excel (a database, specialized clinical data management system, REDCap etc.)

And if you use Excel…

If you use Excel, follow these rules:

  • NEVER modify original files
  • Version your files
  • Check data for consistency (e.g. with R!)
  • Use preselected values
  • Add a meta-data sheet describing the fields
  • Do not color-code or format-code your information (e.g. denoting treatment in bold)
  • Add comments in separate fields
  • NEVER throw away or replace any identifiers; always keep the old identifiers
  • Understand what you describe (patients? samples? assays?)
  • Understand the difference between wide and long format

And if you use Excel and don’t follow these rules

  • brace yourself for bad times
  • get used to nasty remarks

Standardising column names

Column names should be uniform.

  • They should contain only alphanumeric characters and underscores
  • Dots are allowed, but not recommended (“old style”)
  • They should not contain spaces
  • They should start with a letter

You can use the janitor package to clean up column names:

library(tidyverse)
library(janitor)

data <- read_csv("data.csv")
data <- clean_names(data)
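A small illustration of what clean_names() does (the column names here are made up for the example):

library(janitor)
library(tibble)

df <- tibble(`Sepal Length` = 1:3, `Petal.Width` = 4:6)
colnames(clean_names(df))   # "sepal_length" "petal_width"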

Diagnosing problems

Potential problems:

  • incorrect import (e.g. numeric values as strings)
  • typos (e.g. “x15” instead of “15”)
  • incongruent labels (e.g. mixing “M”, “male” and “männlich”)
  • mixing upper and lower case
  • mixing different types of decimal separator
  • special numeric values (e.g. “> 10”, “< 5”)
  • comments in the same column (“10 (unreliable)”)
  • dates (also, Excel converts certain strings to dates)

Diagnosing problems

  • tidyverse reading functions provide a summary of the reading process, e.g.:
library(tidyverse)
myiris <- read_csv("../Datasets/iris.csv")
  • Use summary, class and typeof functions to find potential problems. This is what we expect:
typeof(myiris$Petal.Width)
## [1] "double"
class(myiris$Petal.Width)
## [1] "numeric"
summary(myiris$Petal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500

Diagnosing problems

typeof(myiris$Petal.Width)
class(myiris$Petal.Width)
summary(myiris$Petal.Width)

Longer explanation: typeof returns the so-called “storage type”, that is, the very basic type of a value (character, logical, integer, double, list etc.). class can be different (and more specific). For example, if mtx is a matrix of numbers, typeof(mtx) is "double", while class(mtx) is "matrix" (together with "array" in recent versions of R).
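A quick illustration:

mtx <- matrix(rnorm(4), nrow = 2)   # a 2 x 2 matrix of random numbers
typeof(mtx)   # "double": the storage type of the values
class(mtx)    # "matrix" "array" (in R >= 4.0; older versions return just "matrix")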

Diagnosing problems

  • Use summary, class and typeof functions to find potential problems. This is not what we expect:
typeof(myiris$`Sepal Length`)
## [1] "character"
class(myiris$`Sepal Length`)
## [1] "character"
summary(myiris$`Sepal Length`)
##    Length     Class      Mode 
##       150 character character

(we use the back ticks because the column name contains a space)

table() for overview

When used with one argument, table shows how many times each value occurs:

table(myiris$Species)
## 
##     setosa     Setosa versicolor Versicolor  virginica  Virginica 
##         45          5         42          8         46          4

table() for constructing contingency tables

When used with two arguments, table constructs a contingency table:

library(readxl)
meta_data <- read_excel("../Datasets/meta_data_botched.xlsx")
table(meta_data$PLACEBO, meta_data$ARM)
##      
##        A A . Agrip. AGRIPPAL control  F Fl. FLUAD  P PLACEBO
##   0    1   1      3       34       0  2   1    35  0       0
##   1    0   0      0        0       4  0   0     0  1      33
##   no   0   0      0        2       0  0   0     1  0       0
##   No   0   0      0        0       0  0   0     1  0       0
##   NO   0   0      0        1       0  0   0     0  0       0
##   Yes  0   0      0        0       0  0   0     0  0       1
##   YES  0   0      0        0       0  0   0     0  0       1

This can tell us if there are any inconsistencies in the data.

Diagnosing problems

  • The colorDF package provides a function called summary_colorDF which can be used to diagnose problems with different flavors of data frames:
library(colorDF)
summary_colorDF(myiris)
## # Color data frame (class colorDF) 5 x 5:
##  │Col         │Class│NAs  │unique│Summary                                                       
## 1│Sepal Length│<chr>│    0│    37│5: 10, 5.1: 9, 6.3: 9, 5.7: 8, 6.7: 8, 5.5: 7, 5.8: 7, 6.4: 7…
## 2│Sepal Width │<dbl>│    0│    25│ 2.00 [ 2.80 < 3.00>  3.38] 36.00                             
## 3│Petal?Length│<chr>│    0│    44│1.4: 13, 1.5: 13, 4.5: 8, 1.3: 7, 1.6: 7, 5.1: 7, 5.6: 6, 4: …
## 4│Petal.Width │<dbl>│    0│    22│0.1 [0.3 <1.3> 1.8] 2.5                                       
## 5│Species     │<chr>│    0│     6│virginica: 46, setosa: 45, versicolor: 42, Versicolor: 8, Set…

Exercise 3.2: Diagnosing problems

  • Read the data file iris.csv using the read_csv function. Spot the problems. How can we deal with them?
  • Read the data file meta_data_botched.xlsx. Spot the errors. How can we deal with them?

Mending problems

  • Use logical vectors to substitute values which are incorrect
  • Use logical vectors to filter out rows which are incorrect
  • Enforce a data format with as.numeric, as.character, as.factor etc.
  • Use regular expressions to search and replace

Mending problems with logical vectors

Use logical vectors to substitute values which are incorrect:

nas <- is.na(some_df$some_column)
some_df$some_column[nas] <- 0
  • is.na returns a logical vector of the same length as the input vector
  • some_df$some_column[nas] returns only the values which are NA
  • some_df$some_column[nas] <- 0 replaces the NA values by 0
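Putting this together on a small example vector:

x <- c(1.5, NA, 3.2, NA)
nas <- is.na(x)   # FALSE  TRUE FALSE  TRUE
x[nas] <- 0       # replace the NA values by 0
x                 # 1.5 0.0 3.2 0.0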

Mending problems with logical vectors

Use logical vectors to substitute values which are incorrect:

to_replace <- some_df$some_column == "male"
some_df$some_column[to_replace] <- "M"
  • some_df$some_column == "male" returns a logical vector with TRUE for all values which are equal to “male”
  • we then can replace them with a standardized value
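For example, with the myiris data read above (where table(myiris$Species) showed both “setosa” and “Setosa”), a minimal sketch could look like this:

## replace one specific mis-capitalized label
to_replace <- myiris$Species == "Setosa"
myiris$Species[to_replace] <- "setosa"

## or standardize all capitalization variants at once
myiris$Species <- tolower(myiris$Species)
table(myiris$Species)   # now only: setosa, versicolor, virginica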

Mending problems through filtering

  • Filtering the data: see tomorrow

Mending problems by enforcing a data format

Use as.numeric, as.character, as.factor etc. to enforce a data format:

some_df$some_column <- as.numeric(some_df$some_column)
  • as.numeric converts a vector to numeric values
  • as.character converts a vector to character values
  • as.factor converts a vector to a factor

Note: dates are a special case. If you are in a pinch, take a look at the lubridate package.
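A minimal sketch of the lubridate parsing helpers (the date strings below are made up):

library(lubridate)

dmy("3.10.2024")     # day-month-year  -> "2024-10-03"
ymd("2024-10-03")    # year-month-day  -> "2024-10-03"
mdy("10/03/2024")    # month-day-year  -> "2024-10-03"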

Search and replace with regular expressions

Mending problems by search and replace

Regular expressions are a powerful tool to search and replace, not only for mending / cleaning data, but also for data processing in general.

Using patterns to clean data

  • everything is a character (even numbers)
  • wildcards (jokers) allow you to create patterns
  • grep(pattern, strings) is the programming equivalent of Ctrl-F in Word/Excel
  • teachers <- c("January", "Manuela")
  • grep("an", teachers) finds the elements which contain the string “an”; what is the difference when you use grepl instead? (see the example below)
  • grep("^J", teachers) finds the elements which start with a capital “J”

Substitutions (search & replace)

  • gsub(pattern, replacement, text) substitutes all occurrences of pattern in text with replacement
  • sub(...) does the same, but only for the first occurrence
## Clearly broken!
group <- c("ontrol", "Montrol", "Kontrol", "Kkkkkontrl", "hello Connnnnnntrol", 
           "oh what a nice day it is, controooool", "control ", "Control, incoming")

## replace misspelled variants of "control" (at the end of the string) with "control"
group2 <- gsub("[ckCK]+on*tro*l$", "control", group)

## which elements match the pattern?
grepl("[ckCK]+on*tro*l$", group)

## which elements were changed?
group != group2

## replace all spaces
group <- gsub(" ", "", group)

Substitutions (search & replace)

Notes:

The above operations are so common that there are “shortcut” functions defined in R:

## remove leading and trailing whitespace
trimws(group)

## make everything lower case
tolower(group)

## make everything upper case
toupper(group)

Regular expressions

Is there not a way to simplify things?

group <- c("Control", " control", "control ", "Control   ")

# one method
gsub(" ", "", group) # but what happens if the groups are "group 1", "group 2" etc?

# substitute at the end of the string
gsub(" $", "", group)

# at the beginning
gsub("^ ", "", group)

# any number of occurrences (removes all spaces, not just leading or trailing ones)
gsub(" *", "", group)
group <- c("Control", " control", "control ", "Control   ")
group1 <- trimws(group)
group2 <- tolower(group1)

group <- tolower(trimws(group))

Regular expression cont.

In regular expressions, some characters or strings have special meanings:

  • Some specify what is to be matched (e.g. ., the dot, means “any character”)

  • Some specify how many times something gets matched; e.g. * means “any number of matches of the previous character”, so .* means any character, any number of times (that is, everything)

  • Some characters have a special meaning only at a certain position: ^ at the beginning of a regular expression means “match at the beginning of a string”, and $ at the end means “match at the end of a string”

  • A character class like [abc] will match only a, b or c, but not d. You can use ranges, e.g. [0-9] or [a-z] (see the examples below)

  • https://rstudio.github.io/cheatsheets/regex.pdf
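A few quick illustrations of these rules:

grepl("a.c", c("abc", "ac", "a_c"))     # "." matches any single character: TRUE FALSE TRUE
grepl("^con", c("control", "Falcon"))   # "^" anchors the match at the start: TRUE FALSE
grepl("ol$", c("control", "old"))       # "$" anchors the match at the end: TRUE FALSE
grepl("[0-9]", c("x15", "xx"))          # character class: TRUE FALSE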

Regular expression cont.

foo <- c("a1_123", "a1_231", "a2_17", "B2_23", "c2_889")

# remove digits
gsub("[0-9]*", "", foo)

# remove lower-case letters (the upper-case "B" stays)
gsub("[a-z]*", "", foo)

# same, but case insensitive
gsub("[a-z]*", "", foo, ignore.case=TRUE)

# remove the "_" and everything after it
# (the dot means "any character")
gsub("_.*", "", foo)

Regular expression cont.

Note: the function gsub takes three arguments:

  • a regular expression
  • a replacement
  • a character vector

It then finds all occurrences of the regular expression in each element of the character vector and replaces them with the replacement.

Typical example

Often, due to typos, a numeric column is read from a data file as a character column, for example because someone entered “x15” instead of “15” or mixed German and English decimal separators (e.g. “1,5” and “1.5” in the same column).

We can use the as.numeric function to convert the strings to numeric values, but this does not work as intended:

foo <- c("xxxxx15", "15.5", "15,5")
as.numeric(foo)
## [1]   NA 15.5   NA

We need to substitute the strings first:

foo <- gsub("[a-zA-Z]*", "", foo) # remove letters
foo <- gsub(",", ".", foo)        # convert decimal separator
as.numeric(foo)
## [1] 15.0 15.5 15.5

Regexps cont.

You can wrap gsub around another gsub

vec <- c("mouse", " Mouse", "mus domesticus", "chicken", "Schicken", "Duck")
vec <- gsub("S*chicken", "Chicken", gsub("^m.*", "Mouse", gsub("^ *", "", vec)))

## same as

vec <- gsub("^ *", "", vec)
vec <- gsub("^m.*", "Mouse", vec)
vec <- gsub("S*chicken", "Chicken", vec)

Exercise 3.3

  • Use gsub to make the following uniform: c("male", "Male ", "M", "F", "female", " Female")
  • Using gsub and the toupper function, clean up the gene names such that they conform to the HGNC (all capital letters, no spaces, no dashes): c("ankrd22", "ifng", "Nf-kb", " Cxcl 5", "CCL 6.", "ANK.r.d. 12")
  • What regular expression matches all of the ankyrin repeat genes (but not other genes) in the following vector: c("ANKRD22", "ANKEN", "ank.rep.domain 12", "ifng-1", "ANKRD-33", " ankrd23", "MAPK")

Using regular expressions to clean tables

Exercise 3.4

  • Read the data file iris.csv using the read_csv function. Spot and correct errors.
  • If you have time: Read the data file meta_data_botched.xlsx. Spot and correct errors.

Writing data

Main data types you will encounter:

  • TSV / tab-separated values: write.table() (base R). Tab is \t. A very general and customizable function. Warning: column names!
  • TSV / tab-separated values: write_tsv() (readr). No row names!
  • CSV / comma-separated values: write.csv() (base R). A wrapper around write.table().
  • CSV / comma-separated values: write_csv() (readr). No row names!
  • XLS (old Excel): just don’t use it. No, seriously, don’t.
  • XLSX (new Excel): write_xlsx() (writexl). Warning: row names are omitted!
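A minimal sketch of writing out the cleaned-up iris data (file names are placeholders):

library(tidyverse)   # write_csv(), write_tsv()
library(writexl)     # write_xlsx()

write_csv(myiris, "iris_clean.csv")     # no row names are written
write_tsv(myiris, "iris_clean.tsv")
write_xlsx(myiris, "iris_clean.xlsx")   # row names, if any, are dropped

# base R: write.table() writes row names by default, which makes the header
# appear shifted by one column; row.names = FALSE avoids that
write.table(myiris, "iris_clean.tsv", sep = "\t", row.names = FALSE, quote = FALSE)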

Row names – gory details.

The base R read.* functions assign (unless otherwise specified) row names to a data frame.

The tidyverse read_* family of functions return not a data frame but its variant, a tibble. Tibbles behave very much like data frames; however, the tidyverse discourages the use of row names and sometimes even removes existing row names from data frames.
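If you do need to keep row names when writing with the readr functions, one common workaround is to turn them into a regular column first. A minimal sketch using the built-in mtcars data set (which has row names):

library(tidyverse)

df <- rownames_to_column(mtcars, var = "model")   # row names become an ordinary column
write_csv(df, "mtcars.csv")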