2024-09-19

Aims for today

  • Reading data
  • Cleaning data
  • (Regular expressions)
  • Searching, sorting and selecting

Recap of Day 2

Reading and writing data

Reading data

Main data types you will encounter:

  • Columns separated by spaces: read_table() (readr/tidyverse). One or more spaces separate each column.
  • TSV / tab-separated values: read_tsv() (readr/tidyverse). The delimiter is a tab (\t).
  • CSV / comma-separated values: read_csv() (readr/tidyverse).
  • Any delimiter: read_delim() (readr/tidyverse). Customizable.
  • XLS (old Excel): read_xls() or read_excel() (readxl package). Just don’t use it.
  • XLSX (new Excel): read_xlsx() or read_excel() (readxl package). You need to provide the number of the sheet you wish to read. Note: returns a tibble, not a data frame!

Note: there are also “base R” functions read.table, read.csv and read.delim (there is no function for reading XLS[X] files in base R). The tidyverse functions above are preferable.
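A minimal sketch of how these functions are typically called (the file names below are placeholders):

library(tidyverse)   # provides read_csv(), read_tsv(), read_table(), read_delim()
library(readxl)      # provides read_excel(), read_xls(), read_xlsx()

d1 <- read_csv("measurements.csv")                 # comma-separated values
d2 <- read_tsv("measurements.tsv")                 # tab-separated values
d3 <- read_delim("measurements.txt", delim = ";")  # any delimiter, here a semicolon
d4 <- read_excel("measurements.xlsx", sheet = 1)   # first sheet of an Excel file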

Reading data

  • For reading text files (csv, tsv etc.), use the readr package. This package is loaded automatically when you load the tidyverse package: library(tidyverse). Then, use the functions read_csv, read_tsv etc.
  • For reading Excel files, use the readxl package: library(readxl). Then, use the function read_excel.

Where are your files - absolute vs relative paths

  • absolute paths start at the root directory, e.g.
    C:\Users\mbenary_c\CUBI\teaching\202403_RCrashCourse\lecture01.R

  • relative paths start at the working directory, e.g. lecture01.R

  • Where am I? - getwd()

  • Downloading files from our git-repository
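A small sketch of absolute vs. relative paths (the absolute path below is a placeholder; note that in R you can use forward slashes even on Windows):

library(tidyverse)

getwd()   # where am I? (prints the current working directory)

# absolute path: starts at the root of the file system, works from anywhere
myiris <- read_csv("C:/Users/me/CUBI/teaching/202403_RCrashCourse/Datasets/iris.csv")

# relative path: interpreted relative to the current working directory
myiris <- read_csv("../Datasets/iris.csv")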

Exercise 3.1

Read and inspect the following files:

  • TB_ORD_Gambia_Sutherland_biochemicals.csv
  • iris.csv
  • meta_data_botched.xlsx

Which functions would you use?

The function readxl_example("deaths.xls") returns a file name. Read this file:

library(readxl)
fn <- readxl_example("deaths.xls")   # full path to an example file shipped with readxl
data <- read_excel(fn)

How can you omit the lines at the top and at the bottom of the file? (hint: ?read_excel). How can you force the date columns to be interpreted as dates and not numbers?

Tibbles / readxl

Tibbles belong to the tidyverse. They are nice to work with and very useful. Also, they are mostly identical to data frames.

One crucial difference between a tibble and a data frame is that tibble[ , 1 ] returns a tibble, while dataframe[ , 1 ] returns a vector. The second crucial difference is that tibbles do not support row names (on purpose!).
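A minimal illustration of the first difference:

library(tidyverse)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

df[ , 1]    # a data frame drops to a plain vector: 1 2 3
tbl[ , 1]   # a tibble stays a (one-column) tibble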

Video: tibbles vs data frames, 10 minutes -> for later

Cleaning data

A short note on organizing your data

  • If possible, use a specialized tool and not Excel (a database, specialized clinical data management system, REDCap etc.)

And if you use Excel…

If you use Excel, follow these rules:

  • NEVER modify original files
  • Version your files
  • Check data for consistency (e.g. with R!)
  • Use preselected values
  • Add a meta-data sheet describing the fields
  • Do not color-code or format-code your information (e.g. denoting treatment in bold)
  • Add comments in separate fields
  • NEVER throw away or replace any identifiers; always keep the old identifiers
  • Understand what you describe (patients? samples? assays?)
  • Understand the difference between wide and long format

And if you use Excel and don’t follow these rules

  • brace yourself for bad times
  • get used to nasty remarks

Standardising column names

Column names should be uniform.

  • They should contain only alphanumeric characters and underscores
  • Dots are allowed, but not recommended (“old style”)
  • They should not contain spaces
  • They should start with a letter

You can use the janitor package to clean up column names:

library(tidyverse)
library(janitor)

data <- read_csv("data.csv")
data <- clean_names(data)
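A small illustration of what clean_names() does (the column names here are made up for the example):

library(janitor)
library(tibble)

df <- tibble(`Sepal Length` = 1:3, `Petal.Width` = 4:6)
colnames(clean_names(df))   # "sepal_length" "petal_width"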

Diagnosing problems

Potential problems:

  • incorrect import (e.g. numeric values as strings)
  • typos (e.g. “x15” instead of “15”)
  • incongruent labels (e.g. mixing “M”, “male” and “männlich”)
  • mixing upper and lower case
  • mixing different types of decimal separator
  • special numeric values (e.g. “> 10”, “< 5”)
  • comments in the same column (“10 (unreliable)”)
  • dates (also, Excel converts certain strings to dates)

Diagnosing problems

  • tidyverse reading functions provide a summary of the reading process, e.g.:
library(tidyverse)
myiris <- read_csv("../Datasets/iris.csv")
  • Use summary, class and typeof functions to find potential problems. This is what we expect:
typeof(myiris$Petal.Width)
## [1] "double"
class(myiris$Petal.Width)
## [1] "numeric"
summary(myiris$Petal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.300   1.300   1.199   1.800   2.500

Diagnosing problems

typeof(myiris$Petal.Width)
class(myiris$Petal.Width)
summary(myiris$Petal.Width)

Longer explanation: typeof returns the so-called “storage type”, that is, the very basic type of a value (character, logical, integer, double, list etc.). class can be different (and more specific). For example, if mtx is a matrix of numbers, typeof(mtx) is "double", while class(mtx) is "matrix" (together with "array" in recent versions of R).
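A quick illustration:

mtx <- matrix(rnorm(4), nrow = 2)   # a 2 x 2 matrix of random numbers
typeof(mtx)   # "double": the storage type of the values
class(mtx)    # "matrix" "array" (in R >= 4.0; older versions return just "matrix")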

Diagnosing problems

  • Use summary, class and typeof functions to find potential problems. This is not what we expect:
typeof(myiris$`Sepal Length`)
## [1] "character"
class(myiris$`Sepal Length`)
## [1] "character"
summary(myiris$`Sepal Length`)
##    Length     Class      Mode 
##       150 character character

(we use the back ticks because the column name contains a space)

table() for overview

When used with one argument, table shows how many times each value occurs:

table(myiris$Species)
## 
##     setosa     Setosa versicolor Versicolor  virginica  Virginica 
##         45          5         42          8         46          4

table() for constructing contingency tables

When used with two arguments, table constructs a contingency table:

library(readxl)
meta_data <- read_excel("../Datasets/meta_data_botched.xlsx")
table(meta_data$PLACEBO, meta_data$ARM)
##      
##        A A . Agrip. AGRIPPAL control  F Fl. FLUAD  P PLACEBO
##   0    1   1      3       34       0  2   1    35  0       0
##   1    0   0      0        0       4  0   0     0  1      33
##   no   0   0      0        2       0  0   0     1  0       0
##   No   0   0      0        0       0  0   0     1  0       0
##   NO   0   0      0        1       0  0   0     0  0       0
##   Yes  0   0      0        0       0  0   0     0  0       1
##   YES  0   0      0        0       0  0   0     0  0       1

This can tell us if there are any inconsistencies in the data.

Diagnosing problems

  • The colorDF package provides a function called summary_colorDF which can be used to diagnose problems with different flavors of data frames:
library(colorDF)
summary_colorDF(myiris)
## # Color data frame (class colorDF) 5 x 5:
##  │Col         │Class│NAs  │unique│Summary                                                       
## 1│Sepal Length│<chr>│    0│    37│5: 10, 5.1: 9, 6.3: 9, 5.7: 8, 6.7: 8, 5.5: 7, 5.8: 7, 6.4: 7…
## 2│Sepal Width │<dbl>│    0│    25│ 2.00 [ 2.80 < 3.00>  3.38] 36.00                             
## 3│Petal?Length│<chr>│    0│    44│1.4: 13, 1.5: 13, 4.5: 8, 1.3: 7, 1.6: 7, 5.1: 7, 5.6: 6, 4: …
## 4│Petal.Width │<dbl>│    0│    22│0.1 [0.3 <1.3> 1.8] 2.5                                       
## 5│Species     │<chr>│    0│     6│virginica: 46, setosa: 45, versicolor: 42, Versicolor: 8, Set…

Exercise 3.2: Diagnosing problems

  • Read the data file iris.csv using the read_csv function. Spot the problems. How can we deal with them?
  • Read the data file meta_data_botched.xlsx. Spot the errors. How can we deal with them?

Mending problems

  • Use logical vectors to substitute values which are incorrect
  • Use logical vectors to filter out rows which are incorrect
  • Enforce a data format with as.numeric, as.character, as.factor etc.
  • Use regular expressions to search and replace

Mending problems with logical vectors

Use logical vectors to substitute values which are incorrect:

nas <- is.na(some_df$some_column)
some_df$some_column[nas] <- 0
  • is.na returns a logical vector of the same length as the input vector
  • some_df$some_column[nas] returns only the values which are NA
  • some_df$some_column[nas] <- 0 replaces the NA values by 0
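Putting this together on a small example vector:

x <- c(1.5, NA, 3.2, NA)
nas <- is.na(x)   # FALSE  TRUE FALSE  TRUE
x[nas] <- 0       # replace the NA values by 0
x                 # 1.5 0.0 3.2 0.0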

Mending problems with logical vectors

Use logical vectors to substitute values which are incorrect:

to_replace <- some_df$some_column == "male"
some_df$some_column[to_replace] <- "M"
  • some_df$some_column == "male" returns a logical vector with TRUE for all values which are equal to “male”
  • we then can replace them with a standardized value
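For example, with the myiris data read above (where table(myiris$Species) showed both “setosa” and “Setosa”), a minimal sketch could look like this:

## replace one specific mis-capitalized label
to_replace <- myiris$Species == "Setosa"
myiris$Species[to_replace] <- "setosa"

## or standardize all capitalization variants at once
myiris$Species <- tolower(myiris$Species)
table(myiris$Species)   # now only: setosa, versicolor, virginica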

Mending problems through filtering

  • Filtering the data: see tomorrow

Mending problems by enforcing a data format

Use as.numeric, as.character, as.factor etc. to enforce a data format:

some_df$some_column <- as.numeric(some_df$some_column)
  • as.numeric converts a vector to numeric values
  • as.character converts a vector to character values
  • as.factor converts a vector to a factor

Note: dates are a special case. If you are in a pinch, take a look at the lubridate package.
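A minimal sketch of the lubridate parsing helpers (the date strings below are made up):

library(lubridate)

dmy("3.10.2024")     # day-month-year  -> "2024-10-03"
ymd("2024-10-03")    # year-month-day  -> "2024-10-03"
mdy("10/03/2024")    # month-day-year  -> "2024-10-03"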

Search and replace with regular expressions

Mending problems by search and replace

Regular expressions are a powerful tool to search and replace, not only for mending / cleaning data, but also for data processing in general.

Using patterns to clean data

  • everything is a character (even numbers)
  • wildcards (jokers) allow you to create patterns
  • grep(pattern, strings) is the programming equivalent of Ctrl-F in Word/Excel
  • teachers <- c("January", "Manuela")
  • grep("an", teachers) finds the elements which contain the string “an”; what is the difference when you use grepl instead? (see the example below)
  • grep("^J", teachers) finds the elements which start with a capital “J”

Substitutions (search & replace)

  • gsub(pattern, replacement, text) substitutes all occurrences of pattern in text with replacement
  • sub(...) does the same, but only for the first occurrence
## Clearly broken!
group <- c("ontrol", "Montrol", "Kontrol", "Kkkkkontrl", "hello Connnnnnntrol", 
           "oh what a nice day it is, controooool", "control ", "Control, incoming")

## replace misspelled variants of "control" (at the end of the string) with "control"
group2 <- gsub("[ckCK]+on*tro*l$", "control", group)

## which elements match the pattern?
grepl("[ckCK]+on*tro*l$", group)

## which elements were changed?
group != group2

## replace all spaces
group <- gsub(" ", "", group)

Substitutions (search & replace)

Notes:

The above operations are so common that there are “shortcut” functions defined in R:

## remove leading and trailing whitespace
trimws(group)

## make everything lower case
tolower(group)

## make everything upper case
toupper(group)

Regular expressions

Is there not a way to simplify things?

group <- c("Control", " control", "control ", "Control   ")

# one method
gsub(" ", "", group) # but what happens if the groups are "group 1", "group 2" etc?

# substitute at the end of the string
gsub(" $", "", group)

# at the beginning
gsub("^ ", "", group)

# any number of occurrences (removes all spaces, not just leading or trailing ones)
gsub(" *", "", group)
group <- c("Control", " control", "control ", "Control   ")
group1 <- trimws(group)
group2 <- tolower(group1)

group <- tolower(trimws(group))

Regular expression cont.

In regular expressions, some characters or strings have special meanings:

  • Some specify what is to be matched (e.g. ., the dot, means “any character”)

  • Some specify how many times something gets matched; e.g. * means “any number of matches of the previous character”, so .* means any character, any number of times (that is, everything)

  • Some characters have a special meaning only at a certain position: ^ at the beginning of a regular expression means “match at the beginning of a string”, and $ at the end means “match at the end of a string”

  • A character class like [abc] will match only a, b or c, but not d. You can use ranges, e.g. [0-9] or [a-z] (see the examples below)

  • https://rstudio.github.io/cheatsheets/regex.pdf
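A few quick illustrations of these rules:

grepl("a.c", c("abc", "ac", "a_c"))     # "." matches any single character: TRUE FALSE TRUE
grepl("^con", c("control", "Falcon"))   # "^" anchors the match at the start: TRUE FALSE
grepl("ol$", c("control", "old"))       # "$" anchors the match at the end: TRUE FALSE
grepl("[0-9]", c("x15", "xx"))          # character class: TRUE FALSE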

Regular expression cont.

foo <- c("a1_123", "a1_231", "a2_17", "B2_23", "c2_889")

# remove digits
gsub("[0-9]*", "", foo)

# remove lower-case letters (the upper-case "B" stays)
gsub("[a-z]*", "", foo)

# same, but case insensitive
gsub("[a-z]*", "", foo, ignore.case=TRUE)

# remove the "_" and everything after it
# (the dot means "any character")
gsub("_.*", "", foo)

Regular expression cont.

Note: the function gsub takes three arguments:

  • a regular expression
  • a replacement
  • a character vector

It then finds all occurrences of the regular expression in each element of the character vector and replaces them with the replacement.

Typical example

Often, due to typos, a numeric column is read from a data file as a character column, for example because someone entered “x15” instead of “15” or mixed German and English decimal separators (e.g. “1,5” and “1.5” in the same column).

We can use the as.numeric function to convert the strings to numeric values, but this does not work as intended:

foo <- c("xxxxx15", "15.5", "15,5")
as.numeric(foo)
## [1]   NA 15.5   NA

We need to substitute the strings first:

foo <- gsub("[a-zA-Z]*", "", foo) # remove letters
foo <- gsub(",", ".", foo)        # convert decimal separator
as.numeric(foo)
## [1] 15.0 15.5 15.5

Regexps cont.

You can wrap gsub around another gsub

vec <- c("mouse", " Mouse", "mus domesticus", "chicken", "Schicken", "Duck")
vec <- gsub("S*chicken", "Chicken", gsub("^m.*", "Mouse", gsub("^ *", "", vec)))

## same as

vec <- gsub("^ *", "", vec)
vec <- gsub("^m.*", "Mouse", vec)
vec <- gsub("S*chicken", "Chicken", vec)

Exercise 3.3

  • Use gsub to make the following uniform: c("male", "Male ", "M", "F", "female", " Female")
  • Using gsub and the toupper function, clean up the gene names such that they conform to the HGNC (all capital letters, no spaces, no dashes): c("ankrd22", "ifng", "Nf-kb", " Cxcl 5", "CCL 6.", "ANK.r.d. 12")
  • What regular expression matches all of the ankyrin repeat genes (but not other genes) in the following vector: c("ANKRD22", "ANKEN", "ank.rep.domain 12", "ifng-1", "ANKRD-33", " ankrd23", "MAPK")

Using regular expressions to clean tables

Exercise 3.4

  • Read the data file iris.csv using the read_csv function. Spot and correct errors.
  • If you have time: Read the data file meta_data_botched.xlsx. Spot and correct errors.

Writing data

Main data types you will encounter:

  • TSV / tab-separated values: write.table() (base R). Tab is \t. A very general and customizable function. Warning: column names!
  • TSV / tab-separated values: write_tsv() (readr). No row names!
  • CSV / comma-separated values: write.csv() (base R). A wrapper around write.table().
  • CSV / comma-separated values: write_csv() (readr). No row names!
  • XLS (old Excel): just don’t use it. No, seriously, don’t.
  • XLSX (new Excel): write_xlsx() (writexl). Warning: row names are omitted!
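A minimal sketch of writing out the cleaned-up iris data (file names are placeholders):

library(tidyverse)   # write_csv(), write_tsv()
library(writexl)     # write_xlsx()

write_csv(myiris, "iris_clean.csv")     # no row names are written
write_tsv(myiris, "iris_clean.tsv")
write_xlsx(myiris, "iris_clean.xlsx")   # row names, if any, are dropped

# base R: write.table() writes row names by default, which makes the header
# appear shifted by one column; row.names = FALSE avoids that
write.table(myiris, "iris_clean.tsv", sep = "\t", row.names = FALSE, quote = FALSE)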

Row names – gory details.

The base R read.* functions assign (unless otherwise specified) row names to a data frame.

The tidyverse read_* family of functions return not a data frame but its variant, a tibble. Tibbles behave very much like data frames; however, the tidyverse discourages the use of row names and sometimes even removes existing row names from data frames.
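If you do need to keep row names when writing with the readr functions, one common workaround is to turn them into a regular column first. A minimal sketch using the built-in mtcars data set (which has row names):

library(tidyverse)

df <- rownames_to_column(mtcars, var = "model")   # row names become an ordinary column
write_csv(df, "mtcars.csv")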