- Reading data
- Cleaning data
- (Regular expressions)
- Searching, sorting and selecting
2024-09-19
Main data types you will encounter:
Data type | Function | Package | Notes |
---|---|---|---|
Columns separated by spaces | read_table() |
readr /tidyverse |
one or more spaces separate each column |
TSV / TAB separated values | read_tsv() |
readr /tidyverse |
Delimiter is tab (\t ). |
CSV / comma separated | read_csv() |
readr /tidyverse |
Comma separated values |
Any delimiter | read_delim() |
readr /tidyverse |
Customizable |
XLS (old Excel) | read_xls() read_excel() |
readxl |
Just don’t use it. From the readxl package. |
XLSX (new Excel) | read_xlsx() read_excel() |
readxl |
From the readxl package. You need to provide the sheet number you wish to read. Note: returns a tibble, not a data frame! |
Note: there are also “base R” functions read.table
, read.csv
, read.tsv
(there is no function for reading XLS[X] files in base R). The tidyverse functions above are preferable.
readr
package. This package is loaded automatically when you load the tidyverse
package: library(tidyverse)
. Then, use the functions read_csv
, read_tsv
etc.readxl
package: library(readxl)
. Then, use the function read_excel
.absolute path start at root directory, e.g.C:\Users\mbenary_c\CUBI\teaching\202403_RCrashCourse\lecture01.R
start at the working directory lecture01.R
Where am I? - getwd()
Downloading files from our git-repository
Read, inspect the following files:
TB_ORD_Gambia_Sutherland_biochemicals.csv
iris.csv
meta_data_botched.xlsx
Which functions would you use?
The function readxl_example("deaths.xls")
returns a file name. Read this file:
fn <- readxl_example("deaths.xls") data <- read_excel(fn)
How can you omit the lines at the top and at the bottom of the file? (hint: ?read_excel
). How can you force the date columns to be interpreted as dates and not numbers?
tibbles belong to the tidyverse. They are nice to work with and very useful. Also, they are mostly identical to data frames.
One crucial difference between tibble and data frame is that tibble[ , 1 ]
returns a tibble, while dataframe[ , 1]
returns a vector. The second crucial difference is that it does not support row names (on purpose!).
Video: tibbles vs data frames, 10 minutes -> for later
If you use Excel, follow these rules:
Column names should be uniform.
You can use the janitor package to clean up column names:
library(janitor) data <- read_csv("data.csv") data <- clean_names(data)
Potential problems:
library(tidyverse) myiris <- read_csv("../Datasets/iris.csv")
summary
, class
and typeof
functions to find potential problems. This is what we expect:typeof(myiris$Petal.Width)
## [1] "double"
class(myiris$Petal.Width)
## [1] "numeric"
summary(myiris$Petal.Width)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.100 0.300 1.300 1.199 1.800 2.500
typeof(myiris$Petal.Width) class(myiris$Petal.Width) summary(myiris$Petal.Width)
Longer explanation: typeof
returns the so called “storage type”, that is the very basic type of a value (string, logical, integer, list etc.). class
can be different (more specific). For example, if mtx
is a matrix of numbers, typeof(mtx)
is double
, and class(mtx)
is matrix
.
summary
, class
and typeof
functions to find potential problems. This is not what we expect:typeof(myiris$`Sepal Length`)
## [1] "character"
class(myiris$`Sepal Length`)
## [1] "character"
summary(myiris$`Sepal Length`)
## Length Class Mode ## 150 character character
(we use the back ticks because the column name contains a space)
table()
for overviewWhen used with one argument, table
shows how many times each value occurs:
table(myiris$Species)
## ## setosa Setosa versicolor Versicolor virginica Virginica ## 45 5 42 8 46 4
table()
for constructing contingency tablesWhen used with two arguments, table
constructs a contingency table:
library(readxl) meta_data <- read_excel("../Datasets/meta_data_botched.xlsx") table(meta_data$PLACEBO, meta_data$ARM)
## ## A A . Agrip. AGRIPPAL control F Fl. FLUAD P PLACEBO ## 0 1 1 3 34 0 2 1 35 0 0 ## 1 0 0 0 0 4 0 0 0 1 33 ## no 0 0 0 2 0 0 0 1 0 0 ## No 0 0 0 0 0 0 0 1 0 0 ## NO 0 0 0 1 0 0 0 0 0 0 ## Yes 0 0 0 0 0 0 0 0 0 1 ## YES 0 0 0 0 0 0 0 0 0 1
This can tell us if there are any inconsistencies in the data.
summary_colorDF
which can be used to diagnose problems with different flavors of data frames:library(colorDF) summary_colorDF(myiris)
## # Color data frame (class colorDF) 5 x 5: ## │Col │Class│NAs │unique│Summary ## 1│Sepal Length│<chr>│ 0│ 37│5: 10, 5.1: 9, 6.3: 9, 5.7: 8, 6.7: 8, 5.5: 7, 5.8: 7, 6.4: 7… ## 2│Sepal Width │<dbl>│ 0│ 25│ 2.00 [ 2.80 < 3.00> 3.38] 36.00 ## 3│Petal?Length│<chr>│ 0│ 44│1.4: 13, 1.5: 13, 4.5: 8, 1.3: 7, 1.6: 7, 5.1: 7, 5.6: 6, 4: … ## 4│Petal.Width │<dbl>│ 0│ 22│0.1 [0.3 <1.3> 1.8] 2.5 ## 5│Species │<chr>│ 0│ 6│virginica: 46, setosa: 45, versicolor: 42, Versicolor: 8, Set…
iris.csv
using the read_csv
function. Spot the problems. How can we deal with them?meta_data_botched.xlsx
. Spot the errors. How can we deal with them?as.numeric
, as.character
, as.factor
etc.Use logical vectors to substitute values which are incorrect:
nas <- is.na(some_df$some_column) some_df$some_column[nas] <- 0
is.na
returns a logical vector of the same length as the input vectorsome_df$some_column[nas]
returns only the values which are NA
some_df$some_column[nas] <- 0
replaces the NA
values by 0
Use logical vectors to substitute values which are incorrect:
to_replace <- some_df$some_column == "male" some_df$some_column[to_replace] <- "M"
some_df$some_column == "male"
returns a logical vector with TRUE
for all values which are equal to “male”Use as.numeric, as.character, as.factor etc. to enforce a data format:
some_df$some_column <- as.numeric(some_df$some_column)
as.numeric
converts a vector to numeric valuesas.character
converts a vector to character valuesas.factor
converts a vector to a factorNote: dates are special case. If you are in a pinch, take a look at the lubridate
package.
Regular expressions are a powerful tool to search and replace, not only for mending / cleaning data, but also for data processing in general.
grep(pattern, string)
is the programming equivalent of Ctrl F
in Word/Excelteachers <- c("January", "Manuela")
grep("an", teachers)
find the string “an” in teachers -> what’s the difference when using grepl
grep("^J", teachers)
find the string which starts with a capital Jgsub(pattern, string, text)
substitute all occurences of pattern
in by string
in text
sub(...)
same, but only the first occurence## Clearly broken! group <- c("ontrol", "Montrol", "Kontrol", "Kkkkkontrl", "hello Connnnnnntrol", "oh what a nice day it is, controooool", "control ", "Control, incoming") ## replace upper-case "Control" group2 <- gsub("[ckCK]+on*tro*l$", "control", group) grepl("[ckCK]+on*tro*l$", group) group != group2 ## replace all spaces group <- gsub(" ", "", group)
Notes:
The above operations are so common that there are “shortcut” functions defined in R:
## get rid of the spaces trimws(group) ## make everything lower case tolower(group) ## make everything upper case toupper(group)
Is there not a way to simplify things?
group <- c("Control", " control", "control ", "Control ") # one method gsub(" ", "", group) # but what happens if the groups are "group 1", "group 2" etc? # substitute at the end of the string gsub(" $", "", group) # at the beginning gsub("^ ", "", group) # any number of occurences gsub(" *", "", group)
group <- c("Control", " control", "control ", "Control ") group1 <- trimws(group) group2 <- tolower(group1) group <- tolower(trimws(group))
In regular expressions, some characters or strings have special meanings
Some mean what is supposed to be replaced (e.g. .
, dot, means “any character”)
Some mean how many times something gets matched; e.g. *
means “any number of matches of the previous character, so .*
means any character any number of times (like, everything)
Some characters at certain position may also have a special meaning (like ^
means “match the beginning of a string” and $
at the end of a regular expression means “match at the end of a string”).
A string like [abc]
will match only a, b and c, but not d. You can use ranges, e.g. [0-9]
or [a-z]
foo <- c("a1_123", "a1_231", "a2_17", "B2_23", "c2_889") # substitute digits only gsub("[0-9]*", "", foo) # substitute letters only gsub("[a-z]*", "", foo) # same, but case insensitive gsub("[a-z]*", "", foo, ignore.case=TRUE) # substitute everything after "_" # dot means "any character" gsub("_.*", "", foo)
Note: the function gsub
takes three arguments:
It then finds all occurences of the regular expression in each element of the character vector and replaces them by the replacement.
Often, due to typos, when you read a data file, a numeric column becomes a character string, because for example someone entered “x15” instead of “15” or mixed German and English decimal separators (e.g. “1,5” and “1.5” in the same column).
We can use the as.numeric
function to convert the strings to numeric values, but this does not work as intended:
foo <- c("xxxxx15", "15.5", "15,5") as.numeric(foo)
## [1] NA 15.5 NA
We need to substitute the strings first:
foo <- gsub("[a-zA-Z]*", "", foo) # remove letters foo <- gsub(",", ".", foo) # convert decimal separator as.numeric(foo)
## [1] 15.0 15.5 15.5
You can wrap gsub around another gsub
vec <- c("mouse", " Mouse", "mus domesticus", "chicken", "Schicken", "Duck") vec <- gsub("S*chicken", "Chicken", gsub("^m.*", "Mouse", gsub("^ *", "", vec))) ## same as vec <- gsub("^ *", "", vec) vec <- gsub("^m.*", "Mouse", vec) vec <- gsub("S*chicken", "Chicken", vec)
c("male", "Male ", "M", "F", "female", " Female")
gsub
and the toupper
function, clean up the gene names such that they conform to the HGNC (all capital letters, no spaces, no dashes): c("ankrd22", "ifng", "Nf-kb", " Cxcl 5", "CCL 6.", "ANK.r.d. 12")
c("ANKRD22", "ANKEN", "ank.rep.domain 12", "ifng-1", "ANKRD-33", " ankrd23", "MAPK)
iris.csv
using the read_csv
function. Spot and correct errors.meta_data_botched.xlsx
. Spot and correct errors.Main data types you will encounter:
Data type | Function | Package | Notes |
---|---|---|---|
TSV / TAB separated values | write.table() |
base R |
Tab is \t . A very general and customizable function Warning: column names |
write_tsv() |
readr |
No rownames! | |
CSV / comma separated | write.csv() |
base R |
a wrapper around write.table() |
write_csv() |
readr |
No rownames! | |
XLS (old Excel) | Just don’t use it. No, seriously, don’t. | ||
XLSX (new Excel) | write_xlsx() |
writexl |
Warning: row names omitted! |
The default base R functions read.*
assign (unless otherwise specified) row names to a data frame.
The tidyverse read_*
family of functions return not a data frame, but its variant, a tibble. These tibbles behave very much like data frame; however, tidyverse discourages the use of row names and sometimes even removes existing row names on data frames.