SUBJ | CA | CRP | WBC |
---|---|---|---|
4081 | 2.31 | 0.45 | 12.20 |
4368 | 2.14 | 0.62 | 6.18 |
4601 | 2.26 | 0.60 | 5.03 |
4224 | 2.25 | 0.32 | 7.62 |
4323 | 2.04 | 0.60 | 5.76 |
Appendix: Wide vs long data
Wide and Long format
Consider a data set with laboratory measurement for a number of patients. We can measure, for example, the concentration of calcium, of C-reactive protein, number of white blood cells in billions per liter etc., etc.
We have many ways to represent the data in a data frame. For example, we can put all measurements for a patient in one row, and have a column for each measurement type. This results in a nice, compact format, where each row corresponds to one patient, and each column to one measurement. Here is how it might look like:
This is called the wide format. It is easy to view, and suitable for some applications – like principal component analysis. It is also easy to filter the data by patients, or by certain measurement parameters. However, we can’t use it for many other applications, for example for plotting and many statistical applications.
And what if we want to include additional information about the measurements? For example, when the measurement was made, in what units, who conducted the test? We could add columns for each of these, but then we would have a mess.
SUBJ | LBORRES_CA | LBORRESU_CA | LABTECH_CA | LBORDATE_CA | LBORRES_CRP | LBORRESU_CRP | LABTECH_CRP | LBORDATE_CRP | LBORRES_WBC | LBORRESU_WBC | LABTECH_WBC | LBORDATE_WBC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4081 | 2.31 | mmol/L | Mr. Hyde | 2022-05-18 | 0.45 | mg/L | Frank N. Stein | 2022-05-01 | 12.20 | 10*9/L | Frank N. Stein | 2022-04-26 |
4368 | 2.14 | mmol/L | Mr. Hyde | 2021-11-21 | 0.62 | mg/L | Dr. Jekyll | 2021-11-23 | 6.18 | 10*9/L | Dr. Jekyll | 2022-04-09 |
4601 | 2.26 | mmol/L | Dr. Jekyll | 2022-05-22 | 0.60 | mg/L | Frank N. Stein | 2021-12-02 | 5.03 | 10*9/L | Frank N. Stein | 2022-02-14 |
4224 | 2.25 | mmol/L | Dr. Jekyll | 2021-12-28 | 0.32 | mg/L | Mr. Hyde | 2022-01-06 | 7.62 | 10*9/L | Frank N. Stein | 2022-05-09 |
4323 | 2.04 | mmol/L | Mr. Hyde | 2022-06-04 | 0.60 | mg/L | Frank N. Stein | 2021-12-14 | 5.76 | 10*9/L | Mr. Hyde | 2022-01-10 |
Yeah, no, this is not going to work. It is hard to view manually, but for computing applications it is even worse. We cannot easily convert it to a matrix, and we still can’t use it for ggplot2
or linear modeling. It is, quite probably, the worst of both worlds1.
1 Actually, that is not true. Even worse is adding information directly into the column names, like CA (mmol/L, 2021-09-01, Dr. Jekyll)
. Close second comes adding information in the values directly, like 7.9 mmol/L (2021-09-01, Dr. Jekyll)
, but at least here we can attempt to automatically parse the values – as long as there are no typos…
Enter the long format. In the long format, each row corresponds to one measurement, and one measurement only:
SUBJ | LBTEST | LBTESTCD | LBORRES | LBORRESU | LBORDATE | LABTECH |
---|---|---|---|---|---|---|
4081 | Calcium | CA | 2.31 | mmol/L | 2022-05-18 | Mr. Hyde |
4081 | C Reactive Protein | CRP | 0.45 | mg/L | 2022-05-01 | Frank N. Stein |
4081 | White Blood Cells | WBC | 12.20 | 10*9/L | 2022-04-26 | Frank N. Stein |
4368 | Calcium | CA | 2.14 | mmol/L | 2021-11-21 | Mr. Hyde |
4368 | C Reactive Protein | CRP | 0.62 | mg/L | 2021-11-23 | Dr. Jekyll |
This is by far the most advantageous format for transfering and storing data. It is also the most flexible format for data analysis. We can easily convert it to the wide format (possibly losing some information, like units or who conducted the test), or to a matrix, or to a plot. We can easily filter the data, or do statistics on it.
However, as you can see above, some of the information might be duplicated many times – for example, if we have a column with patient’s name, it will appear many times next to each measurement of that patient.
To sum it up:
Long advantages:
- easier to filter, process, visualize, do statistics with
- focused on measurement (“patient ID” or equivalent is a covariate, and so is measurement type)
- safer
Wide advantages:
- groups data by a covariate (“patient ID”)
- can be easier to manage (each column one measurement type)
- less duplication of information (smaller files)
I recommend using the long format in most situations, however there are always occasions where we need to move from one option to the other. For example, we might want to do a PCA – so we need to convert to the wide format. Or the wide format is more suitable for manual data entry – and we have to convert it to the long format for analysis.
It is very useful to be fluent in switching between both formats.
In general, you should probably always use the long format as the main format for your data. It is easier to work with, and you can always convert it to wide format if you need to.
There are exceptions to this rule – typically when it comes to high-throughput data (like RNA-Seq).
Converting from wide to long:
First, let’s read an example wide data set:
library(tidyverse)
<- read_csv("Datasets/wide_example.csv")
wide wide
# A tibble: 4 × 5
subject sex control cond1 cond2
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 M 7.9 12.3 10.7
2 2 F 6.3 10.6 11.1
3 3 F 9.5 13.1 13.8
4 4 M 11.5 13.4 12.9
This is clearly a wide format – each row corresponds not to one observation (one measurement), but one subject. We want to convert this to long format, where each row corresponds to one observation (and therefore, for each subject, there are three rows).
pivot_longer()
pivot_longer(wide, cols=control:cond2,
names_to="condition", values_to="measurement")
# A tibble: 12 × 4
subject sex condition measurement
<dbl> <chr> <chr> <dbl>
1 1 M control 7.9
2 1 M cond1 12.3
3 1 M cond2 10.7
4 2 F control 6.3
5 2 F cond1 10.6
6 2 F cond2 11.1
7 3 F control 9.5
8 3 F cond1 13.1
9 3 F cond2 13.8
10 4 M control 11.5
11 4 M cond1 13.4
12 4 M cond2 12.9
The key function is pivot_longer()
. We need to specify three things:
- which columns contain the measurements – in our case
control
,cond1
, andcond2
, which we shorten tocontrol:cond2
usingtidyverse
selection syntax; - name of the columns which will contain the names of the old columns – we choose
condition
. Now in thecondition
column we will have one of the valuescontrol
,cond1
, orcond2
; - name of the column which will contain the actual measurement values – we choose
measurement
.
The remaining columns will be left as they are. So, the first three rows correspond to the originally first row. The subject
and sex
columns have identical values for all the three rows, but the condition
and measurement
columns differ. For example, the second row contains the value cond1
in the condition
column, and the value 12.3
in the corresponding measurement
column. This corresponds to the value 12.3
from the cond1
column in the first row of the original data set.
Note that we must put quotes around condition
and measurement
in the code above. These are column names, but the columns don’t exist yet, so we cannot use the tidyverse selection syntax.
Converting from long to wide
Here is another example data set, this time in long format:
<- read_csv("Datasets/long_example.csv")
long long
# A tibble: 9 × 5
subject sampleID sex condition measurement
<dbl> <chr> <chr> <chr> <dbl>
1 1 ID000001 M control 7.9
2 1 ID000002 M cond1 12.3
3 1 ID000003 M cond2 10.7
4 2 ID000004 F control 6.3
5 2 ID000005 F cond1 10.6
6 2 ID000006 F cond2 11.1
7 3 ID000007 F control 9.5
8 3 ID000008 F cond1 13.1
9 3 ID000009 F cond2 13.8
As you see, there are three subjects again, each with three measurements (one control, one for condition 1, and one for condition 2). We want to convert it to wide format, so we expect one row per subject, in total three rows.
We can use the pivot_wider()
function to do this. We need to specify which column contains the values that should be spread out, and which column contains the names of the new columns.
pivot_wider()
However, watch out. The first thing we might want to try does not give the expected result:
## not what we wanted!!! Why?
pivot_wider(long, names_from=condition, values_from=measurement)
# A tibble: 9 × 6
subject sampleID sex control cond1 cond2
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 ID000001 M 7.9 NA NA
2 1 ID000002 M NA 12.3 NA
3 1 ID000003 M NA NA 10.7
4 2 ID000004 F 6.3 NA NA
5 2 ID000005 F NA 10.6 NA
6 2 ID000006 F NA NA 11.1
7 3 ID000007 F 9.5 NA NA
8 3 ID000008 F NA 13.1 NA
9 3 ID000009 F NA NA 13.8
The problem is in the sampleID column. Given that names of the variables are in the condition
column, and measurement
is in the measurement
column, R considers all the remaining columns to be the identifier columns. But the column sampleID
contains only unique values, so they must be put in separate rows, as above.
To fix this, we need to tell R not to use sampleID
as an identifier column; instead, only the subject column should be used as an identifier. We can also throw in the sex
column if we want to keep it, since it has only 1 value per subject.
pivot_wider(id_cols=...)
## Instead:
pivot_wider(long, id_cols=c(subject, sex),
names_from=condition, values_from=measurement)
# A tibble: 3 × 5
subject sex control cond1 cond2
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 M 7.9 12.3 10.7
2 2 F 6.3 10.6 11.1
3 3 F 9.5 13.1 13.8
pivot_wider()
When using pivot_wider()
, be careful to always specify the id_cols
argument explicitely.