Appendix: Wide vs long data

Wide and Long format

Consider a data set with laboratory measurement for a number of patients. We can measure, for example, the concentration of calcium, of C-reactive protein, number of white blood cells in billions per liter etc., etc.

We have many ways to represent the data in a data frame. For example, we can put all measurements for a patient in one row, and have a column for each measurement type. This results in a nice, compact format, where each row corresponds to one patient, and each column to one measurement. Here is how it might look like:

SUBJ	CA	CRP	WBC
4081	2.31	0.45	12.20
4368	2.14	0.62	6.18
4601	2.26	0.60	5.03
4224	2.25	0.32	7.62
4323	2.04	0.60	5.76

This is called the wide format. It is easy to view, and suitable for some applications – like principal component analysis. It is also easy to filter the data by patients, or by certain measurement parameters. However, we can’t use it for many other applications, for example for plotting and many statistical applications.

And what if we want to include additional information about the measurements? For example, when the measurement was made, in what units, who conducted the test? We could add columns for each of these, but then we would have a mess.

SUBJ	LBORRES_CA	LBORRESU_CA	LABTECH_CA	LBORDATE_CA	LBORRES_CRP	LBORRESU_CRP	LABTECH_CRP	LBORDATE_CRP	LBORRES_WBC	LBORRESU_WBC	LABTECH_WBC	LBORDATE_WBC
4081	2.31	mmol/L	Mr. Hyde	2022-05-18	0.45	mg/L	Frank N. Stein	2022-05-01	12.20	10*9/L	Frank N. Stein	2022-04-26
4368	2.14	mmol/L	Mr. Hyde	2021-11-21	0.62	mg/L	Dr. Jekyll	2021-11-23	6.18	10*9/L	Dr. Jekyll	2022-04-09
4601	2.26	mmol/L	Dr. Jekyll	2022-05-22	0.60	mg/L	Frank N. Stein	2021-12-02	5.03	10*9/L	Frank N. Stein	2022-02-14
4224	2.25	mmol/L	Dr. Jekyll	2021-12-28	0.32	mg/L	Mr. Hyde	2022-01-06	7.62	10*9/L	Frank N. Stein	2022-05-09
4323	2.04	mmol/L	Mr. Hyde	2022-06-04	0.60	mg/L	Frank N. Stein	2021-12-14	5.76	10*9/L	Mr. Hyde	2022-01-10

Yeah, no, this is not going to work. It is hard to view manually, but for computing applications it is even worse. We cannot easily convert it to a matrix, and we still can’t use it for ggplot2 or linear modeling. It is, quite probably, the worst of both worlds¹.

¹ Actually, that is not true. Even worse is adding information directly into the column names, like CA (mmol/L, 2021-09-01, Dr. Jekyll). Close second comes adding information in the values directly, like 7.9 mmol/L (2021-09-01, Dr. Jekyll), but at least here we can attempt to automatically parse the values – as long as there are no typos…

Enter the long format. In the long format, each row corresponds to one measurement, and one measurement only:

SUBJ	LBTEST	LBTESTCD	LBORRES	LBORRESU	LBORDATE	LABTECH
4081	Calcium	CA	2.31	mmol/L	2022-05-18	Mr. Hyde
4081	C Reactive Protein	CRP	0.45	mg/L	2022-05-01	Frank N. Stein
4081	White Blood Cells	WBC	12.20	10*9/L	2022-04-26	Frank N. Stein
4368	Calcium	CA	2.14	mmol/L	2021-11-21	Mr. Hyde
4368	C Reactive Protein	CRP	0.62	mg/L	2021-11-23	Dr. Jekyll

This is by far the most advantageous format for transfering and storing data. It is also the most flexible format for data analysis. We can easily convert it to the wide format (possibly losing some information, like units or who conducted the test), or to a matrix, or to a plot. We can easily filter the data, or do statistics on it.

However, as you can see above, some of the information might be duplicated many times – for example, if we have a column with patient’s name, it will appear many times next to each measurement of that patient.

To sum it up:

Long advantages:

easier to filter, process, visualize, do statistics with
focused on measurement (“patient ID” or equivalent is a covariate, and so is measurement type)
safer

Wide advantages:

groups data by a covariate (“patient ID”)
can be easier to manage (each column one measurement type)
less duplication of information (smaller files)

I recommend using the long format in most situations, however there are always occasions where we need to move from one option to the other. For example, we might want to do a PCA – so we need to convert to the wide format. Or the wide format is more suitable for manual data entry – and we have to convert it to the long format for analysis.

It is very useful to be fluent in switching between both formats.

Use long format

In general, you should probably always use the long format as the main format for your data. It is easier to work with, and you can always convert it to wide format if you need to.

There are exceptions to this rule – typically when it comes to high-throughput data (like RNA-Seq).

Converting from wide to long:

First, let’s read an example wide data set:

library(tidyverse)
wide <- read_csv("Datasets/wide_example.csv")
wide

# A tibble: 4 × 5
  subject sex   control cond1 cond2
    <dbl> <chr>   <dbl> <dbl> <dbl>
1       1 M         7.9  12.3  10.7
2       2 F         6.3  10.6  11.1
3       3 F         9.5  13.1  13.8
4       4 M        11.5  13.4  12.9

This is clearly a wide format – each row corresponds not to one observation (one measurement), but one subject. We want to convert this to long format, where each row corresponds to one observation (and therefore, for each subject, there are three rows).

pivot_longer()

pivot_longer(wide, cols=control:cond2,
  names_to="condition", values_to="measurement")

# A tibble: 12 × 4
   subject sex   condition measurement
     <dbl> <chr> <chr>           <dbl>
 1       1 M     control           7.9
 2       1 M     cond1            12.3
 3       1 M     cond2            10.7
 4       2 F     control           6.3
 5       2 F     cond1            10.6
 6       2 F     cond2            11.1
 7       3 F     control           9.5
 8       3 F     cond1            13.1
 9       3 F     cond2            13.8
10       4 M     control          11.5
11       4 M     cond1            13.4
12       4 M     cond2            12.9

The key function is pivot_longer(). We need to specify three things:

which columns contain the measurements – in our case control, cond1, and cond2, which we shorten to control:cond2 using tidyverse selection syntax;
name of the columns which will contain the names of the old columns – we choose condition. Now in the condition column we will have one of the values control, cond1, or cond2;
name of the column which will contain the actual measurement values – we choose measurement.

The remaining columns will be left as they are. So, the first three rows correspond to the originally first row. The subject and sex columns have identical values for all the three rows, but the condition and measurement columns differ. For example, the second row contains the value cond1 in the condition column, and the value 12.3 in the corresponding measurement column. This corresponds to the value 12.3 from the cond1 column in the first row of the original data set.

Note that we must put quotes around condition and measurement in the code above. These are column names, but the columns don’t exist yet, so we cannot use the tidyverse selection syntax.

Converting from long to wide

Here is another example data set, this time in long format:

long <- read_csv("Datasets/long_example.csv")
long

# A tibble: 9 × 5
  subject sampleID sex   condition measurement
    <dbl> <chr>    <chr> <chr>           <dbl>
1       1 ID000001 M     control           7.9
2       1 ID000002 M     cond1            12.3
3       1 ID000003 M     cond2            10.7
4       2 ID000004 F     control           6.3
5       2 ID000005 F     cond1            10.6
6       2 ID000006 F     cond2            11.1
7       3 ID000007 F     control           9.5
8       3 ID000008 F     cond1            13.1
9       3 ID000009 F     cond2            13.8

As you see, there are three subjects again, each with three measurements (one control, one for condition 1, and one for condition 2). We want to convert it to wide format, so we expect one row per subject, in total three rows.

We can use the pivot_wider() function to do this. We need to specify which column contains the values that should be spread out, and which column contains the names of the new columns.

pivot_wider()

However, watch out. The first thing we might want to try does not give the expected result:

## not what we wanted!!! Why?
pivot_wider(long, names_from=condition, values_from=measurement)

# A tibble: 9 × 6
  subject sampleID sex   control cond1 cond2
    <dbl> <chr>    <chr>   <dbl> <dbl> <dbl>
1       1 ID000001 M         7.9  NA    NA  
2       1 ID000002 M        NA    12.3  NA  
3       1 ID000003 M        NA    NA    10.7
4       2 ID000004 F         6.3  NA    NA  
5       2 ID000005 F        NA    10.6  NA  
6       2 ID000006 F        NA    NA    11.1
7       3 ID000007 F         9.5  NA    NA  
8       3 ID000008 F        NA    13.1  NA  
9       3 ID000009 F        NA    NA    13.8

The problem is in the sampleID column. Given that names of the variables are in the condition column, and measurement is in the measurement column, R considers all the remaining columns to be the identifier columns. But the column sampleID contains only unique values, so they must be put in separate rows, as above.

To fix this, we need to tell R not to use sampleID as an identifier column; instead, only the subject column should be used as an identifier. We can also throw in the sex column if we want to keep it, since it has only 1 value per subject.

pivot_wider(id_cols=...)

## Instead: 
pivot_wider(long, id_cols=c(subject, sex),
                  names_from=condition, values_from=measurement)

# A tibble: 3 × 5
  subject sex   control cond1 cond2
    <dbl> <chr>   <dbl> <dbl> <dbl>
1       1 M         7.9  12.3  10.7
2       2 F         6.3  10.6  11.1
3       3 F         9.5  13.1  13.8

Be careful with pivot_wider()

When using pivot_wider(), be careful to always specify the id_cols argument explicitely.

Exercise 1 (Converting to long format) Convert the following files to long format:

labresults_wide.csv
The built-in iris data set (data(iris))
cars.xlsx (tricky! hint: how do you tell which value in the long format belongs to which row in the wide format?)

Clean up and convert to long format (what seems to be the problem? How do we deal with that?):

mtcars_wide.csv