I have been looking for a way to transform my data in R. As I am not quite sure what exactly I’m looking to do, my internet search has been frustrating.
To explain: let’s say I ran a study with 10 subjects, who each made two observations. Each observation is made on Oranges, Apples and Bananas. There are four aspects to be rated on each fruit set, but each participant only rates the fruit on two of the four aspects.
This leads me to this dataset, with lots of NAs:
df <- data.frame(ID = (1:10),
Age = c(20, 20, 20, 20, 19, 25, 23, 28, 30, 18),
A1_Orange = c(1, NA, 2, 1, NA, NA, 3, NA, NA, 3),
A1_Apple = c(2, NA, 2, 2, NA, NA, 4, NA, NA, 4),
A1_Banana = c(2, NA, 2, 4, NA, NA, 2, NA, NA, 2),
A2_Orange = c(NA, 3, 3, NA, 3, NA, NA, 2, 1, NA),
A2_Apple = c(NA, 3, 4, NA, 3, NA, NA, 3, 1, NA),
A2_Banana = c(NA, 2, 1, NA, 4, NA, NA, 4, 2, NA),
A3_Orange = c(3, NA, NA, 3, NA, 4, 2, NA, 4, NA),
A3_Apple = c(4, NA, NA, 2, NA, 4, 3, NA, 4, NA),
A3_Banana = c(3, NA, NA, 2, NA, 4, 4, NA, 4, NA),
A4_Orange = c(NA, 4, NA, NA, 3, 3, NA, 4, NA, 2),
A4_Apple = c(NA, 3, NA, NA, 3, 3, NA, 4, NA, 3),
A4_Banana = c(NA, 2, NA, NA, 2, 3, NA, 4, NA, 2))
My ideal transformation would lead my data to look something like this:
df2 <- data.frame(ID = c (1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6,7, 7, 8, 8, 9, 9, 10, 10),
Age = c(20, 20, 20, 20, 20, 20, 19, 19, 19, 19, 25, 25, 23, 23, 28, 28, 30, 30, 18, 18),
Orange = c(1, 3,3, 4, 2, 3, 1, 3, 3, 3, 4, 3, 3, 2, 2, 4, 1, 4, 3, 2),
Apple = c(2, 4, 3, 3, 2, 4, 2, 2, 3, 3, 4, 3, 4, 3, 3, 4, 1, 4, 4, 3),
Banana = c(2, 3, 2, 2, 2, 1, 4, 2, 4, 2, 4, 3, 2, 4, 4, 4, 2, 4, 2, 2),
Aspect = c(A1, A3, A2, A4, A1, A2, A1, A3, A2, A4, A3, A4, A1, A3, A2, A4, A2, A3, A1, A4))
I understand I’d first have to use pivot_longer
as there are two observations per participant.
For the rest of the transformations needed, I looked at functions like cbind
, rbind
, coalesce
, duplicated
, unite
, from which coalesce
seemed to be my best bet of figuring this out, but so far I have not been able to do so.
To get that extra variable ‘Aspect’ I thought maybe something like grepl
would be useful, but again I am not sure how to practically implement this.
Any insights would be much appreciated!
CodePudding user response:
You could use pivot_longer()
from tidyr
and set ".value"
into the argument names_to
.
From the doc of pivot_longer
:
".value"
indicates that component of the name defines the name of the column containing the cell values, overridingvalues_to
.
library(tidyr)
df %>%
pivot_longer(contains("_"),
names_to = c("Aspect", ".value"),
names_sep = "_") %>%
drop_na()
Output
# A tibble: 20 × 6
ID Age Aspect Orange Apple Banana
<int> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 20 A1 1 2 2
2 1 20 A3 3 4 3
3 2 20 A2 3 3 2
4 2 20 A4 4 3 2
5 3 20 A1 2 2 2
6 3 20 A2 3 4 1
7 4 20 A1 1 2 4
8 4 20 A3 3 2 2
9 5 19 A2 3 3 4
10 5 19 A4 3 3 2
11 6 25 A3 4 4 4
12 6 25 A4 3 3 3
13 7 23 A1 3 4 2
14 7 23 A3 2 3 4
15 8 28 A2 2 3 4
16 8 28 A4 4 4 4
17 9 30 A2 1 1 2
18 9 30 A3 4 4 4
19 10 18 A1 3 4 2
20 10 18 A4 2 3 2
CodePudding user response:
You could use tidyverse
pivoting to do it all in one go:
library(tidyverse)
df |>
pivot_longer(-c(ID, Age), names_sep = "_", names_to = c("Aspect", "Fruit"), values_drop_na = TRUE) |>
pivot_wider(c(ID, Age, Aspect), names_from = "Fruit", values_from = "value")
Output:
# A tibble: 20 × 6
ID Age Aspect Orange Apple Banana
<int> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 20 A1 1 2 2
2 1 20 A3 3 4 3
3 2 20 A2 3 3 2
4 2 20 A4 4 3 2
5 3 20 A1 2 2 2
6 3 20 A2 3 4 1
7 4 20 A1 1 2 4
8 4 20 A3 3 2 2
9 5 19 A2 3 3 4
10 5 19 A4 3 3 2
11 6 25 A3 4 4 4
12 6 25 A4 3 3 3
13 7 23 A1 3 4 2
14 7 23 A3 2 3 4
15 8 28 A2 2 3 4
16 8 28 A4 4 4 4
17 9 30 A2 1 1 2
18 9 30 A3 4 4 4
19 10 18 A1 3 4 2
20 10 18 A4 2 3 2
Update: Darren Tsai's solution is more elegant (upvoted).