I have a wide data frame with columns of the following pattern: a1_var1
, a2_var1
, a3_var1
, a1_var2
, a2_var2
, a3_var2
, etc. I want to pivot this to longer, creating a new variable named a
that contains the values of the number behind the "a" (1, 2, 3) and for all values of ax_var1
to be contained in the column var1
(likewise for var2
).
Consider this example data:
df <- data.frame(`id` = seq(1:3),
a1_var1 = c(111, 211, 311),
a2_var1 = c(121, 221, 321),
a3_var1 = c(131, 231, 331),
a1_var2 = c(112, 212, 312),
a2_var2 = c(122, 222, 322),
a3_var2 = c(132, 232, 332)) # first number is ID, second number is "a[x]" value, third number is var[x]
I would now like to transform the data to long format with the following columns: id
(not unique per row anymore), a
(containing the number from which the variable was taken, e.g. a1_var2
: a
= 1
), var1
(containing all values from ax_var1
columns, i.e. x11
, x21
, x31
), and var2
(containing all values from ax_var2
columns, i.e. x12
, x22
, x32
).
I want to achieve the following structure:
id a var1 var2
1 1 111 112
1 2 121 122
1 3 131 132
2 1 211 212
2 2 221 222
2 3 231 232
3 1 311 312
3 2 321 322
3 3 331 332
So far, I'm hard-coding this for every varx
like this:
df %>% select(-c(ends_with("var2"))) %>%
pivot_longer(cols = c("a1_var1", "a2_var1", "a3_var1"), names_to = "a", values_to = "var1") %>%
mutate(a=str_extract(a, "a\\d"),
a=str_extract(a, "\\d"))
but since I have many varx columns, that's kind of bothersome - can someone point out a better way of achieving the above outcome?
CodePudding user response:
We can do this in pivot_longer
itself - i.e. specify the names_to
as c("a", ".value")
where 'a' will be column name of the prefix substring from the columns to be reshaped and .value
represents the column values. In the names_pattern
, capture the substring of column name i.e. digits (\\d
) after the 'a' and the second capture group after the _
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -id, names_to = c("a", ".value"),
names_pattern = "a(\\d )_(.*)")
-output
# A tibble: 9 × 4
id a var1 var2
<int> <chr> <dbl> <dbl>
1 1 1 111 112
2 1 2 121 122
3 1 3 131 132
4 2 1 211 212
5 2 2 221 222
6 2 3 231 232
7 3 1 311 312
8 3 2 321 322
9 3 3 331 332