Sorry for the post title; I could not think of a more succinct way to describe my problem. Say I have a dataset with three sets of outcome variables for each participant. Each outcome has two columns: one giving the group each observation belongs to, and another giving the score or value for that observation. Something like this:
library(tidyr)  # pivot_longer(); also re-exports tibble()
set.seed(1)
d <- tibble(id = factor(rep(c("tb_10",
                              "ah_04",
                              "ck_17"), each = 3)),
            out1Fact = factor(sample(x = letters[1:5],
                                     size = 9,
                                     replace = TRUE)),
            out1Num = rnorm(9),
            out2Fact = factor(sample(x = letters[1:5],
                                     size = 9,
                                     replace = TRUE)),
            out2Num = rnorm(9),
            out3Fact = factor(sample(x = letters[1:5],
                                     size = 9,
                                     replace = TRUE)),
            out3Num = rnorm(9))
d
# output
# # A tibble: 9 x 7
#   id    out1Fact out1Num out2Fact out2Num out3Fact out3Num
#   <fct> <fct>      <dbl> <fct>      <dbl> <fct>      <dbl>
# 1 tb_10 a          0.487 b         0.0746 b         -0.832
# 2 tb_10 d          0.738 a          -1.99 e          -1.17
# 3 tb_10 a          0.576 d          0.620 b          -1.07
# 4 ah_04 b         -0.305 a        -0.0561 a          -1.56
# 5 ah_04 e           1.51 d         -0.156 c           1.16
# 6 ah_04 c          0.390 c          -1.47 c          0.832
# 7 ck_17 b         -0.621 b         -0.478 d         -0.227
# 8 ck_17 c          -2.21 b          0.418 c          0.266
# 9 ck_17 c           1.12 d           1.36 a         -0.377
Now what I need to do is collapse this into three variables: the first indicates the outcome contained in the first part of the column name (i.e. either out1, out2, or out3), the second holds the value of the factor for the observation (i.e. contained in all the columns ending in 'Fact'), and the third holds the value of the numerical observation (i.e. contained in all the columns ending in 'Num').
It should look something like this (note that the values in factVal and numVal will not match the corresponding values in the original dataframe; this is just to show you the shape I need):
# # A tibble: 27 x 4
#    id    outType factVal numVal
#    <fct>   <int> <chr>    <dbl>
#  1 tb_10       1 a         1.10
#  2 tb_10       1 e        0.144
#  3 tb_10       1 e       -0.118
#  4 tb_10       2 a       -0.912
#  5 tb_10       2 a        -1.44
#  6 tb_10       2 c       -0.797
#  7 tb_10       3 b         1.25
#  8 tb_10       3 b        0.772
#  9 tb_10       3 c       -0.220
# 10 ah_04       1 b       -0.425
# # ... with 17 more rows
Now pivot_longer() and pivot_wider() can do some miraculous stuff, but this one is beyond me. I tried using the names_pattern argument without success. Any help much appreciated.
CodePudding user response:
pivot_longer(d, -id, names_pattern = "out([0-9]+)(.*)", names_to = c("outType", ".value"))
# # A tibble: 27 x 4
#    id    outType Fact     Num
#    <fct> <chr>   <fct>  <dbl>
#  1 tb_10 1       a      0.487
#  2 tb_10 2       b     0.0746
#  3 tb_10 3       b     -0.832
#  4 tb_10 1       d      0.738
#  5 tb_10 2       a      -1.99
#  6 tb_10 3       e      -1.17
#  7 tb_10 1       a      0.576
#  8 tb_10 2       d      0.620
#  9 tb_10 3       b      -1.07
# 10 ah_04 1       b     -0.305
# # ... with 17 more rows
In the names_pattern, we identify two portions of the column names that we want to work with: the numeric portion ("([0-9]+)") and all characters after that number ("(.*)"). Those two capture groups correspond to the two components of names_to, and the special value ".value" maps to one or more output columns (in this case two).
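To see what the two capture groups pick out of each column name, here is a minimal base-R sketch. It assumes the quantifier in the pattern is `+`; the names `cols`, `m`, and `captures` are made up for illustration and are not part of tidyr.

```r
# Column names from the example data
cols <- c("out1Fact", "out1Num", "out2Fact", "out2Num", "out3Fact", "out3Num")

# regexec() + regmatches() return, for each name, the full match
# followed by the text captured by each parenthesised group
m <- regmatches(cols, regexec("out([0-9]+)(.*)", cols))

# Stack the per-name results into a character matrix, one row per column name
captures <- do.call(rbind, m)
colnames(captures) <- c("full", "outType", ".value")
captures
```

The second column is what ends up in outType, and the third column supplies the names of the output columns that ".value" creates.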
Said differently, the non-special "outType" maps the numbers (1, 2, 3) in the column names to a single column named "outType" (obviously).
If we could imagine mapping something like
names_pattern = "out([0-9]+)(Fact|Num)"
names_to = c("outType", "Fact", "Num")
#                       \_ ".value" _/
then it might help to understand how the names of the output columns are determined dynamically from the second capture group.
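The question asked for an integer outType and columns named factVal and numVal. One way to get closer to that shape is sketched below, assuming a tidyr version that supports the names_transform argument (1.1.0 or later). The small tibble is a hypothetical stand-in for the question's data, and `long` is an illustrative name.

```r
library(tidyr)
library(dplyr)

# Small stand-in for the question's data (hypothetical values)
d <- tibble(
  id = factor(c("tb_10", "ah_04")),
  out1Fact = factor(c("a", "b")), out1Num = c(0.5, -0.3),
  out2Fact = factor(c("b", "a")), out2Num = c(0.1, -0.1),
  out3Fact = factor(c("b", "a")), out3Num = c(-0.8, -1.6)
)

long <- d %>%
  pivot_longer(
    -id,
    names_pattern = "out([0-9]+)(.*)",
    names_to = c("outType", ".value"),
    # convert the captured digits from character to integer
    names_transform = list(outType = as.integer)
  ) %>%
  # rename the columns created by ".value" to the names used in the question
  rename(factVal = Fact, numVal = Num)

long
```

Note that factVal stays a factor here; wrapping it in as.character() inside a mutate() call would give the character column shown in the question.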