I want to pivot a dataset like this to long, splitting the column into three 1) before the first underscore 2) after the last underscore and 3) everything in between.
I have a dataframe like this
amino_acid_change <- c("Arg2Trp", "Asp40Glu", "Asp8Asn")
col_1 <- c(13582, 5960, 3371)
col_2 <- c(12589, 7531, 3083)
df <- tibble(amino_acid_change, col_1, col_2)
df <- df %>%
rename("0_BDQ_1" = col_1,
"0_BDQ_PZA_1" = col_2)
df
and I want to pivot to long, retaining the 'amino_acid_change' column and splitting the others as above. Of course it is the regex for names_pattern that I'm finding difficult.
This is the basic code, but without the regex.
df_long <- df %>%
pivot_longer(!amino_acid_change,
names_to = c("time", "arm", "replicate"),
names_pattern = "(...)(...)(...)",
values_to = "count")
Desired output
A tibble: 6 x 5
amino_acid_change time arm replicate count
<chr> <dbl> <chr> <dbl> <dbl>
1 Arg2Trp 0 BDQ 1 13582
2 Asp40Glu 0 BDQ 1 5960
3 Asp8Asn 0 BDQ 1 3371
4 Arg2Trp 0 BDQ_PZA 1 12589
5 Asp40Glu 0 BDQ_PZA 1 7531
6 Asp8Asn 0 BDQ_PZA 1 3083
Thank you
CodePudding user response:
You can use the merge
argument in separate()
and extract()
from tidyr
package:
Here:
library(tidyr)
library(stringr)
df %>%
pivot_longer(
cols = -"amino_acid_change",
names_to = "key",
values_to = "count"
) %>%
separate(key, c("time", "arm"), extra = "merge") %>%
extract(arm, c("arm", "replicate"), "(.*)_([^_] )$")
Output is:
# A tibble: 6 × 5
amino_acid_change time arm replicate count
<chr> <chr> <chr> <chr> <dbl>
1 Arg2Trp 0 BDQ 1 13582
2 Arg2Trp 0 BDQ_PZA 1 12589
3 Asp40Glu 0 BDQ 1 5960
4 Asp40Glu 0 BDQ_PZA 1 7531
5 Asp8Asn 0 BDQ 1 3371
6 Asp8Asn 0 BDQ_PZA 1 3083
CodePudding user response:
This seems to come close to what you want, but I've had to make some assumptions about the general form of your patterns.
df %>%
pivot_longer(!amino_acid_change,
names_to = c("time", "arm", "replicate"),
names_pattern = "^(\\d )_([A-Z0-9] _?[A-Z0-9] )_([0-9] )$",
values_to = "count")
# A tibble: 6 × 5
amino_acid_change time arm replicate count
<chr> <chr> <chr> <chr> <dbl>
1 Arg2Trp 0 BDQ 1 13582
2 Arg2Trp 0 BDQ_PZA 1 12589
3 Asp40Glu 0 BDQ 1 5960
4 Asp40Glu 0 BDQ_PZA 1 7531
5 Asp8Asn 0 BDQ 1 3371
6 Asp8Asn 0 BDQ_PZA 1 3083
Explanation
Breaking down the regex...
^(\\d )
: The time
starts at the start of the column name and consists of one or more decimal digits.
_
: An underscore, which does not form part of a capturing group.
([A-Z0-9] _?[A-Z0-9] )
: The arm
. It is at least 2 characters long. It starts with a string of at least one decimal digit or uppercase letter. Then there is an optional (`?~) underscore. Finally, there is another string of at least one decimal digit or uppercase letter.
_
: An underscore, which does not form part of a capturing group.
([0-9] )$
: The replicate
consists of at least one decimal digit and ends at the end of the string.