I'm trying to use pivot_longer to enlongate my dataframe, but I don't need it to be fully long, and would like to output multiple "values" columns.
Example:
df <- tibble(
ids = c("protein1", "protein2"),
mean.group1 = sample(1:1000, 2),
mean.group2 = sample(1:1000, 2),
se.group1 = sample(1:10, 2),
se.group2 = sample(1:10, 2)
)
df
# A tibble: 2 × 5
ids mean.group1 mean.group2 se.group1 se.group2
<chr> <int> <int> <int> <int>
1 protein1 763 456 6 4
2 protein2 820 624 4 7
My desired output is:
df2 <- tibble(
ids = c("protein1", "protein1", "protein2", "protein2"),
mean = c(df$mean.group1[1], df$mean.group2[1], df$mean.group1[2], df$mean.group2[2]),
se = c(df$se.group1[1], df$se.group2[1], df$se.group1[2], df$se.group2[2]),
group = c("group1", "group2", "group1", "group2")
)
df2
# A tibble: 4 × 4
ids mean se group
<chr> <int> <int> <chr>
1 protein1 763 6 group1
2 protein1 456 4 group2
3 protein2 820 4 group1
4 protein2 624 7 group2
So far, I have tried multiple subsequent pivot_longer()
followed by unique()
, but this is messing up the output:
df_longer <- df %>%
pivot_longer(cols = starts_with("mean."),
names_to = "group",
names_prefix = "mean.",
values_to = "mean") %>%
unique() %>%
pivot_longer(cols = starts_with("se."),
names_to = "group",
names_prefix = "se.",
values_to = "se",
names_repair = "unique") %>%
unique()
df_longer
# A tibble: 8 × 5
ids group...2 mean group...4 se
<chr> <chr> <int> <chr> <int>
1 protein1 group1 763 group1 6
2 protein1 group1 763 group2 4
3 protein1 group2 456 group1 6
4 protein1 group2 456 group2 4
5 protein2 group1 820 group1 4
6 protein2 group1 820 group2 7
7 protein2 group2 624 group1 4
8 protein2 group2 624 group2 7
I sort of understand why - the rows are being duplicated too many times, and thus the group identity is not being kept for each row. However, I'm having trouble coming up with a solution. I'm aware that there's a names_pattern
option but I'm not sure how it would apply in this case.
Any help would be much appreciated! I've considered pivoting to fully long format (i.e. having a "measurement" column for each 'mean', 'se', etc) and then using pivot_wider()
to pivot to the format I need, but I also haven't been able to figure out how to do that. As well, let me know if any more information is needed. My actual dataset deals with 4 different measurements (same format, i.e. measurement.group) and thousands of proteins, but the principle should be the same, I hope!
CodePudding user response:
We don't need multiple calls if we specify the names_to
as a vector of values i.e. .value
- returns the value of the columns and 'group' the column with the suffix of column names. Here, we use names_sep
as .
to split at the .
library(tidyr)
pivot_longer(df, cols = -ids, names_to = c(".value", "group"),
names_sep = "\\.")
-output
# A tibble: 4 × 4
ids group mean se
<chr> <chr> <int> <int>
1 protein1 group1 982 3
2 protein1 group2 657 7
3 protein2 group1 663 9
4 protein2 group2 215 1
NOTE: values are different as sample
was used in creation of input data without a set.seed
specified