I have a df with multiple instances of each column. I'd like to "concatenate" all rows for each group of column. Possibly with dplyr if that's possible.
Example:
IDs | p001_i1 | p001_i2 | p501_i1 | p501_i2 | p501_i3 |
---|---|---|---|---|---|
AA | NA | NA | 1 | NA | NA |
AB | 5 | NA | NA | NA | NA |
AC | NA | 10 | NA | NA | 2 |
Here for example I'd like to group the columns with similar starting name (here "p001" and "p501" but there can be many more, and different numbers of instances for each "pxxx" (p001 has 2 columns, p501 has 3)) and report their non-missing value for each row. The final result would be:
IDs | p001 | p501 |
---|---|---|
AA | NA | 1 |
AB | 5 | NA |
AC | 10 | 2 |
If there's multiple values for each row, it could for example get the mean or the value of the latest field (ie priority i3 > i2 > i1).
I've looked at across(), c_across() and pmap(), but I can't see how to implement that on each "group" of columns. Thanks!
CodePudding user response:
Here is a tidyverse
option
library(tidyverse)
df %>%
pivot_longer(
-IDs, names_pattern = c("(. )_(. )"), names_to = c("name", NA)) %>%
group_by(IDs, name) %>%
summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
pivot_wider()
## A tibble: 3 × 3
# IDs p001 p501
# <chr> <dbl> <dbl>
#1 AA NaN 1
#2 AB 5 NaN
#3 AC 10 2
Explanation: Reshape from wide to long (and only retain the first part of the wide column names "p001"
, "p501"
through using names_pattern
and names_to
), group by IDs
and name
, then calculate the mean (ignoring NA
s) and reshape again from long to wide.
Sample data
df <-read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2", header =T)
CodePudding user response:
Another tidyverse
option, but this preserves separate groups for the AC
column:
df %>%
pivot_longer(-1) %>%
group_by(IDs, name_prefix = str_extract(name, "^. ?(?=_)")) %>%
filter(!is.na(value)) %>%
slice_tail(n = 1) %>%
ungroup() %>%
pivot_wider(names_from = name_prefix, values_from = value, values_fill = NA) %>%
select(-name)
# A tibble: 4 × 3
IDs p501 p001
<chr> <int> <int>
1 AA 1 NA
2 AB NA 5
3 AC NA 10
4 AC 2 NA
Also, you can check this out: