R Rowise operation on partially similar columns-CodePudding

I have a df with multiple instances of each column. I'd like to "concatenate" all rows for each group of column. Possibly with dplyr if that's possible.

Example:

IDs	p001_i1	p001_i2	p501_i1	p501_i2	p501_i3
AA	NA	NA	1	NA	NA
AB	5	NA	NA	NA	NA
AC	NA	10	NA	NA	2

Here for example I'd like to group the columns with similar starting name (here "p001" and "p501" but there can be many more, and different numbers of instances for each "pxxx" (p001 has 2 columns, p501 has 3)) and report their non-missing value for each row. The final result would be:

IDs	p001	p501
AA	NA	1
AB	5	NA
AC	10	2

If there's multiple values for each row, it could for example get the mean or the value of the latest field (ie priority i3 > i2 > i1).

I've looked at across(), c_across() and pmap(), but I can't see how to implement that on each "group" of columns. Thanks!

CodePudding user response：

Here is a tidyverse option

library(tidyverse)
df %>%
    pivot_longer(
        -IDs, names_pattern = c("(. )_(. )"), names_to = c("name", NA)) %>%
    group_by(IDs, name) %>%
    summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider()
## A tibble: 3 × 3
#  IDs    p001  p501
#  <chr> <dbl> <dbl>
#1 AA      NaN     1
#2 AB        5   NaN
#3 AC       10     2

Explanation: Reshape from wide to long (and only retain the first part of the wide column names "p001", "p501" through using names_pattern and names_to), group by IDs and name, then calculate the mean (ignoring NAs) and reshape again from long to wide.

Sample data

df <-read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA  NA  NA  1   NA  NA
AB  5   NA  NA  NA  NA
AC  NA  10  NA  NA  2", header  =T)

CodePudding user response：

Another tidyverse option, but this preserves separate groups for the AC column:

df %>%
  pivot_longer(-1) %>%
  group_by(IDs, name_prefix = str_extract(name, "^. ?(?=_)")) %>%
  filter(!is.na(value)) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  pivot_wider(names_from = name_prefix, values_from = value, values_fill = NA) %>%
  select(-name)


# A tibble: 4 × 3
  IDs    p501  p001
  <chr> <int> <int>
1 AA        1    NA
2 AB       NA     5
3 AC       NA    10
4 AC        2    NA

Also, you can check this out:

How to get value of last non-NA column