Home > Mobile >  R Rowise operation on partially similar columns
R Rowise operation on partially similar columns

Time:10-14

I have a df with multiple instances of each column. I'd like to "concatenate" all rows for each group of column. Possibly with dplyr if that's possible.

Example:

IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2

Here for example I'd like to group the columns with similar starting name (here "p001" and "p501" but there can be many more, and different numbers of instances for each "pxxx" (p001 has 2 columns, p501 has 3)) and report their non-missing value for each row. The final result would be:

IDs p001 p501
AA NA 1
AB 5 NA
AC 10 2

If there's multiple values for each row, it could for example get the mean or the value of the latest field (ie priority i3 > i2 > i1).

I've looked at across(), c_across() and pmap(), but I can't see how to implement that on each "group" of columns. Thanks!

CodePudding user response:

Here is a tidyverse option

library(tidyverse)
df %>%
    pivot_longer(
        -IDs, names_pattern = c("(. )_(. )"), names_to = c("name", NA)) %>%
    group_by(IDs, name) %>%
    summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider()
## A tibble: 3 × 3
#  IDs    p001  p501
#  <chr> <dbl> <dbl>
#1 AA      NaN     1
#2 AB        5   NaN
#3 AC       10     2

Explanation: Reshape from wide to long (and only retain the first part of the wide column names "p001", "p501" through using names_pattern and names_to), group by IDs and name, then calculate the mean (ignoring NAs) and reshape again from long to wide.


Sample data

df <-read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA  NA  NA  1   NA  NA
AB  5   NA  NA  NA  NA
AC  NA  10  NA  NA  2", header  =T)

CodePudding user response:

Another tidyverse option, but this preserves separate groups for the AC column:

df %>%
  pivot_longer(-1) %>%
  group_by(IDs, name_prefix = str_extract(name, "^. ?(?=_)")) %>%
  filter(!is.na(value)) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  pivot_wider(names_from = name_prefix, values_from = value, values_fill = NA) %>%
  select(-name)


# A tibble: 4 × 3
  IDs    p501  p001
  <chr> <int> <int>
1 AA        1    NA
2 AB       NA     5
3 AC       NA    10
4 AC        2    NA

Also, you can check this out:

How to get value of last non-NA column

  • Related