How to merge columns with similar name and sum up their numeric content in R-CodePudding

I have a dataframe with columns containing counts of observations for several populations (rows). Similar to the following, but biggest: your text

df <- data.frame (pop = c("pop1", "pop2", "pop3", "pop4","pop5"),
                L1a  = c(0,0,0,1,2),
                 L1b = c(4,2,5,0,0),
                 L1c = c(1,3,0,3,4),
               L2 = c(0,3,1,2,0)
)

So I need to merge the columns that are related (that share certain string in the column name, for example the string "L1") and sum up the counts into them. For the previous example a merged dataframe could be the following:

df_merged <- data.frame (pop = c("pop1", "pop2", "pop3", "pop4","pop5"),
              L1  = c(5,5,5,4,6),
                 L2 = c(0,3,1,2,0)
)

df_merged

How could I do that?

I tried to find a post that could help me but there isn´t. I´ll really appreciate your answer

CodePudding user response：

We could use rowSums on the columns that starts_with 'L1' in column names

library(dplyr)
df %>%
   mutate(L1 = rowSums(across(starts_with('L1'))), .keep = "unused", .after = pop)

-output

   pop L1 L2
1 pop1  5  0
2 pop2  5  3
3 pop3  5  1
4 pop4  4  2
5 pop5  6  0

If there are many such blocks of columns, use split.default to split into a list of common columns, do the rowSums by looping over the list (map) and then bind with the subset of original data

library(purrr)
library(stringr)
df %>%
   select(where(is.numeric)) %>% 
   split.default(str_remove(names(.), "[a-z] $")) %>%
    map_dfc(rowSums, na.rm = TRUE) %>% 
    bind_cols(df %>% 
     select(-where(is.numeric)), .)

-output

    pop L1 L2
1 pop1  5  0
2 pop2  5  3
3 pop3  5  1
4 pop4  4  2
5 pop5  6  0

Or use pivot_longer to reshape to 'long' format before doing the group by sum

library(tidyr)
df %>% 
   pivot_longer(cols = where(is.numeric), names_to = ".value", 
      names_pattern = "(L\\d ).*") %>%
   group_by(pop) %>% 
   summarise(across(everything(), sum, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 5 × 3
  pop      L1    L2
  <chr> <dbl> <dbl>
1 pop1      5     0
2 pop2      5     3
3 pop3      5     1
4 pop4      4     2
5 pop5      6     0

CodePudding user response：

rowwise() with c_across() lets one apply arbitrary functions over the inputs , in this case sum()

df1 |> 
  rowwise() |> 
  mutate(L1=sum(c_across(contains("L1"))),
         .keep="unused",.after=pop) |> 
  ungroup()