Home > Software engineering >  Consolidating and summing columns in R with similar names
Consolidating and summing columns in R with similar names

Time:11-16

I need some help consolidating columns in R

I have ~130 columns, some of which have a similar name. For example, I have ~25 columns called "pathogen". However, after importing my datasheet into R, these colums are now listed as follows : pathogen..1, pathogen...2, etc. Because of how R renamed these columns, I'm not sure how to proceed.

I need to consolidate all my columns with the same/similar name, so that I have only 1 column called "pathogen". I also need this consolidated column to include the sums of all the consolidated columns called "pathogen".

here an example of my input

sample  Unidentified…1  Unidentified…2  Pathogen..1 Pathogen…2
1              5                 3             6            8
2              7                 2             1            0
3              8                 4             2            9
4              9                 6             4            0
5              0                 7             5            1

Here is my desired output

Sample  Unidentified    Pathogen
1            8           14
2            9            1
3            12           11
4            15           4
5            7            6

Any help would be really appreciated.

CodePudding user response:

Here I reshape long to make the column names more easily manipulable. I separate them into "stub" and "number" values and the default separator settings work fine. Then I sum the total values for each id-stub combo, and spread wide again.

library(tidyverse)
data.frame(
       check.names = FALSE,
            sample = c(1L, 2L, 3L, 4L, 5L),
  `Unidentified…1` = c(5L, 7L, 8L, 9L, 0L),
  `Unidentified…2` = c(3L, 2L, 4L, 6L, 7L),
       Pathogen..1 = c(6L, 1L, 2L, 4L, 5L),
      `Pathogen…2` = c(8L, 0L, 9L, 0L, 1L)
) %>%

  pivot_longer(-sample) %>%
  separate(name, c("stub","num")) %>%
  count(sample, stub, wt = value) %>%
  pivot_wider(names_from = "stub", values_from = "n")

Result

# A tibble: 5 × 3
  sample Pathogen Unidentified
   <int>    <int>        <int>
1      1       14            8
2      2        1            9
3      3       11           12
4      4        4           15
5      5        6            7

CodePudding user response:

Here is an option where you pivot to create the two groups and then you summarize.

library(tidyverse)

df |>
  pivot_longer(cols = -sample, 
               names_to = ".value", 
               names_pattern = "(\\w )") |>
  group_by(sample) |>
  summarise(across(everything(), sum))
#> # A tibble: 5 x 3
#>   sample Unidentified Pathogen
#>    <dbl>        <dbl>    <dbl>
#> 1      1            8       14
#> 2      2            9        1
#> 3      3           12       11
#> 4      4           15        4
#> 5      5            7        6

or with Base R

data.frame(
  sample = 1:5,
  Unidentified = rowSums(df[,grepl("Unidentified", colnames(df))]),
  Pathogen = rowSums(df[,grepl("Pathogen", colnames(df))])
)
#>   sample Unidentified Pathogen
#> 1      1            8       14
#> 2      2            9        1
#> 3      3           12       11
#> 4      4           15        4
#> 5      5            7        6

or another pivot option where we go long and then immediately go long and summarize the nested cells.

library(tidyverse)

df |>
  pivot_longer(-sample, names_pattern = "(\\w )") |>
  pivot_wider(names_from = name, 
              values_from = value, 
              values_fn = list(value = sum))
#> # A tibble: 5 x 3
#>   sample Unidentified Pathogen
#>    <dbl>        <dbl>    <dbl>
#> 1      1            8       14
#> 2      2            9        1
#> 3      3           12       11
#> 4      4           15        4
#> 5      5            7        6
  •  Tags:  
  • rsum
  • Related