I need some help consolidating columns in R
I have ~130 columns, some of which have a similar name. For example, I have ~25 columns called "pathogen". However, after importing my datasheet into R, these colums are now listed as follows : pathogen..1, pathogen...2, etc. Because of how R renamed these columns, I'm not sure how to proceed.
I need to consolidate all my columns with the same/similar name, so that I have only 1 column called "pathogen". I also need this consolidated column to include the sums of all the consolidated columns called "pathogen".
here an example of my input
sample Unidentified…1 Unidentified…2 Pathogen..1 Pathogen…2
1 5 3 6 8
2 7 2 1 0
3 8 4 2 9
4 9 6 4 0
5 0 7 5 1
Here is my desired output
Sample Unidentified Pathogen
1 8 14
2 9 1
3 12 11
4 15 4
5 7 6
Any help would be really appreciated.
CodePudding user response:
Here I reshape long to make the column names more easily manipulable. I separate them into "stub" and "number" values and the default separator settings work fine. Then I sum the total values for each id-stub combo, and spread wide again.
library(tidyverse)
data.frame(
check.names = FALSE,
sample = c(1L, 2L, 3L, 4L, 5L),
`Unidentified…1` = c(5L, 7L, 8L, 9L, 0L),
`Unidentified…2` = c(3L, 2L, 4L, 6L, 7L),
Pathogen..1 = c(6L, 1L, 2L, 4L, 5L),
`Pathogen…2` = c(8L, 0L, 9L, 0L, 1L)
) %>%
pivot_longer(-sample) %>%
separate(name, c("stub","num")) %>%
count(sample, stub, wt = value) %>%
pivot_wider(names_from = "stub", values_from = "n")
Result
# A tibble: 5 × 3
sample Pathogen Unidentified
<int> <int> <int>
1 1 14 8
2 2 1 9
3 3 11 12
4 4 4 15
5 5 6 7
CodePudding user response:
Here is an option where you pivot to create the two groups and then you summarize.
library(tidyverse)
df |>
pivot_longer(cols = -sample,
names_to = ".value",
names_pattern = "(\\w )") |>
group_by(sample) |>
summarise(across(everything(), sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
or with Base R
data.frame(
sample = 1:5,
Unidentified = rowSums(df[,grepl("Unidentified", colnames(df))]),
Pathogen = rowSums(df[,grepl("Pathogen", colnames(df))])
)
#> sample Unidentified Pathogen
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6
or another pivot option where we go long and then immediately go long and summarize the nested cells.
library(tidyverse)
df |>
pivot_longer(-sample, names_pattern = "(\\w )") |>
pivot_wider(names_from = name,
values_from = value,
values_fn = list(value = sum))
#> # A tibble: 5 x 3
#> sample Unidentified Pathogen
#> <dbl> <dbl> <dbl>
#> 1 1 8 14
#> 2 2 9 1
#> 3 3 12 11
#> 4 4 15 4
#> 5 5 7 6