I have data that has multiple observations per week (SARS-CoV-2 detected in wastewater), but only one outcome per week (hospitalizations). Currently each row is by date of collection, and it also has "week" as a column. I would like to collapse the rows in the same week, but have the wastewater observation for the week be the average of the individual collections. Most weeks have 2 observations, but some have 1 and some have three.
Data currently looks like this:
week <- c(1,1,1,2,2,3)
col_a <- c(1,2,2,4,2,7)
col_b <- c(4,2,3,1,2,5)
col_c <- c(4,2,3,2,2,4)
df <- data.frame(week,col_a,col_b,col_c)
week date site_a site_b hosp
1: 1 1 1 4 0
2: 1 2 2 2 0
3: 1 3 2 3 3
4: 2 4 4 1 0
5: 2 5 2 2 2
6: 3 6 7 5 4
And I'd like it to look like this:
Week SiteA SiteB Hosp
1 1.667 3 3
2 3 1.5 2
3 7 5 4
When I try group_by, I get almost all the way there, but hospitalizations is now an average as well:
> df1%>%group_by(week)%>%summarise_all(funs(mean))
# A tibble: 3 × 5
week date site_a site_b hosp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 1.67 3 1
2 2 4.5 3 1.5 1
3 3 6 7 5 4
How do I group by week, and have the sites average but the hospitalizations sum?
Would love some help, thanks!
CodePudding user response:
This should do the trick:
#### Load Library ####
library(tidyverse)
#### Group & Summarise ####
df %>%
group_by(week) %>%
summarise(SiteA = mean(col_a),
SiteB = mean(col_b),
Hospital = mean(col_c))
Which gives you this:
# A tibble: 3 × 4
week SiteA SiteB Hospital
<dbl> <dbl> <dbl> <dbl>
1 1 1.67 3 3
2 2 3 1.5 2
3 3 7 5 4
CodePudding user response:
Just to include a base R answer
One thing I do all the time is use split
. You can split the data.frame by the week variable and apply a colMeans
function.
Using the sapply
ensures that the result is combined back into a data.frame - but usually needs a t
ranspose function to resemble the original data.frame:
t(sapply(split(df, df$week, drop = F), colMeans))
# week col_a col_b col_c
#1 1 1.666667 3.0 3
#2 2 3.000000 1.5 2
#3 3 7.000000 5.0 4