Home > Net >  R- Collapse rows by getting average values
R- Collapse rows by getting average values

Time:10-10

I have data that has multiple observations per week (SARS-CoV-2 detected in wastewater), but only one outcome per week (hospitalizations). Currently each row is by date of collection, and it also has "week" as a column. I would like to collapse the rows in the same week, but have the wastewater observation for the week be the average of the individual collections. Most weeks have 2 observations, but some have 1 and some have three.

Data currently looks like this:

week <- c(1,1,1,2,2,3)
col_a <- c(1,2,2,4,2,7)
col_b <- c(4,2,3,1,2,5)
col_c <- c(4,2,3,2,2,4)
df <- data.frame(week,col_a,col_b,col_c) 

   week date site_a site_b hosp
1:    1    1      1      4    0
2:    1    2      2      2    0
3:    1    3      2      3    3
4:    2    4      4      1    0
5:    2    5      2      2    2
6:    3    6      7      5    4

And I'd like it to look like this:

Week   SiteA   SiteB   Hosp
1      1.667   3       3
2      3       1.5     2 
3      7       5       4

When I try group_by, I get almost all the way there, but hospitalizations is now an average as well:

> df1%>%group_by(week)%>%summarise_all(funs(mean))
# A tibble: 3 × 5
   week  date site_a site_b  hosp
  <dbl> <dbl>  <dbl>  <dbl> <dbl>
1     1   2     1.67    3       1
2     2   4.5   3       1.5     1
3     3   6     7       5       4

How do I group by week, and have the sites average but the hospitalizations sum?

Would love some help, thanks!

CodePudding user response:

This should do the trick:

#### Load Library ####
library(tidyverse)

#### Group & Summarise ####
df %>% 
  group_by(week) %>% 
  summarise(SiteA = mean(col_a),
            SiteB = mean(col_b),
            Hospital = mean(col_c))

Which gives you this:

# A tibble: 3 × 4
   week SiteA SiteB Hospital
  <dbl> <dbl> <dbl>    <dbl>
1     1  1.67   3          3
2     2  3      1.5        2
3     3  7      5          4

CodePudding user response:

Just to include a base R answer

One thing I do all the time is use split. You can split the data.frame by the week variable and apply a colMeans function. Using the sapply ensures that the result is combined back into a data.frame - but usually needs a transpose function to resemble the original data.frame:

t(sapply(split(df, df$week, drop = F), colMeans))
#  week    col_a col_b col_c
#1    1 1.666667   3.0     3
#2    2 3.000000   1.5     2
#3    3 7.000000   5.0     4
  •  Tags:  
  • r
  • Related