How to aggregate and average various rows based on multiple groups while keeping other columns intac-CodePudding

I have spent hours trying to figure out how to do this but I am not sure what the best approach would be. I have species and environmental data with 2 replicates per site (for each particular day and month) and would like to combine data collected at the 'S' and 'E' tows for each particular site, month, and day. I am doing some analyses and want to combine the 'S' and 'E' tows so there is only one row of data for both tows in each site (by day and month). I am not sure how to verbally explain this very well so I will try to show an example to better explain myself.

Here is a shortened version of my data:

structure(list(month = c("11", "11", "11", "11", "11", "11", 
"7", "7", "7", "7", "7", "7", "8", "8", "8", "8"), day = c("4", 
"4", "4", "4", "5", "5", "20", "20", "27", "27", "27", "27", 
"16", "16", "16", "16"), Site = c(11L, 11L, 6L, 6L, 9L, 9L, 10L, 
10L, 13L, 13L, 2L, 3L, 4L, 5L, 5L, 6L), Tow = c("E", "S", "E", 
"S", "E", "S", "E", "S", "E", "S", "S", "S", "S", "E", "S", "E"
), Depth = c(10L, 11L, 22L, 22L, 12L, 13L, 13L, 13L, 19L, 19L, 
14L, 21L, 22L, 22L, 22L, 22L), Temp = c(12.75, 12.9, 14.25, 14.239, 
12.975, 12.955, 23.804, 23.804, 23.89, 23.9, 24.41, 24.04, 23.915, 
23.988, 24.021, 23.957), DO_mgL = c(10.54, 10.45, 10.16, 10.12, 
10.4, 10.39, 7.24, 7.11, 8.07, 8.1, 9.14, 1.29, 2.44, 2.45, 2.48, 
2.54), secchi = c(1.25, 1.25, 2.25, 2.25, 1.5, 1.5, 2.7, 2.7, 
2.1, 2.1, 2.75, 1.25, 2.8, 3, 3, 3.25), d.lept = c(0, 0, 0, 0, 
0, 0, 0.008037479, 0.155240934, 0.128494423, 0.025249815, 0.053921767, 
0.012391113, 0.069338871, 0.022259485, 0.013767903, 0.046661095
), d.byths = c(0, 0, 0, 0, 0, 0, 0, 0.007392425, 0, 0, 0, 0, 
0, 0.044518969, 0.013767903, 0.015553698), d.daph = c(0.140036552, 
1.010093452, 1.629907953, 2.762608821, 1.130338642, 1.311853781, 
0.031419235, 0.029569702, 0.0525659, 0.084166051, 0.024509894, 
0.049564452, 0.104008307, 0.133556908, 0.082607421, 0.062214794
)), row.names = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 33L, 34L, 
35L, 36L, 58L, 59L, 60L, 61L), class = "data.frame")

For example, my first two rows (month == 11, day == 4, Site == 11) should end up as 1 row with the 'Temp' and 'DO_mgL' of the two rows averaged, the 'secchi' readings are always the same for the 'S' and 'E' tows so that should stay the same and the species densities should be summed(added). The tow column can be removed after this is done. I want to end up with something like this (just showing example of what the first two rows should end up like).

month	day	Site	Depth	Temp	DO_mgL	secchi	d.lept	d.byths	d.daph
11	4	11	10.5	12.83	10.50	1.25	0	0	1.15013

To be honest, I am not even sure where to begin to achieve this. Doing the following achieves somewhat what I want for my species but this only does it for one species at a time (I have a total of 8 species, shortened in this example) and deletes other columns:

aggregate(d.lept ~ month   day   Site, data=zp1, FUN = sum)

Again, I need the 'S' and 'E' tows to be treated as one collection by:

having the 'Temp' and 'DO_mgL' averaged between the 'S' and 'E' tows for each site, day and month
keeping 'secchi' as is since the value is the same for each 'S' and 'E' combo
adding the species densities between the 'S' and 'E' tows for each site, day, and month

To further complicate things, due to time/weather constraints in the field sometimes we were not able to collect replicates so some sites only have data with 'S' tows, those should be left intact since there is only one row for those particular sites/day/months.

My entire dataset has 97 rows and 16 columns. I have a total of 24 sites that were sampled in July, August, and September. I have 8 species with their associated densities (derived from counts).

I looked at the following posts concerning the aggregate portion of my question, but they did not quite help me: here, here, and here.

I hope this is clear and makes sense, but I am happy to provide further clarification. Thank you for your time.

CodePudding user response：

What you're asking for can be done using a standard group_by summarise combination with the tidyverse. More reading can be found here.

library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test

df <- structure(list(month = c("11", "11", "11", "11", "11", "11", 
"7", "7", "7", "7", "7", "7", "8", "8", "8", "8"), day = c("4", 
"4", "4", "4", "5", "5", "20", "20", "27", "27", "27", "27", 
"16", "16", "16", "16"), Site = c(11L, 11L, 6L, 6L, 9L, 9L, 10L, 
10L, 13L, 13L, 2L, 3L, 4L, 5L, 5L, 6L), Tow = c("E", "S", "E", 
"S", "E", "S", "E", "S", "E", "S", "S", "S", "S", "E", "S", "E"
), Depth = c(10L, 11L, 22L, 22L, 12L, 13L, 13L, 13L, 19L, 19L, 
14L, 21L, 22L, 22L, 22L, 22L), Temp = c(12.75, 12.9, 14.25, 14.239, 
12.975, 12.955, 23.804, 23.804, 23.89, 23.9, 24.41, 24.04, 23.915, 
23.988, 24.021, 23.957), DO_mgL = c(10.54, 10.45, 10.16, 10.12, 
10.4, 10.39, 7.24, 7.11, 8.07, 8.1, 9.14, 1.29, 2.44, 2.45, 2.48, 
2.54), secchi = c(1.25, 1.25, 2.25, 2.25, 1.5, 1.5, 2.7, 2.7, 
2.1, 2.1, 2.75, 1.25, 2.8, 3, 3, 3.25), d.lept = c(0, 0, 0, 0, 
0, 0, 0.008037479, 0.155240934, 0.128494423, 0.025249815, 0.053921767, 
0.012391113, 0.069338871, 0.022259485, 0.013767903, 0.046661095
), d.byths = c(0, 0, 0, 0, 0, 0, 0, 0.007392425, 0, 0, 0, 0, 
0, 0.044518969, 0.013767903, 0.015553698), d.daph = c(0.140036552, 
1.010093452, 1.629907953, 2.762608821, 1.130338642, 1.311853781, 
0.031419235, 0.029569702, 0.0525659, 0.084166051, 0.024509894, 
0.049564452, 0.104008307, 0.133556908, 0.082607421, 0.062214794
)), row.names = c(1L, 2L, 3L, 4L, 21L, 22L, 23L, 24L, 33L, 34L, 
35L, 36L, 58L, 59L, 60L, 61L), class = "data.frame") %>% 
  as_tibble() %>% 
  clean_names()

df %>% 
  group_by(month, day, site) %>% 
  summarise(avg_temp = mean(temp),
            avg_do_mg_l = mean(do_mg_l),
            secchi = secchi,
            sum_d_lept = sum(d_lept),
            sum_d_byths = sum(d_byths),
            sum_d_daph = sum(d_daph),
            .groups = "drop")
#> # A tibble: 16 x 9
#>    month day    site avg_temp avg_do_mg_l secchi sum_d_lept sum_d_byths
#>    <chr> <chr> <int>    <dbl>       <dbl>  <dbl>      <dbl>       <dbl>
#>  1 11    4         6     14.2       10.1    2.25     0          0      
#>  2 11    4         6     14.2       10.1    2.25     0          0      
#>  3 11    4        11     12.8       10.5    1.25     0          0      
#>  4 11    4        11     12.8       10.5    1.25     0          0      
#>  5 11    5         9     13.0       10.4    1.5      0          0      
#>  6 11    5         9     13.0       10.4    1.5      0          0      
#>  7 7     20       10     23.8        7.18   2.7      0.163      0.00739
#>  8 7     20       10     23.8        7.18   2.7      0.163      0.00739
#>  9 7     27        2     24.4        9.14   2.75     0.0539     0      
#> 10 7     27        3     24.0        1.29   1.25     0.0124     0      
#> 11 7     27       13     23.9        8.09   2.1      0.154      0      
#> 12 7     27       13     23.9        8.09   2.1      0.154      0      
#> 13 8     16        4     23.9        2.44   2.8      0.0693     0      
#> 14 8     16        5     24.0        2.46   3        0.0360     0.0583 
#> 15 8     16        5     24.0        2.46   3        0.0360     0.0583 
#> 16 8     16        6     24.0        2.54   3.25     0.0467     0.0156 
#> # ... with 1 more variable: sum_d_daph <dbl>

^{Created on 2022-04-28 by the reprex package (v2.0.1)}

CodePudding user response：

It sounds like you want to collapse the rows in each group into one row (??).

With data.table:

library(data.table)
##
#
setDT(df)[, .(
  Temp    = mean(Temp),
  DO_mgL  = mean(DO_mgL),
  secchi  = mean(secchi),
  d.lept  = sum(d.lept),
  d.byths = sum(d.byths),
  d.daph  = sum(d.daph)
), by=.(month, day, Site)]

##     month day Site    Temp DO_mgL secchi     d.lept     d.byths     d.daph
##  1:    11   4   11 12.8250 10.495   1.25 0.00000000 0.000000000 1.15013000
##  2:    11   4    6 14.2445 10.140   2.25 0.00000000 0.000000000 4.39251677
##  3:    11   5    9 12.9650 10.395   1.50 0.00000000 0.000000000 2.44219242
##  4:     7  20   10 23.8040  7.175   2.70 0.16327841 0.007392425 0.06098894
##  5:     7  27   13 23.8950  8.085   2.10 0.15374424 0.000000000 0.13673195
##  6:     7  27    2 24.4100  9.140   2.75 0.05392177 0.000000000 0.02450989
##  7:     7  27    3 24.0400  1.290   1.25 0.01239111 0.000000000 0.04956445
##  8:     8  16    4 23.9150  2.440   2.80 0.06933887 0.000000000 0.10400831
##  9:     8  16    5 24.0045  2.465   3.00 0.03602739 0.058286872 0.21616433
## 10:     8  16    6 23.9570  2.540   3.25 0.04666109 0.015553698 0.06221479

setDT(df) converts your df to a data.table (no need for a tibble). The by=.(...) clause defines the groups, and the clause .(...) does the aggregating.