I have a large dataframe (Merged) of several thousand lines and 20 columns. Each line is a record of a site visit. There are 400 sites in the dataframe and each has multiple lines for visits on different months. . Example:
Site | Date | Method | Result | ChangeDate |
---|---|---|---|---|
2200 | 01/07/2021 | 24 | 0.22 | 01/07/2021 |
2201 | 03/07/2021 | 21 | 0.01 | 01/01/2021 |
2203 | 03/07/2021 | 21 | 0.02 | 01/01/2021 |
2200 | 01/08/2021 | 21 | 0.05 | 01/07/2021 |
2201 | 03/08/2021 | 21 | 0.01 | 01/01/2021 |
2203 | 03/08/2021 | 21 | 0.02 | 01/01/2021 |
2200 | 01/09/2021 | 21 | 0.05 | 01/07/2021 |
And so on for all 400 sites over two years.
I need to calculate the mean result for each site, which I can do:
OPmean <- aggregate(Merged$Result, list(Merged$Site), FUN=mean, na.rm = TRUE
But I only want to take the means after the change date. So on site 2200 I would want to take the mean on these lines only:
Site | Date | Method | Result | ChangeDate |
---|---|---|---|---|
2200 | 01/08/2021 | 21 | 0.05 | 01/07/2021 |
2200 | 01/09/2021 | 21 | 0.05 | 01/07/2021 |
Leaving the first line for 2200 off because the date occurs on (or before) the change date.
For an output I would want a new dataframe something like this:
Site | Mean Result |
---|---|
2200 | 0.05 |
2201 | 0.01 |
2203 | 0.02 |
CodePudding user response:
Small change to your code
aggregate(
Result~Site,
data=df[df$Date>df$ChangeDate,],
FUN=mean
)
Site Result
1 2200 0.05
2 2201 0.01
3 2203 0.02
CodePudding user response:
@user2974951 gives a great base R answer. A tidyverse approach might be
library(dplyr)
Merged %>%
group_by(Site) %>%
filter(Date > ChangeDate) %>%
summarize(mr = mean(Result) # or optionally, mean(Result, na.rm = TRUE)