I am trying to calculate the number of apples
and pears
lost from fruit stands (plots
) at a market in a given year
. Here, a loss would be defined as when the number of apples or pears decreases but remains at that number and does not increase again in that plot for that year. In other words, an apple or pear can be lost from the plot, but if another is added (e.g. re-stock), then this does not constitute a "loss". I am looking to summarize the number of apples lost and the number of pears lost by year, and also by year and plot. The date order is important here (i.e. a loss cannot happen from a date in the future to a date in the past), but I have already sorted my dataset by year so this should not be an issue.
Here is an example of the data:
table <- "date year plot apples pears
1 2021-05-26 2020 a 1 1
2 2021-05-27 2020 a 1 1
3 2021-05-28 2020 a 0 1
4 2021-05-29 2020 a 1 1
5 2021-05-30 2020 a 1 1
6 2021-05-27 2021 b 2 1
7 2021-05-28 2021 b 2 1
8 2021-05-29 2021 b 1 0
9 2021-05-30 2021 b 1 0
10 2021-05-31 2021 b 1 0
11 2021-05-27 2021 c 1 0
12 2021-05-28 2021 c 1 1
13 2021-05-29 2021 c 0 1
14 2021-05-30 2021 c 0 1
15 2021-05-31 2021 c 0 1"
Based on this example, you would expect:
- In 2020, there were no apples lost and no pears lost (the number did not decrease and remain at that decreased number).
- In 2021, there were two apples lost (one in plot b and one in plot c) and 1 pear lost (in plot b)
Which as an output would look similar to this summarized by year:
table <- "date year apples.lost pears.lost
1 2020 0 0
2 2021 2 1"
Or this if also grouped by plot:
table <- "date year plot apples.lost pears.lost
1 2020 a 0 0
2 2021 b 1 1
3 2021 c 1 0"
I have spent hours trying to figure out how to do this and I cannot come up with viable code. I can calculate increases/decreases in datasets, based on resources such as this, but I cannot seem to find a way to work in counting only decreases that remain at that number for the remainder of the year in that specific plot.
CodePudding user response:
Using the dplyr
package:
library(dplyr)
df %>%
group_by(year, plot) %>%
summarise(apples.lost = max(first(apples) - last(apples), 0),
pears.lost = max(first(pears) - last(pears), 0)) %>%
ungroup()
#> # A tibble: 3 x 4
#> year plot apples.lost pears.lost
#> <dbl> <chr> <dbl> <dbl>
#> 1 2020 a 1 0
#> 2 2021 b 1 1
#> 3 2021 c 1 0
To get the total sum per year, you'd summarise it again:
df %>%
group_by(year, plot) %>%
summarise(apples.lost = max(first(apples) - last(apples), 0),
pears.lost = max(first(pears) - last(pears), 0)) %>%
group_by(year) %>%
summarise(apples.lost = sum(apples.lost),
pears.lost = sum(pears.lost))
#> # A tibble: 2 x 3
#> year apples.lost pears.lost
#> <dbl> <dbl> <dbl>
#> 1 2020 1 0
#> 2 2021 2 1
CodePudding user response:
A data.table
solution:
library(data.table)
dt <- fread(text = "obs date year plot apples pears
1 2021-05-26 2020 a 1 1
2 2021-05-27 2020 a 1 1
3 2021-05-28 2020 a 0 1
4 2021-05-29 2020 a 1 1
5 2021-05-30 2020 a 1 1
6 2021-05-27 2021 b 2 1
7 2021-05-28 2021 b 2 1
8 2021-05-29 2021 b 1 0
9 2021-05-30 2021 b 1 0
10 2021-05-31 2021 b 1 0
11 2021-05-27 2021 c 1 0
12 2021-05-28 2021 c 1 1
13 2021-05-29 2021 c 0 1
14 2021-05-30 2021 c 0 1
15 2021-05-31 2021 c 0 1")
dt[, .(apples.lost = max(0L, first(apples) - last(apples)), pears.lost = max(0L, first(pears) - last(pears))), by = year:plot]
#> year plot apples.lost pears.lost
#> 1: 2020 a 0 0
#> 2: 2021 b 1 1
#> 3: 2021 c 1 0