Home > front end >  dplyr: extraction of max values by a specific condition
dplyr: extraction of max values by a specific condition

Time:09-28

I have a data set (ds) with the variables date, values and break.. I'd like to create a rule where a preserved de high value between two events of break == "True" and the other values must be 0. In my example:

# Packages
library(dplyr)
library(lubridate)

# My data set
ds <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/accumulated_values_table")
ds$date <- ymd(ds$date)
ds <- ds %>%
  mutate(break. = break. == "True")
str(ds)
# 'data.frame': 4458 obs. of  4 variables:
#  $ id    : int  0 1 2 2 1 0 3 2 4 1 ...
#  $ date  : Date, format: "2020-11-26" "2020-11-26" "2020-11-26" "2020-11-27" ...
#  $ values: int  36 33 27 22 34 28 5 18 28 24 ...
#  $ break.: logi  FALSE FALSE FALSE FALSE TRUE FALSE ...

With this rule my desirable output for the ìd ==4 for example is the creation of values2 column below:

ds.id4 <- ds[ds$id==4,]
ds.id4 [1:11,]
#    id       date values break. values2
# 9   4 2020-11-28     28  FALSE      28
# 15  4 2020-11-29     19  FALSE       0
# 16  4 2020-11-30     21  FALSE       0
# 20  4 2020-12-01     24  FALSE       0
# 26  4 2020-12-02     20   TRUE       0
# 30  4 2020-12-03      8  FALSE       0
# 36  4 2020-12-04     11   TRUE      11
# 43  4 2020-12-05      0  FALSE       0
# 49  4 2020-12-06      0  FALSE       0
# 55  4 2020-12-07      1  FALSE       0
# 63  4 2020-12-08      3   TRUE       3
# ...

Please, could someone help me with it? Thanks in advance!

CodePudding user response:

You can create a new grouping variable with the help of cumsum in each id and keep the value if it is max in the group.

library(dplyr)

result <- ds %>%
  arrange(id) %>%
  group_by(id, group = cumsum(lag(break., default = TRUE))) %>%
  mutate(values2 = ifelse(values == max(values), values, 0)) %>%
  ungroup()

Checking the output.

result %>% filter(id == 4) %>% head(12)

#      id date       values break. group values2
#   <int> <date>      <int> <lgl>  <int>   <dbl>
# 1     4 2020-11-28     28 FALSE     51      28
# 2     4 2020-11-29     19 FALSE     51       0
# 3     4 2020-11-30     21 FALSE     51       0
# 4     4 2020-12-01     24 FALSE     51       0
# 5     4 2020-12-02     20 TRUE      51       0
# 6     4 2020-12-03      8 FALSE     52       0
# 7     4 2020-12-04     11 TRUE      52      11
# 8     4 2020-12-05      0 FALSE     53       0
# 9     4 2020-12-06      0 FALSE     53       0
#10     4 2020-12-07      1 FALSE     53       0
#11     4 2020-12-08      3 TRUE      53       3
#12     4 2020-12-09      0 FALSE     54       0

You can drop the group column if not needed with select(-group).

  • Related