How to filter by range of dates in R?-CodePudding

I am having some problems dealing with a database... I would appreciate a little help.

I want to count how many times an observation (e.g. registration car number) appears within a range of time (e.g. 90 days) in a database. For this, I have a numeric variable of the car number and the date in this format dd/mm/yyyy (the extent of the time series is clearly above 90 days). My intuition is about creating a new variable (mutate) that counts observations filtering by the variable date, but I don’t know how to introduce the time restriction. Any idea?

Let me show an example:

database

    id  date_dd/mm/yyyy
    1   01/01/2021
    1   01/02/2021
    1   02/02/2021
    1   03/02/2021
    1   30/12/2021
    2   05/07/2021
    2   03/03/2021
    2   04/12/2021
    2   07/07/2021
    12  01/05/2021
    8   06/07/2021

My main goal is to delete the observations that are repeated more than three times in 90 days. In conclusion, the resulting database would be:

id  date_dd/mm/yyyy
1   01/01/2021
1   01/02/2021
1   02/02/2021
1   30/12/2021
2   05/07/2021
2   03/03/2021
2   04/12/2021
2   07/07/2021
12  01/05/2021
8   06/07/2021

As I have explained above my idea has been to create a new variable that counts the number of times each id number is repeated in 90 days and then filter the data by deleting the observations that scores above 3 in the new variable but I don't have any idea of how to create this new variable with this time restriction.

Thanks in advance

CodePudding user response：

Here is one way you could do it:

use lubridates ymd function to get the date format
group by id and arrange
calculate the difference to the first date
add a row number column row
filter for your conditions!

library(dplyr)
library(lubridate)
df %>% 
  mutate(date_dd.mm.yyyy = dmy(date_dd.mm.yyyy)) %>% 
  group_by(id) %>% 
  arrange(date_dd.mm.yyyy, .by_group = TRUE) %>% 
  mutate(diff = date_dd.mm.yyyy-first(date_dd.mm.yyyy)) %>% 
  mutate(row = row_number()) %>% 
  filter(row <=4 | diff < 90) %>% 
  select(-diff, -row)

      id date_dd.mm.yyyy
   <int> <date>         
 1     1 2021-01-01     
 2     1 2021-02-01     
 3     1 2021-02-02     
 4     1 2021-02-03     
 5     2 2021-03-03     
 6     2 2021-07-05     
 7     2 2021-07-07     
 8     2 2021-12-04     
 9     8 2021-07-06     
10    12 2021-05-01

CodePudding user response：

Use lubridate to convert your character string dates into Date. Now you can use dplyr to group and count the observations by id, then compare each the observation to the date of the previous observation with lag():

library(tidyverse)

df %>% 
  group_by(id) %>% 
  mutate(
    counter = seq_along(id), 
    within90days = ifelse(dt <= lag(dt)   90, TRUE, FALSE)
  ) %>% 
  filter(counter <= 3 | !within90days)