I am having some problems dealing with a database... I would appreciate a little help.
I want to count how many times an observation (e.g. registration car number) appears within a range of time (e.g. 90 days) in a database. For this, I have a numeric variable of the car number and the date in this format dd/mm/yyyy (the extent of the time series is clearly above 90 days). My intuition is about creating a new variable (mutate) that counts observations filtering by the variable date, but I don’t know how to introduce the time restriction. Any idea?
Let me show an example:
database
id date_dd/mm/yyyy
1 01/01/2021
1 01/02/2021
1 02/02/2021
1 03/02/2021
1 30/12/2021
2 05/07/2021
2 03/03/2021
2 04/12/2021
2 07/07/2021
12 01/05/2021
8 06/07/2021
My main goal is to delete the observations that are repeated more than three times in 90 days. In conclusion, the resulting database would be:
id date_dd/mm/yyyy
1 01/01/2021
1 01/02/2021
1 02/02/2021
1 30/12/2021
2 05/07/2021
2 03/03/2021
2 04/12/2021
2 07/07/2021
12 01/05/2021
8 06/07/2021
As I have explained above my idea has been to create a new variable that counts the number of times each id number is repeated in 90 days and then filter the data by deleting the observations that scores above 3 in the new variable but I don't have any idea of how to create this new variable with this time restriction.
Thanks in advance
CodePudding user response:
Here is one way you could do it:
- use
lubridate
symd
function to get the date format - group by
id
andarrange
- calculate the difference to the first date
- add a row number column
row
- filter for your conditions!
library(dplyr)
library(lubridate)
df %>%
mutate(date_dd.mm.yyyy = dmy(date_dd.mm.yyyy)) %>%
group_by(id) %>%
arrange(date_dd.mm.yyyy, .by_group = TRUE) %>%
mutate(diff = date_dd.mm.yyyy-first(date_dd.mm.yyyy)) %>%
mutate(row = row_number()) %>%
filter(row <=4 | diff < 90) %>%
select(-diff, -row)
id date_dd.mm.yyyy
<int> <date>
1 1 2021-01-01
2 1 2021-02-01
3 1 2021-02-02
4 1 2021-02-03
5 2 2021-03-03
6 2 2021-07-05
7 2 2021-07-07
8 2 2021-12-04
9 8 2021-07-06
10 12 2021-05-01
CodePudding user response:
Use lubridate
to convert your character string dates into Date
. Now you can use dplyr
to group and count the observations by id
, then compare each the observation to the date of the previous observation with lag()
:
library(tidyverse)
df %>%
group_by(id) %>%
mutate(
counter = seq_along(id),
within90days = ifelse(dt <= lag(dt) 90, TRUE, FALSE)
) %>%
filter(counter <= 3 | !within90days)