I have a system which records sanctions against clients' names.
There should only ever be one active sanction, yet there are some cases where there are multiple active sanctions.
I would like to know how I can count how many people had two or more simultaneously-active sanctions over the past three years (sample data ranges from 2019-2022, so this won't need to be filtered in the solution).
The way I would work this out is to detect those cases where start_date2
occurs before end_date1
.
Sample data (note that the end_date
values are random, so there may be several cases of them occurring before their respective start_date
values, but bear in mind that this is just sample data, so take it with a pinch of salt):
set.seed(147)
sanc <-
data.frame(
client = rep(1:200, each = 5),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
)
sanc$start_month_year = format(as.Date(sanc$start_date, "%Y-%m-%d"), "%Y-%m")
The algorithm in my mind goes like this:
for each client
check if there was more than one active sanction at any one time
look for cases where start_date2/start_date3/start_dateY occurs before end_date1/end_date2/end_dateX
group by month-year (using month_year column)
The output I am looking for is a monthly breakdown, indicating how many simultaneous sanctions occurred per month. Something like this:
01-2020: 10
02-2020: 35
03-2020: 29
...
01-2022: 5
I believe that I have covered everything, but am happy to clarify anything where required/requested.
CodePudding user response:
Updated, given clarifications in comment section
If we do this without regard to client, then we have something like this:
sanc %>% arrange(start_date) %>%
mutate(same_as_prev = start_date<lag(end_date) |row_number()==1 & end_date>lead(start_date)) %>%
group_by(start_month_year) %>%
summarize(simActive = sum(same_as_prev))
Output:
# A tibble: 37 x 2
start_month_year simActive
<chr> <int>
1 2019-01 29
2 2019-02 26
3 2019-03 30
4 2019-04 26
5 2019-05 25
6 2019-06 19
7 2019-07 19
8 2019-08 26
9 2019-09 21
10 2019-10 23
# ... with 27 more rows
It seems that in your sample data, all the clients have only one row, so I've adjusted it so that each of 200 clients has 5 rows. I then do something rather simple:
sanc %>% as_tibble() %>%
group_by(client, active = cumsum(start_date>lag(end_date) & row_number()>1)) %>%
filter(n()>1) %>%
ungroup() %>%
distinct(client, active) %>%
count(client, name="simActive")
This returns a list of clients, along with the number of times the client had simultaneous active sanctions.
Output:
# A tibble: 193 x 2
client simActive
<int> <int>
1 1 1
2 2 1
3 3 2
4 4 1
5 5 2
6 6 2
7 7 1
8 8 1
9 9 1
10 10 1
# ... with 183 more rows
So for client 1, there was one time when there was 2 or more active sanctions. The data for client one (see input below) looks like this, and this client had rows 3 and 4 active at the same time.
client start_date end_date start_month_year
1 1 2019-03-18 2019-09-25 2019-03
2 1 2020-10-19 2019-12-03 2020-10
3 1 2021-03-11 2019-11-26 2021-03
4 1 2020-07-06 2021-09-03 2020-07
5 1 2021-05-11 2019-09-06 2021-05
Input:
set.seed(147)
sanc <-
data.frame(
client = rep(1:200, each = 5),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000),
end_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-01-01"), by = "day"), 1000)
)
sanc$start_month_year = format(as.Date(sanc$start_date, "%Y-%m-%d"), "%Y-%m")
CodePudding user response:
Here is another way to do it. It might not be very performant, but the approach should yield the correct results. See my inline comments for how it works. Further note, that I adjusted your sample data. You did just sample random start and end dates without making sure that start_date < end_date
. I changed this so that each start_date
is smaller than its end_date
.
set.seed(147)
library(dplyr)
sanc <-
tibble(
client = sample(1:500, 1000, replace = TRUE),
start_date = sample(seq(as.Date("2019-01-01"), as.Date("2022-06-01"), by = "day"), 1000),
end_date = round(runif(1000, min = 1, max = 150), 0 ) start_date
)
sanc %>%
# make each sanction an `lubridate::interval`
mutate(int = interval(start_date, end_date)) %>%
# group_by month and client
group_by(month = format(start_date, "%Y-%m"), client) %>%
# use `lubridate::int_overlaps` to compare all intervals
summarise(overlap = list(outer(int, int, int_overlaps))) %>%
# apply to each row ...
rowwise() %>%
# to get only the lower triangle of each matrix and sum it up
mutate(overlap = sum(overlap[lower.tri(overlap)])) %>%
# now group by month
group_by(month) %>%
# and how many individuals in each month have more than one active sanction
summarise(overlap = sum(overlap))
#> `summarise()` has grouped output by 'month'. You can override using the `.groups` argument.
#> # A tibble: 42 x 2
#> month overlap
#> <chr> <int>
#> 1 2019-01 0
#> 2 2019-02 0
#> 3 2019-03 0
#> 4 2019-04 0
#> 5 2019-05 0
#> 6 2019-06 1
#> 7 2019-07 1
#> 8 2019-08 2
#> 9 2019-09 1
#> 10 2019-10 3
#> # ... with 32 more rows
Created on 2022-03-09 by the reprex package (v2.0.1)