Considering the data below (see also DB Fiddle example), how can I identify quasi-streaks of days (i.e. the rows that have contact day close by, but not necesarilly consecutive; I want to allow an arbitrary gap)?
Original columns in the data are person_id
and contact_day
, what I would like is to partition by person_id and the streak (group of days close by). My approach so far is to first calculate the number of days since last contact (days_last_contact
) and then try to use that to calculate the column dash_group
, which indicates which rows are within the threshold difference (in this case, 3 days). How can I calculate dash_group
?
person_id | contact_day | days_last_contact | dash_group |
---|---|---|---|
1 | 2015-02-09 | 1 | |
1 | 2015-05-01 | 81 | 2 |
1 | 2015-05-02 | 1 | 2 |
1 | 2015-05-03 | 1 | 2 |
1 | 2015-06-01 | 29 | 3 |
1 | 2015-08-01 | 61 | 4 |
1 | 2015-08-04 | 3 | 4 |
1 | 2015-09-01 | 28 | 5 |
2 | 2015-05-01 | 1 | |
2 | 2015-06-01 | 31 | 2 |
2 | 2015-07-01 | 30 | 3 |
3 | 2015-05-01 | 1 | |
3 | 2015-05-02 | 1 | 1 |
3 | 2015-05-04 | 2 | 1 |
3 | 2015-06-01 | 28 | 2 |
3 | 2015-06-02 | 1 | 2 |
3 | 2015-06-06 | 4 | 3 |
I calculate days_last_contact
by substracting the contact_day and it's 1-lag (partitioned by persion and sorted by date).
SELECT
contact_day - lag(contact_day, 1, NULL)
OVER (PARTITION BY person_id ORDER BY contact_day ASC)
AS days_last_contact
FROM mydata
;
But how can I then use that to group together the rows where days_last_contact
is below some threshold? (3 days in this example). So, in this example, dash_group
2 for person_id
1 identify the days May 1, 2 and 3 are close by, but then the next date for that person is June 1, which is too far (29 days since last contact, which is greater than the threshold of 3), so it gets a new dash_group
. Similarly, dash_group
4, groups together August 1 and August 4, because the difference is 3, but in the case of June 2 and June 6 (person 3), the difference is 4 and then they are classified in different groups.
After looking around, I've found for example this SO question where they point to the 'trick' #4 here, which is nicely hacky, but only works for consecutive dates / gapless series, and I need to allow for arbitrary gaps.
CodePudding user response:
Define your partition/pattern, from what you provided date data, the pattern is same personid, partition rank by month.
So the window clause should be order by person_id, yearandmonth then the partition should be person_id.
the key point is get/compute the pattern.So here I deduced/guess the pattern is year/month pattern.
alter table mydata add column dateym text;
update mydata set dateym = to_char(contact_day,'YYYY-MM');
SELECT
person_id,
contact_day,
dateym,
rank() OVER (PARTITION BY (person_id, dateym) ORDER BY person_id,
contact_day),
dense_rank() OVER (PARTITION BY person_id ORDER BY person_id, dateym)
FROM
mydata;
The gaps can be arbitrary, but it should be expressed via conditional expression.
CodePudding user response:
Using a recursive query:
WITH RECURSIVE zzz AS (
SELECT person_id
, contact_day
, md.days_last_contact
, row_number() OVER(PARTITION BY person_id ORDER BY contact_day)
AS dash_group
FROM mydata md
WHERE NOT EXISTS ( -- only the group *leaders*
SELECT * FROM mydata nx
WHERE nx.person_id = md.person_id
AND nx.contact_day < md.contact_day
AND nx.contact_day >= md.contact_day -3
)
UNION ALL
SELECT md.person_id
, md.contact_day
, md.days_last_contact
, zzz.dash_group
FROM zzz
JOIN mydata md ON md.person_id = zzz.person_id
AND md.contact_day > zzz.contact_day
AND md.contact_day <= zzz.contact_day 3
AND NOT EXISTS ( SELECT * -- eliminate the middle men ...
FROM mydata nx
WHERE nx.person_id = md.person_id
AND nx.contact_day > zzz.contact_day
AND nx.contact_day < md.contact_day
)
)
SELECT * FROM zzz
ORDER BY person_id,contact_day
;
There will probably be a shorter solution using window functions.