Home > Blockchain >  Pure python or itertools group a list of dates by days difference between each date
Pure python or itertools group a list of dates by days difference between each date

Time:10-26

Having a list of dates ordered:

[
datetime.date(2006, 8, 15),
datetime.date(2006, 9, 12),
datetime.date(2007, 8, 10),
datetime.date(2021, 4, 6),
datetime.date(2021, 4, 16),
datetime.date(2021, 4, 19)
...
]

I would like to have groups that contain dates that are a maximum of 30 days between all dates (the distance between first element of the group and the last of these group will be <= 30 days)

For example, using the previous list, I will get:

  • group_1 = [datetime.date(2006, 8, 15), datetime.date(2006, 9, 12)] (dates are <= 30 days between)
  • group_2 = [datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)] <=30
  • group_3 = [datetime.date(2007, 8, 10)] (no other dates related <= 30 days)

I tried to use iter-tools groupby but the key function doesn't allow 2 dates comparation like "lambda x,y: (x-y).days <= 30...." I don't know if I can use groupby for solve this problem or I need other itertools function. I know that I could build a python algorithm for it but I think that will exist a simple option to solve this but I didn't found it :(

Thanks!

CodePudding user response:

Here is a pandas solution that compares the dates with the next date and checks whether there are 30 days in between. It then assigns a group number with cumsum:

import pandas as pd
import datetime

data = [ datetime.date(2006, 8, 15), datetime.date(2006, 9, 12), datetime.date(2007, 8, 10), datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)]

df = pd.DataFrame(data, columns=['date'])
df['groups'] = ((df['date'] - df['date'].shift(1)) > pd.Timedelta(30, unit='D')).cumsum()

Output:

date groups
0 2006-08-15 0
1 2006-09-12 0
2 2007-08-10 1
3 2021-04-06 2
4 2021-04-16 2
5 2021-04-19 2

Or if you just want a list of lists as output: df.groupby(((df['date'] - df['date'].shift(1)) > pd.Timedelta(30, unit='D')).cumsum()).agg(list)['date'].to_list()

CodePudding user response:

An iterative solution with plain old for-loop is quite straightforward in this case.

I don't think it will be easy or efficient to use itertools to solve this problem since the grouping in this case depends on the context of the data, which will probably yield a O(N^2) solution, whereas iterative approach is O(N).

dts = [
datetime.date(2006, 8, 15),
datetime.date(2006, 9, 12),
datetime.date(2007, 8, 10),
datetime.date(2021, 4, 6),
datetime.date(2021, 4, 16),
datetime.date(2021, 4, 19)
]

def groupDateTimes(dts):
    i = 0
    ans = []
    group = []
    delta30days = datetime.timedelta(days=30)
    while i < len(dts):
        cur = dts[i]
        if not group:
            group.append(cur)
        elif cur - group[0] <= delta30days:
            group.append(cur)
        else:
            ans.append(group)
            group = [cur]
        i  = 1
    if group:
        ans.append(group)
    return ans

print(groupDateTimes(dts)) // [[datetime.date(2006, 8, 15), datetime.date(2006, 9, 12)], [datetime.date(2007, 8, 10)], [datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)]]

CodePudding user response:

itertools.groupby is meant for grouping items that are not dependent on each other. In your case it is both easier and clearer to simply iterate through the list of dates to build list of groups by inserting a new group when the current date is more than 30 days later than the first date of the current group:

dates = [
    datetime.date(2006, 8, 15),
    datetime.date(2006, 9, 12),
    datetime.date(2007, 8, 10),
    datetime.date(2021, 4, 6),
    datetime.date(2021, 4, 16),
    datetime.date(2021, 4, 19),
    datetime.date(2021, 5, 7)
]
threshold = datetime.timedelta(30)
groups = []
for date in dates:
    if not groups or date - group[0] > threshold:
        group = []
        groups.append(group)
    group.append(date)

groups would become:

[[datetime.date(2006, 8, 15), datetime.date(2006, 9, 12)],
 [datetime.date(2007, 8, 10)],
 [datetime.date(2021, 4, 6), datetime.date(2021, 4, 16), datetime.date(2021, 4, 19)],
 [datetime.date(2021, 5, 7)]]
  • Related