I have a fairly complicated problem and could not find an answer I could apply. In a dataframe with Work, Task and Date columns (String, String, Date), I would like to group by Work, but only when the dates are within 3 days of each other:
Work | Task | Date |
---|---|---|
Work 1 | Task 1 | 08-12-2021 |
Work 1 | Task 2 | 09-12-2021 |
Work 1 | Task 3 | 10-12-2021 |
Work 2 | Task 1 | 20-12-2021 |
Work 2 | Task 2 | 20-12-2021 |
Work 2 | Task 3 | 21-12-2021 |
Work 1 | Task 1 | 10-12-2022 |
Work 1 | Task 2 | 11-12-2022 |
Dataset:
In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['Work1', 'Task1', '08-12-2021'], ['Work1', 'Task2', '09-12-2021'],
   ...:      ['Work1', 'Task3', '10-12-2021'], ['Work2', 'Task1', '20-12-2021'],
   ...:      ['Work2', 'Task2', '20-12-2021'], ['Work2', 'Task3', '21-12-2021'],
   ...:      ['Work1', 'Task1', '10-12-2022'], ['Work1', 'Task2', '11-12-2022']],
   ...:     columns=['Work', 'Task', 'Date'])
What I tried: I calculated a dataframe with the grouped Work, the start date of the work (aggregated min date), the finish date of the work (aggregated max date), the first task and the last task:
df2 = df.groupby('Work').agg(Max=('Date', 'max'), Min=('Date', 'min')).reset_index()
df2["Last_Task"] = df2.merge(df, left_on=["Work", "Max"], right_on=["Work", "Date"])['Task']
df2["First_Task"] = df2.merge(df, left_on=["Work", "Min"], right_on=["Work", "Date"])['Task']
What I get:
Work | Start Date | Finish Date | First Task | Last Task |
---|---|---|---|---|
Work 1 | 08-12-2021 | 11-12-2022 | Task 1 | Task 2 |
Work 2 | 20-12-2021 | 21-12-2021 | Task 1 | Task 3 |
The problem is that a work cannot last more than 3 days, so I would like to group by Work but only within 3 days of each other. I tried, very unsuccessfully, to create a new column with the work start date to distinguish between the different works. What I want:
Work | Start Date | Finish Date | First Task | Last Task |
---|---|---|---|---|
Work 1 - 08-12-2021 | 08-12-2021 | 10-12-2021 | Task 1 | Task 3 |
Work 2 - 20-12-2021 | 20-12-2021 | 21-12-2021 | Task 1 | Task 3 |
Work 1 - 10-12-2022 | 10-12-2022 | 11-12-2022 | Task 1 | Task 2 |
Thank you
CodePudding user response:
Using `shift` and `cumsum` inside `groupby`:
output = (df.groupby((df['Work'] != df['Work'].shift()).cumsum())
            .agg(['first', 'last'])
            .drop('Work', axis=1))
output:
Task Date
first last first last
Work
1 Task1 Task3 08-12-2021 10-12-2021
2 Task1 Task3 20-12-2021 21-12-2021
3 Task1 Task2 10-12-2022 11-12-2022
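This can be checked end to end with the sample frame from the question (a sketch; note that this version keys only on changes in the Work column between consecutive rows, not on the dates themselves):

```python
import pandas as pd

df = pd.DataFrame(
    [['Work1', 'Task1', '08-12-2021'], ['Work1', 'Task2', '09-12-2021'],
     ['Work1', 'Task3', '10-12-2021'], ['Work2', 'Task1', '20-12-2021'],
     ['Work2', 'Task2', '20-12-2021'], ['Work2', 'Task3', '21-12-2021'],
     ['Work1', 'Task1', '10-12-2022'], ['Work1', 'Task2', '11-12-2022']],
    columns=['Work', 'Task', 'Date'])

# A new group starts whenever Work differs from the previous row
run_id = (df['Work'] != df['Work'].shift()).cumsum()
output = df.groupby(run_id).agg(['first', 'last']).drop('Work', axis=1)
print(output)
```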
CodePudding user response:
Here's how I would do something like this:
# Assuming the data is sorted by date:
# First, convert the date to a datetime.
df['Date'] = pd.to_datetime(df['Date'], format="%d-%m-%Y")
# then find the gaps of 3 or more days and label the consecutive runs
groupings = (df['Date'].diff().dt.days >= 3).cumsum()
# we can now group by these labels and find the min/max
final = df.groupby(groupings).agg({'Date': ['min','max'], 'Task': ['first','last']})
# Then, to flatten the multiindex, we can use:
final.columns = ['Start Date', 'End Date', 'First Task', 'Last Task']
Result:
Start Date End Date First Task Last Task
Date
0 2021-12-08 2021-12-10 Task1 Task3
1 2021-12-20 2021-12-21 Task1 Task3
2 2022-12-10 2022-12-11 Task1 Task2
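If you also want the "Work - start date" labels from the desired output, the same grouping can be relabeled afterwards. Here is one possible sketch (the label-building step at the end is my own addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame(
    [['Work1', 'Task1', '08-12-2021'], ['Work1', 'Task2', '09-12-2021'],
     ['Work1', 'Task3', '10-12-2021'], ['Work2', 'Task1', '20-12-2021'],
     ['Work2', 'Task2', '20-12-2021'], ['Work2', 'Task3', '21-12-2021'],
     ['Work1', 'Task1', '10-12-2022'], ['Work1', 'Task2', '11-12-2022']],
    columns=['Work', 'Task', 'Date'])

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Label runs separated by gaps of 3 or more days
groupings = (df['Date'].diff().dt.days >= 3).cumsum()

final = df.groupby(groupings).agg(
    {'Work': 'first', 'Date': ['min', 'max'], 'Task': ['first', 'last']})
final.columns = ['Work', 'Start Date', 'End Date', 'First Task', 'Last Task']

# Build the "Work - start date" label the question asks for
final.index = final['Work'] + ' - ' + final['Start Date'].dt.strftime('%d-%m-%Y')
final = final.drop('Work', axis=1)
print(final)
```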
Hope that helps!
EDIT for clarification:
If different Work items are close in date but need to stay in separate groups, you can use `groupby(['Work', groupings])` instead.
In addition, if you have something like the following:
Work | Date |
---|---|
T1 | 2022-01-03 |
T2 | 2022-01-05 |
T2 | 2022-01-07 |
T1 | 2022-01-09 |
you may want to use `groupings = (df.groupby('Work').Date.diff().dt.days >= 3).cumsum()` so that the two T1s appear in different groups.
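A quick sanity check of that per-Work variant on the four-row example above (a sketch):

```python
import pandas as pd

df = pd.DataFrame(
    {'Work': ['T1', 'T2', 'T2', 'T1'],
     'Date': pd.to_datetime(['2022-01-03', '2022-01-05',
                             '2022-01-07', '2022-01-09'])})

# Diff within each Work item, so gaps are measured per Work, not row order
groupings = (df.groupby('Work').Date.diff().dt.days >= 3).cumsum()

# The two T1 rows land in separate groups; the two T2 rows stay together
result = df.groupby(['Work', groupings]).Date.agg(['min', 'max'])
print(result)
```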