I have a fairly complicated problem and could not find an answer I could apply. In a dataframe with Work, Task and Date columns (String, String, Date), I would like to group by Work, but only when the dates are within 3 days of each other:
Work | Task | Date |
---|---|---|
Work 1 | Task 1 | 08-12-2021 |
Work 1 | Task 2 | 09-12-2021 |
Work 1 | Task 3 | 10-12-2021 |
Work 2 | Task 1 | 20-12-2021 |
Work 2 | Task 2 | 20-12-2021 |
Work 2 | Task 3 | 21-12-2021 |
Work 1 | Task 1 | 10-12-2022 |
Work 1 | Task 2 | 11-12-2022 |
Dataset:
In [1]: import pandas as pd
   ...: df = pd.DataFrame(
   ...:     [['Work1', 'Task1', '08-12-2021'], ['Work1', 'Task2', '09-12-2021'],
   ...:      ['Work1', 'Task3', '10-12-2021'], ['Work2', 'Task1', '20-12-2021'],
   ...:      ['Work2', 'Task2', '20-12-2021'], ['Work2', 'Task3', '21-12-2021'],
   ...:      ['Work1', 'Task1', '10-12-2022'], ['Work1', 'Task2', '11-12-2022']],
   ...:     columns=['Work', 'Task', 'Date'])
What I tried: I calculated a dataframe with the grouped Work, the start date of the work (aggregated min date), the finish date of the work (aggregated max date), the first task and the last task:
df2 = df.groupby('Work').agg(Max=('Date', 'max'), Min=('Date', 'min')).reset_index()
df2["Last_Task"] = df2.merge(df, left_on=["Work", "Max"], right_on=["Work", "Date"])['Task']
df2["First_Task"] = df2.merge(df, left_on=["Work", "Min"], right_on=["Work", "Date"])['Task']
What I get:
Work | Start Date | Finish Date | First Task | Last Task |
---|---|---|---|---|
Work 1 | 08-12-2021 | 11-12-2022 | Task 1 | Task 2 |
Work 2 | 20-12-2021 | 21-12-2021 | Task 1 | Task 3 |
The problem is that a work cannot last more than 3 days, so I would like to group by Work but only within 3 days of each other. I tried, very unsuccessfully, to create a new column with the work start date to distinguish between the different works. What I want:
Work | Start Date | Finish Date | First Task | Last Task |
---|---|---|---|---|
Work 1 - 08-12-2021 | 08-12-2021 | 10-12-2021 | Task 1 | Task 3 |
Work 2 - 20-12-2021 | 20-12-2021 | 21-12-2021 | Task 1 | Task 3 |
Work 1 - 10-12-2022 | 10-12-2022 | 11-12-2022 | Task 1 | Task 2 |
Thank you
CodePudding user response:
Using `shift` and `cumsum` inside `groupby`:
output = (df.groupby((df['Work'] != df['Work'].shift()).cumsum())
            .agg(['first', 'last'])
            .drop('Work', axis=1))
output:
Task Date
first last first last
Work
1 Task1 Task3 08-12-2021 10-12-2021
2 Task1 Task3 20-12-2021 21-12-2021
3 Task1 Task2 10-12-2022 11-12-2022
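This can be checked end to end with the sample frame from the question (a sketch; note that this version keys only on changes in the Work column between consecutive rows, not on the dates themselves):

```python
import pandas as pd

df = pd.DataFrame(
    [['Work1', 'Task1', '08-12-2021'], ['Work1', 'Task2', '09-12-2021'],
     ['Work1', 'Task3', '10-12-2021'], ['Work2', 'Task1', '20-12-2021'],
     ['Work2', 'Task2', '20-12-2021'], ['Work2', 'Task3', '21-12-2021'],
     ['Work1', 'Task1', '10-12-2022'], ['Work1', 'Task2', '11-12-2022']],
    columns=['Work', 'Task', 'Date'])

# A new group starts whenever Work differs from the previous row
run_id = (df['Work'] != df['Work'].shift()).cumsum()
output = df.groupby(run_id).agg(['first', 'last']).drop('Work', axis=1)
print(output)
```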
CodePudding user response:
Here's how I would do something like this:
# Assuming the data is sorted by date:
# First, convert the date to a datetime.
df['Date'] = pd.to_datetime(df['Date'], format="%d-%m-%Y")
# then find the gaps of 3 or more days and label the consecutive runs
groupings = (df['Date'].diff().dt.days >= 3).cumsum()
# we can now group by these labels and find the min/max
final = df.groupby(groupings).agg({'Date': ['min','max'], 'Task': ['first','last']})
# Then, to flatten the multiindex, we can use:
final.columns = ['Start Date', 'End Date', 'First Task', 'Last Task']
Result:
Start Date End Date First Task Last Task
Date
0 2021-12-08 2021-12-10 Task1 Task3
1 2021-12-20 2021-12-21 Task1 Task3
2 2022-12-10 2022-12-11 Task1 Task2
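If you also want the "Work - start date" labels from the desired output, the same grouping can be relabeled afterwards. Here is one possible sketch (the label-building step at the end is my own addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame(
    [['Work1', 'Task1', '08-12-2021'], ['Work1', 'Task2', '09-12-2021'],
     ['Work1', 'Task3', '10-12-2021'], ['Work2', 'Task1', '20-12-2021'],
     ['Work2', 'Task2', '20-12-2021'], ['Work2', 'Task3', '21-12-2021'],
     ['Work1', 'Task1', '10-12-2022'], ['Work1', 'Task2', '11-12-2022']],
    columns=['Work', 'Task', 'Date'])

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Label runs separated by gaps of 3 or more days
groupings = (df['Date'].diff().dt.days >= 3).cumsum()

final = df.groupby(groupings).agg(
    {'Work': 'first', 'Date': ['min', 'max'], 'Task': ['first', 'last']})
final.columns = ['Work', 'Start Date', 'End Date', 'First Task', 'Last Task']

# Build the "Work - start date" label the question asks for
final.index = final['Work'] + ' - ' + final['Start Date'].dt.strftime('%d-%m-%Y')
final = final.drop('Work', axis=1)
print(final)
```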
Hope that helps!
EDIT for clarification:
If different Work items are close in date but need to stay in separate groups, you can use `groupby(['Work', groupings])` instead.
In addition, if you have something like the following:
Work | Date |
---|---|
T1 | 2022-01-03 |
T2 | 2022-01-05 |
T2 | 2022-01-07 |
T1 | 2022-01-09 |
you may want to use `groupings = (df.groupby('Work').Date.diff().dt.days >= 3).cumsum()` so that the two T1s appear in different groups.
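A quick sanity check of that per-Work variant on the four-row example above (a sketch):

```python
import pandas as pd

df = pd.DataFrame(
    {'Work': ['T1', 'T2', 'T2', 'T1'],
     'Date': pd.to_datetime(['2022-01-03', '2022-01-05',
                             '2022-01-07', '2022-01-09'])})

# Diff within each Work item, so gaps are measured per Work, not row order
groupings = (df.groupby('Work').Date.diff().dt.days >= 3).cumsum()

# The two T1 rows land in separate groups; the two T2 rows stay together
result = df.groupby(['Work', groupings]).Date.agg(['min', 'max'])
print(result)
```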