Imagine that I have a table like this:
Start (day) | End (day) | Duration (days) | Status | Count |
---|---|---|---|---|
1 | 2 | 1 | 1 | 3 |
2 | 4 | 2 | 2 | 4 |
4 | 5 | 1 | 3 | 2 |
5 | 8 | 3 | 2 | 1 |
8 | 9 | 1 | 3 | 6 |
9 | 15 | 6 | 3 | 1 |
15 | 16 | 1 | 3 | 2 |
I want to merge adjacent rows that have the same status. The merged row should take its start day from the first row of the sequence and its end day from the last row. The duration and the count should be summed, so in this example I would want the following result, with the last three rows merged:
Start (day) | End (day) | Duration (days) | Status | Count |
---|---|---|---|---|
1 | 2 | 1 | 1 | 3 |
2 | 4 | 2 | 2 | 4 |
4 | 5 | 1 | 3 | 2 |
5 | 8 | 3 | 2 | 1 |
8 | 16 | 8 | 3 | 9 |
How should I do this?
Please, help me.
CodePudding user response:
One way is to use the cumsum trick with pandas.DataFrame.groupby.agg:
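# Give each run of consecutive equal Status values its own group id.
m = df["Status"].ne(df["Status"].shift()).cumsum()
# Aggregate each run: first start, last end, summed duration and count.
new_df = df.groupby(m, as_index=False).agg({"Start (day)": "first",
                                            "End (day)": "last",
                                            "Duration (days)": "sum",
                                            "Status": "first",
                                            "Count": "sum"})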
Output:
   Start (day)  End (day)  Duration (days)  Status  Count
0            1          2                1       1      3
1            2          4                2       2      4
2            4          5                1       3      2
3            5          8                3       2      1
4            8         16                8       3      9
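
To see why this works, here is a self-contained sketch that rebuilds the table from the question and prints the intermediate group key. The names `changed` and `run_id` are mine, chosen for illustration. As a small variation on the snippet above, this version groups with the default index and then calls `reset_index(drop=True)`, on the assumption that this sidesteps any version-dependent handling of `as_index=False` when the grouping key is not a column of the frame.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Start (day)": [1, 2, 4, 5, 8, 9, 15],
    "End (day)": [2, 4, 5, 8, 9, 15, 16],
    "Duration (days)": [1, 2, 1, 3, 1, 6, 1],
    "Status": [1, 2, 3, 2, 3, 3, 3],
    "Count": [3, 4, 2, 1, 6, 1, 2],
})

# True wherever Status differs from the previous row. The first row is
# always True because shift() leaves a missing value there.
changed = df["Status"].ne(df["Status"].shift())

# The running sum of those flags gives every consecutive run its own id.
run_id = changed.cumsum().rename("run_id")
print(run_id.tolist())  # [1, 2, 3, 4, 5, 5, 5]

# Aggregate each run, then discard the run id so the index is 0..n-1.
new_df = (
    df.groupby(run_id)
    .agg({
        "Start (day)": "first",
        "End (day)": "last",
        "Duration (days)": "sum",
        "Status": "first",
        "Count": "sum",
    })
    .reset_index(drop=True)
)
print(new_df)
```

Note that the two rows with status 2 get different ids (2 and 4) because they are not adjacent, so they are not merged. Grouping directly on "Status" would merge them, which is not what the question asks for.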