I have the following data:
import pandas as pd
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 1, 2],
    "datetime": [
        pd.to_datetime("2020-01-01"),
        pd.to_datetime("2020-01-02"),
        pd.to_datetime("2020-01-03"),
        pd.to_datetime("2020-01-04"),
        pd.to_datetime("2020-01-04"),
        pd.to_datetime("2020-01-06"),
    ],
    "type": [
        "start",
        "end",
        "start",
        "start",
        "end",
        "end",
    ],
})
which looks like this:
   id   datetime   type
0   1 2020-01-01  start
1   1 2020-01-02    end
2   1 2020-01-03  start
3   2 2020-01-04  start
4   1 2020-01-04    end
5   2 2020-01-06    end
and I would like to group them so that, for each id, each start date is paired with its closest following end date in one row. Note that each id may occur multiple times. The desired result would be:
id  start_date    end_date
 1  2020-01-01  2020-01-02
 1  2020-01-03  2020-01-04
 2  2020-01-04  2020-01-06
The first approach that came to mind was looping and performing element-wise matching, but that seems quite inefficient, especially because the dataframe has >1m entries. My next thought was a pandas groupby, but I could not find any similar use cases.
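For reference, the element-wise matching I had in mind would look roughly like this (just a sketch of the slow approach, assuming every start has a later matching end):
# naive loop: for each id, pair every start with the closest unused end on or after it
rows = []
for pid, grp in df.sort_values("datetime").groupby("id"):
    starts = grp.loc[grp["type"] == "start", "datetime"].tolist()
    ends = grp.loc[grp["type"] == "end", "datetime"].tolist()
    for start in starts:
        end = next((e for e in ends if e >= start), None)
        if end is not None:
            ends.remove(end)
            rows.append({"id": pid, "start_date": start, "end_date": end})
result = pd.DataFrame(rows)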
Is there a more efficient way to group the data as I described it?
CodePudding user response:
One idea is to create a counter by GroupBy.cumcount and then use DataFrame.pivot:
# number the starts and the ends so the n-th start lines up with the n-th end
df['g'] = df.groupby('type').cumcount()
df = (df.pivot(index=['g', 'id'], columns='type', values='datetime')[['start', 'end']]
        .add_suffix('_date')
        .reset_index(level=0, drop=True))  # drop the helper counter level
print(df)
type start_date   end_date
id
1    2020-01-01 2020-01-02
1    2020-01-03 2020-01-04
2    2020-01-04 2020-01-06
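If start and end rows for different ids can interleave in any order, a per-id counter keeps the pairing aligned. A small variation on the same idea, starting again from the original df and assuming the rows are ordered by datetime within each id and every start has a matching end:
# count starts and ends separately within each id, so the n-th start of an id
# is paired with the n-th end of the same id
df['g'] = df.groupby(['id', 'type']).cumcount()
out = (df.pivot(index=['g', 'id'], columns='type', values='datetime')[['start', 'end']]
         .add_suffix('_date')
         .reset_index(level=0, drop=True))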
CodePudding user response:
Try groupby and agg with shift:
df.groupby(["id", df["type"].eq(df["type"].shift(-1))])["datetime"].agg(
start_date="first", end_date="last"
).reset_index(level=0).reset_index(drop=True)
   id start_date   end_date
0   1 2020-01-01 2020-01-02
1   1 2020-01-03 2020-01-04
2   2 2020-01-04 2020-01-06
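The boolean key above depends on the exact row order; for example, if one id's start/end pairs sit back-to-back with no other id's rows in between, they collapse into a single group. A sketch of a more explicit pairing that numbers each start per id (assuming every start has a matching end):
# sort, number each start/end pair per id, then pivot starts and ends into columns
paired = df.sort_values(["id", "datetime"]).assign(
    pair=lambda d: d["type"].eq("start").groupby(d["id"]).cumsum()
)
out = (paired.pivot(index=["id", "pair"], columns="type", values="datetime")[["start", "end"]]
             .add_suffix("_date")
             .reset_index("pair", drop=True)
             .reset_index())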