Home > OS >  Pandas: Group start/end events into intervals
Pandas: Group start/end events into intervals

Time:10-12

I have the following data:

import pandas as pd

df = pd.DataFrame({
    "id": [1,1,1,2,1,2],
    "datetime": [
        pd.to_datetime("2020-01-01"),
        pd.to_datetime("2020-01-02"),
        pd.to_datetime("2020-01-03"),
        pd.to_datetime("2020-01-04"),
        pd.to_datetime("2020-01-04"),
        pd.to_datetime("2020-01-06"),
        ],
    "type": [
        "start",
        "end",
        "start",
        "start",
        "end",
        "end"
    ]
})

which looks like this

    pid datetime    type
0   1   2020-01-01  start
1   1   2020-01-02  end
2   1   2020-01-03  start
3   2   2020-01-04  start
4   1   2020-01-04  end
5   2   2020-01-06  end

and I would like to group them in a way that for each pid, a combination of start and closest end dates are grouped in one row. Note that each pid may occur multiple times. So desired result would be:

pid  start_date  end_date
1   2020-01-01  2020-01-02
1   2020-01-03  2020-01-04
2   2020-01-04  2020-01-06

First approach that came to my mind is looping and perform element-wise matching but that seems quite inefficient to me. Especially because the dataframe has >1m entries. Next thing was a pandas groupby, but I could not find any similar use cases.

Is there a more efficient way to group the data as I described it?

CodePudding user response:

One idea is create counter by GroupBy.cumcount and then use DataFrame.pivot:

df['g'] = df.groupby('type').cumcount()

df = (df.pivot(['g','id'], 'type','datetime')[['start','end']]
        .add_suffix('_date')
        .reset_index(level=0, drop=True))
print (df)
type start_date   end_date
id                        
1    2020-01-01 2020-01-02
1    2020-01-03 2020-01-04
2    2020-01-04 2020-01-06

CodePudding user response:

Try groupby and agg with shift:

df.groupby(["id", df["type"].eq(df["type"].shift(-1))])["datetime"].agg(
    start_date="first", end_date="last"
).reset_index(level=0).reset_index(drop=True)

   id start_date   end_date
0   1 2020-01-01 2020-01-02
1   1 2020-01-03 2020-01-04
2   2 2020-01-04 2020-01-06
>>> 
  • Related