Efficiently combining groupby, last and count in pandas

Time:03-10

From a list of logs, I want to get the number of active events at each timestamp for a specific event type.

A sample log input looks like this:

time id event
2022-03-01 10:00 1 A
2022-03-01 11:00 2 B
2022-03-01 12:00 3 A
2022-03-01 13:00 1 B
2022-03-01 14:00 4 A
2022-03-01 15:00 2 C
2022-03-01 16:00 1 A
... ... ...

What I want is the number of ids whose most recent event is A at each time in the df, as in the table below.

time eventA
2022-03-01 10:00 1
2022-03-01 11:00 1
2022-03-01 12:00 2
2022-03-01 13:00 1
2022-03-01 14:00 2
2022-03-01 15:00 2
2022-03-01 16:00 3
... ...

I achieved this with some basic pandas operations:

import pandas as pd

df = pd.DataFrame(
    {
        # hourly timestamps; use freq="H" on pandas older than 2.2
        "time": pd.date_range("2022-03-01 10:00", periods=7, freq="h"),
        "id": [1, 2, 3, 1, 4, 2, 1],
        "event": ["A", "B", "A", "B", "A", "C", "A"],
    }
)
timestamps = df.time
values = []
for timestamp in timestamps:
    # all log rows up to and including the current timestamp
    filtered_df = df.loc[df.time <= timestamp]
    # latest event per id, then the number of ids whose latest event is "A"
    eventA = filtered_df.groupby("id").last().groupby("event").count()["time"]["A"]
    values.append({"time": timestamp, "eventA": eventA})

df_count = pd.DataFrame(values)

In my case, though, I have to go over more than 50,000 rows, and this approach becomes very slow because it re-filters and re-groups the whole frame for every timestamp.

Is there a better approach to achieve the desired result? I guess there might be some pandas groupby aggregation methods that could help here, but I found none that did.

CodePudding user response:

df.set_index(['time', 'id']).unstack().ffill()\
    .stack().value_counts(['time', 'event']).unstack().fillna(0)

The first line gets the latest event of each id at each hour by forward-filling the NaNs:

                    event               
id                      1    2    3    4
time                                    
2022-03-01 10:00:00     A  NaN  NaN  NaN
2022-03-01 11:00:00     A    B  NaN  NaN
2022-03-01 12:00:00     A    B    A  NaN
2022-03-01 13:00:00     B    B    A  NaN
2022-03-01 14:00:00     B    B    A    A
2022-03-01 15:00:00     B    C    A    A
2022-03-01 16:00:00     A    C    A    A

The second line does the counting and thus produces:

event                  A    B    C
time                              
2022-03-01 10:00:00  1.0  0.0  0.0
2022-03-01 11:00:00  1.0  1.0  0.0
2022-03-01 12:00:00  2.0  1.0  0.0
2022-03-01 13:00:00  1.0  2.0  0.0
2022-03-01 14:00:00  2.0  2.0  0.0
2022-03-01 15:00:00  2.0  1.0  1.0
2022-03-01 16:00:00  3.0  0.0  1.0
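
If only the event A column is needed, the same forward-fill idea can be sketched a little more directly. This is a minimal sketch rather than the answer's exact code: it assumes each (time, id) pair appears at most once in the log (otherwise `pivot` raises an error), and the names `wide` and `event_a` are made up for illustration.

```python
import pandas as pd

# Sample log from the question.
df = pd.DataFrame(
    {
        "time": pd.date_range(
            "2022-03-01 10:00", periods=7, freq=pd.Timedelta(hours=1)
        ),
        "id": [1, 2, 3, 1, 4, 2, 1],
        "event": ["A", "B", "A", "B", "A", "C", "A"],
    }
)

# One column per id; forward-fill so every row holds each id's latest event.
# Assumption: (time, id) pairs are unique, otherwise pivot raises.
wide = df.pivot(index="time", columns="id", values="event").ffill()

# Count, per timestamp, the ids whose latest event is "A".
# NaN (id not seen yet) compares unequal to "A", so it is not counted.
event_a = wide.eq("A").sum(axis=1).rename("eventA")

df_count = event_a.reset_index()
print(df_count)
```

On the sample data this gives the eventA values 1, 1, 2, 1, 2, 2, 3, matching the desired table in the question. Note that the wide frame has one column per id, so memory grows with (number of timestamps) x (number of ids); for very many distinct ids that may become the limiting factor.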