From a list of logs, I want to get the number of active events at each timestamp for a specific event type.
A sample log input looks like this:
time | id | event |
---|---|---|
2022-03-01 10:00 | 1 | A |
2022-03-01 11:00 | 2 | B |
2022-03-01 12:00 | 3 | A |
2022-03-01 13:00 | 1 | B |
2022-03-01 14:00 | 4 | A |
2022-03-01 15:00 | 2 | C |
2022-03-01 16:00 | 1 | A |
... | ... | ... |
What I want is basically how many ids have event A active at each time in the df, like in the table below.
time | eventA |
---|---|
2022-03-01 10:00 | 1 |
2022-03-01 11:00 | 1 |
2022-03-01 12:00 | 2 |
2022-03-01 13:00 | 1 |
2022-03-01 14:00 | 2 |
2022-03-01 15:00 | 2 |
2022-03-01 16:00 | 3 |
... | ... |
I achieved this with some basic pandas operations:
import pandas as pd

df = pd.DataFrame(
    {
        "time": pd.date_range("2022-03-01 10:00", periods=7, freq="H"),
        "id": [1, 2, 3, 1, 4, 2, 1],
        "event": ["A", "B", "A", "B", "A", "C", "A"],
    }
)

timestamps = df.time
values = []
for timestamp in timestamps:
    # keep only rows up to this timestamp, take each id's latest event,
    # then count how many ids currently have event "A"
    filtered_df = df.loc[df.time <= timestamp]
    eventA = filtered_df.groupby("id").last().groupby("event").count()["time"]["A"]
    values.append({"time": timestamp, "eventA": eventA})
df_count = pd.DataFrame(values)
In my case though, I have to go over >50,000 rows and this basic approach becomes very inefficient time-wise.
Is there a better approach to achieve the desired result? I guess there might be some pandas groupby aggregation methods that could help here, but I haven't found one that does.
CodePudding user response:
df.set_index(['time', 'id']).unstack().fillna(method='ffill')\
.stack().value_counts(['time', 'event']).unstack().fillna(0)
The first line takes care of getting the latest event from each id at each hour by forward-filling the NaNs:
event
id 1 2 3 4
time
2022-03-01 10:00:00 A NaN NaN NaN
2022-03-01 11:00:00 A B NaN NaN
2022-03-01 12:00:00 A B A NaN
2022-03-01 13:00:00 B B A NaN
2022-03-01 14:00:00 B B A A
2022-03-01 15:00:00 B C A A
2022-03-01 16:00:00 A C A A
The second line does the counting, which gives:
event A B C
time
2022-03-01 10:00:00 1.0 0.0 0.0
2022-03-01 11:00:00 1.0 1.0 0.0
2022-03-01 12:00:00 2.0 1.0 0.0
2022-03-01 13:00:00 1.0 2.0 0.0
2022-03-01 14:00:00 2.0 2.0 0.0
2022-03-01 15:00:00 2.0 1.0 1.0
2022-03-01 16:00:00 3.0 0.0 1.0
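If you only need the event A counts in the df_count shape from the question, a minimal follow-up sketch of pulling that column out of the result (the counts name, the .ffill() shorthand for fillna(method='ffill'), and the final reset_index line are my additions, not part of the chain above; assumes a pandas version where DataFrame.value_counts accepts a subset, i.e. >= 1.1):

# same pipeline as above, just assigned to a name so we can index into it
counts = (
    df.set_index(["time", "id"])
      .unstack()
      .ffill()                            # equivalent to fillna(method="ffill")
      .stack()
      .value_counts(["time", "event"])    # count ids per (time, event) pair
      .unstack()
      .fillna(0)
)

# keep only the "A" column and reshape it into the desired two-column frame
df_count = counts["A"].rename("eventA").reset_index()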