I've got some time-sorted data which tracks the beginning and end time of different events. For illustration purposes imagine I'm tracking when a set of light bulbs are turning on and off. My data is structured like so:
Bulb ID | Event (on/off) | Time (s) |
---|---|---|
1 | on | 2 |
2 | on | 5 |
1 | off | 6 |
3 | on | 8 |
3 | off | 10 |
2 | off | 14 |
I want to find the total time that at least one of the bulbs is switched on. So far my best idea is to map the Event column to a binary flag and take a cumsum of that column, then use numpy.diff and numpy.where to find the rows where the sum changes from 1 to 0 or 0 to 1, pair those up, and add the difference in time between each pair to a running total. So something like this:
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum()
df["cumulative"] = df["cumulative"].apply(lambda x: 1 if x >= 1 else 0)
switch_rows = df["Time (s)"][df["cumulative"].diff() != 0].tolist()
total_time = 0
for i in range(0, len(switch_rows), 2):
    total_time += switch_rows[i + 1] - switch_rows[i]
This works, but it's not very safe: it assumes the data starts and ends with all bulbs off, which is not necessarily the case. Is there a neater and/or safer way to do this, or should I stick with what I have and add checks for the initial system state?
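For reference, here is the approach above as a self-contained snippet run against the sample data (the bulbs jointly cover 2 s to 14 s, so the expected answer is 12 s):

```python
import pandas as pd

# Sample data from the table above
df = pd.DataFrame({
    "Bulb ID": [1, 2, 1, 3, 3, 2],
    "Event (on/off)": ["on", "on", "off", "on", "off", "off"],
    "Time (s)": [2, 5, 6, 8, 10, 14],
})

# Map on/off to +1/-1 and clip the running sum to a 0/1 "any bulb on" flag
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum().apply(lambda x: 1 if x >= 1 else 0)

# diff() is NaN on the first row, which compares != 0 and so
# conveniently picks up an initial switch-on as well
switch_rows = df["Time (s)"][df["cumulative"].diff() != 0].tolist()

# Pair up (on, off) transition times and sum the gaps
total_time = 0
for i in range(0, len(switch_rows), 2):
    total_time += switch_rows[i + 1] - switch_rows[i]

print(total_time)  # 12
```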
CodePudding user response:
Your solution might work, but it has a lot of ifs and buts. Try pd.pivot_table:
pd.pivot_table(data=df, values="Time (s)", columns="Event (on/off)", index="Bulb ID", aggfunc=np.sum)
This can then be used for further calculations.
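To illustrate one such calculation (assuming each bulb has exactly one on and one off event, as in the sample data): the pivot puts each bulb's on and off times side by side, so per-bulb durations follow directly. Note this answers a slightly different question than the one asked — summing per-bulb durations double-counts overlap, so it is not the union time:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Bulb ID": [1, 2, 1, 3, 3, 2],
    "Event (on/off)": ["on", "on", "off", "on", "off", "off"],
    "Time (s)": [2, 5, 6, 8, 10, 14],
})

# One row per bulb, with its "on" and "off" timestamps as columns
pivoted = pd.pivot_table(data=df, values="Time (s)",
                         columns="Event (on/off)", index="Bulb ID",
                         aggfunc=np.sum)

# Per-bulb lit duration: 4 s, 9 s, 2 s (sums to 15, not the 12 s union)
pivoted["duration"] = pivoted["off"] - pivoted["on"]
```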
CodePudding user response:
I figured out a solution using pandas.resample. I take the first two steps of my original solution, then pull out just the cumsum and time columns, set the time column as a timedelta index, and resample to a constant rate, as follows:
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum()
time_data = df[["cumulative"]].set_index(pd.TimedeltaIndex(data=df["Time (s)"], unit="s"))
time_data = time_data.resample("1s").pad()
Once I've got a constant sampling rate I can just count the rows where the value is non-zero (counting on the column itself, so I get a scalar rather than a per-column Series):
total_time = time_data["cumulative"][time_data["cumulative"] != 0].count()
If my sampling rate hadn't been in seconds I could have divided this count by my framerate, e.g. if I was working in intervals of 0.2 s (5 samples per second) then my total time is
total_time = time_data["cumulative"][time_data["cumulative"] != 0].count() / 5
This solution avoids the issues I had with my first solution and is much less unwieldy.
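End to end on the sample data from the question, this gives the expected 12 s. Two small substitutions for newer pandas versions: pd.to_timedelta in place of the TimedeltaIndex constructor, and .ffill() in place of the since-deprecated .pad() (they are equivalent):

```python
import pandas as pd

df = pd.DataFrame({
    "Bulb ID": [1, 2, 1, 3, 3, 2],
    "Event (on/off)": ["on", "on", "off", "on", "off", "off"],
    "Time (s)": [2, 5, 6, 8, 10, 14],
})

# Running count of how many bulbs are currently on
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum()

# Index by elapsed time and resample to a fixed 1 s grid, forward-filling
# the bulb count between events
time_data = df[["cumulative"]].set_index(
    pd.to_timedelta(df["Time (s)"], unit="s"))
time_data = time_data.resample("1s").ffill()

# Each non-zero sample represents one second with at least one bulb on
total_time = time_data["cumulative"][time_data["cumulative"] != 0].count()
print(total_time)  # 12
```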