I've got some time-sorted data which tracks the beginning and end time of different events. For illustration purposes imagine I'm tracking when a set of light bulbs are turning on and off. My data is structured like so:
Bulb ID | Event (on/off) | Time (s) |
---|---|---|
1 | on | 2 |
2 | on | 5 |
1 | off | 6 |
3 | on | 8 |
3 | off | 10 |
2 | off | 14 |
I want to find the total time that at least one of the bulbs is switched on. So far my best idea is to map the Event column to a binary flag and take a cumsum of that column, then use numpy.diff and numpy.where to find the rows where the sum changes from 1 to 0 or 0 to 1, pair those up, and add the difference in time between each pair to a running total. So something like this:
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum()
df["cumulative"] = df["cumulative"].apply(lambda x: 1 if x >= 1 else 0)
switch_rows = df["Time (s)"][df["cumulative"].diff() != 0].tolist()
total_time = 0
for i in range(0, len(switch_rows), 2):
    total_time += switch_rows[i + 1] - switch_rows[i]
This works, but it's not very safe: it assumes the data starts and ends with all bulbs off, which is not necessarily the case. Is there a neater and/or safer way to do this, or should I stick with what I have and add checks for the initial system state?
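For reference, here is the approach above as a self-contained snippet run against the sample data (the bulbs jointly cover 2 s to 14 s, so the expected answer is 12 s):

```python
import pandas as pd

# Sample data from the table above
df = pd.DataFrame({
    "Bulb ID": [1, 2, 1, 3, 3, 2],
    "Event (on/off)": ["on", "on", "off", "on", "off", "off"],
    "Time (s)": [2, 5, 6, 8, 10, 14],
})

# Map on/off to +1/-1 and clip the running sum to a 0/1 "any bulb on" flag
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum().apply(lambda x: 1 if x >= 1 else 0)

# diff() is NaN on the first row, which compares != 0 and so
# conveniently picks up an initial switch-on as well
switch_rows = df["Time (s)"][df["cumulative"].diff() != 0].tolist()

# Pair up (on, off) transition times and sum the gaps
total_time = 0
for i in range(0, len(switch_rows), 2):
    total_time += switch_rows[i + 1] - switch_rows[i]

print(total_time)  # 12
```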
CodePudding user response:
Your solution might work, but it has a lot of ifs and buts. Try pd.pivot_table:
pd.pivot_table(data=df, values="Time (s)", columns="Event (on/off)", index="Bulb ID", aggfunc=np.sum)
This can then be used for further calculations.
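To illustrate one such calculation (assuming each bulb has exactly one on and one off event, as in the sample data): the pivot puts each bulb's on and off times side by side, so per-bulb durations follow directly. Note this answers a slightly different question than the one asked — summing per-bulb durations double-counts overlap, so it is not the union time:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Bulb ID": [1, 2, 1, 3, 3, 2],
    "Event (on/off)": ["on", "on", "off", "on", "off", "off"],
    "Time (s)": [2, 5, 6, 8, 10, 14],
})

# One row per bulb, with its "on" and "off" timestamps as columns
pivoted = pd.pivot_table(data=df, values="Time (s)",
                         columns="Event (on/off)", index="Bulb ID",
                         aggfunc=np.sum)

# Per-bulb lit duration: 4 s, 9 s, 2 s (sums to 15, not the 12 s union)
pivoted["duration"] = pivoted["off"] - pivoted["on"]
```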
CodePudding user response:
I figured out a solution using pandas.resample. I take the first two steps of my original solution, then pull out just the cumsum and time columns, set the time column as a timedelta index, and resample to a constant rate, as follows:
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum()
time_data = df[["cumulative"]].set_index(pd.TimedeltaIndex(data=df["Time (s)"], unit="s"))
time_data = time_data.resample("1s").pad()
Once I've got a constant sampling rate I can just count the rows where the value is non-zero (counting on the column itself, so I get a scalar rather than a per-column Series):
total_time = time_data["cumulative"][time_data["cumulative"] != 0].count()
If my sampling rate hadn't been in seconds I could have divided this count by my framerate, e.g. if I was working in intervals of 0.2 s (5 samples per second) then my total time is
total_time = time_data["cumulative"][time_data["cumulative"] != 0].count() / 5
This solution avoids the issues I had with my first solution and is much less unwieldy.
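End to end on the sample data from the question, this gives the expected 12 s. Two small substitutions for newer pandas versions: pd.to_timedelta in place of the TimedeltaIndex constructor, and .ffill() in place of the since-deprecated .pad() (they are equivalent):

```python
import pandas as pd

df = pd.DataFrame({
    "Bulb ID": [1, 2, 1, 3, 3, 2],
    "Event (on/off)": ["on", "on", "off", "on", "off", "off"],
    "Time (s)": [2, 5, 6, 8, 10, 14],
})

# Running count of how many bulbs are currently on
df["event_flag"] = df["Event (on/off)"].map({"on": 1, "off": -1})
df["cumulative"] = df["event_flag"].cumsum()

# Index by elapsed time and resample to a fixed 1 s grid, forward-filling
# the bulb count between events
time_data = df[["cumulative"]].set_index(
    pd.to_timedelta(df["Time (s)"], unit="s"))
time_data = time_data.resample("1s").ffill()

# Each non-zero sample represents one second with at least one bulb on
total_time = time_data["cumulative"][time_data["cumulative"] != 0].count()
print(total_time)  # 12
```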