I have this dataset of subjects that are active during specified time periods.
start end name
0 00:00 00:10 a
1 00:10 00:20 b
2 00:00 00:20 c
3 00:00 00:10 d
4 00:10 00:15 e
5 00:15 00:20 a
The intervals are inclusive on the left (start) side and exclusive on the right (end) side.
There are always exactly three subjects active at any time. I want to increase the granularity of the data so that I have the three active subjects for each second; each second has three unique values.
This would be the desired result for the test case.
slot1 slot2 slot3
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 b c a
16 b c a
17 b c a
18 b c a
19 b c a
The order of the subjects inside the slots is irrelevant for now. The subjects can reappear in the data like "a" from 00:00 to 00:10 and then again from 00:15 to 00:20. The intervals can be at any second.
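For reference, a minimal reconstruction of the example frame used in the answers below (a sketch; the real data may of course differ):

```python
import pandas as pd

# Example data from the question: three subjects are active at any second
df = pd.DataFrame({
    'start': ['00:00', '00:10', '00:00', '00:00', '00:10', '00:15'],
    'end':   ['00:10', '00:20', '00:20', '00:10', '00:15', '00:20'],
    'name':  ['a', 'b', 'c', 'd', 'e', 'a'],
})
```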
CodePudding user response:
Route 1: One (costly but easy) way is to explode the data to the seconds, then merge 3 times:
import numpy as np
import pandas as pd

time_df = (('00:' + df[['start', 'end']])                   # 'MM:SS' -> 'HH:MM:SS'
    .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
    .astype(int)
    .apply(lambda x: np.arange(*x), axis=1)                 # seconds covered by each interval
    .to_frame('time')
    .assign(slot=df['name'])
    .explode('time')                                        # one row per (second, subject)
)
(time_df.merge(time_df, on='time', suffixes=['1', '2'])     # pair subjects active at the same second
    .query('slot1 < slot2')                                 # keep each unordered pair once
    .merge(time_df, on='time')
    .query('slot2 < slot')                                  # add the third, strictly ordered subject
)
Output:
time slot1 slot2 slot
2 0 a c d
11 1 a c d
20 2 a c d
29 3 a c d
38 4 a c d
47 5 a c d
56 6 a c d
65 7 a c d
74 8 a c d
83 9 a c d
92 10 b c e
101 11 b c e
110 12 b c e
119 13 b c e
128 14 b c e
139 15 a b c
148 16 a b c
157 17 a b c
166 18 a b c
175 19 a b c
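A quick sanity check on `time_df` (a self-contained sketch, rebuilding the example frame from the question): every second from 0 to 19 should carry exactly three subjects.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'start': ['00:00', '00:10', '00:00', '00:00', '00:10', '00:15'],
    'end':   ['00:10', '00:20', '00:20', '00:10', '00:15', '00:20'],
    'name':  ['a', 'b', 'c', 'd', 'e', 'a'],
})

# Same explode-to-seconds step as Route 1
time_df = (('00:' + df[['start', 'end']])
    .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
    .astype(int)
    .apply(lambda x: np.arange(*x), axis=1)
    .to_frame('time')
    .assign(slot=df['name'])
    .explode('time')
)

counts = time_df.groupby('time').size()
print(counts.eq(3).all())  # every second has exactly 3 active subjects
```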
Route 2: Another way is to cross merge then query the overlapping intervals:
df[['start', 'end']] = (('00:' + df[['start', 'end']])      # 'MM:SS' -> 'HH:MM:SS'
    .apply(lambda x: pd.to_timedelta(x).dt.total_seconds())
    .astype(int)
)
(df.merge(df, how='cross')
    .assign(start=lambda x: x.filter(like='start').max(axis=1),   # overlap start = later of the two starts
            end=lambda x: x.filter(like='end').min(axis=1))       # overlap end = earlier of the two ends
    .query('start < end & name_x < name_y')                       # keep truly overlapping, deduplicated pairs
    [['name_x', 'name_y', 'start', 'end']]
    .merge(df, how='cross')
    .assign(start=lambda x: x.filter(like='start').max(axis=1),
            end=lambda x: x.filter(like='end').min(axis=1))
    .query('start < end & name_y < name')                         # add the third, strictly ordered subject
    [['start', 'end', 'name_x', 'name_y', 'name']]
)
Output:
start end name_x name_y name
3 0 10 a c d
16 10 15 b c e
38 15 20 a b c
As you can see, this output is the same as the other one, just in the original interval form. Depending on your data, one route might be faster than the other.
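If you take Route 2 but still need the per-second layout, the interval rows can be exploded back to seconds. A sketch, assuming Route 2's result is stored as `res` (here rebuilt literally for the example):

```python
import numpy as np
import pandas as pd

# Route 2's interval output, written out literally for the sketch
res = pd.DataFrame({
    'start': [0, 10, 15], 'end': [10, 15, 20],
    'name_x': ['a', 'b', 'a'], 'name_y': ['c', 'c', 'b'],
    'name': ['d', 'e', 'c'],
})

# One row per second covered by each interval
per_second = (res
    .assign(time=lambda x: x.apply(lambda r: np.arange(r['start'], r['end']), axis=1))
    .explode('time')
    .set_index('time')[['name_x', 'name_y', 'name']]
)
print(per_second.loc[0].tolist())   # subjects active at second 0: ['a', 'c', 'd']
```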
Update: Since your data has exactly 3 slots at any time, you can easily do this with pivot. This is the best solution.
# time_df as in Route 1
(time_df.sort_values(['time', 'slot'])
    .assign(nums=lambda x: np.arange(len(x)) % 3)             # 0, 1, 2 within each second
    .pivot(index='time', columns='nums', values='slot')       # keyword args required in pandas >= 2.0
)
# in general, `.assign(nums=lambda x: x.groupby('time').cumcount())`
# also works instead of the above
Output:
nums 0 1 2
time
0 a c d
1 a c d
2 a c d
3 a c d
4 a c d
5 a c d
6 a c d
7 a c d
8 a c d
9 a c d
10 b c e
11 b c e
12 b c e
13 b c e
14 b c e
15 a b c
16 a b c
17 a b c
18 a b c
19 a b c