For example, I want to convert this list
x = [False, True, True, True, True, False, True, True, False, True]
to ranges (start and end positions) of the True values:
[[1,4],
[6,7],
[9,9]]
This is obviously possible using a for loop. However, I am looking for other options that are faster and cleaner (one-liners are welcome, e.g. a list comprehension). Ideally, the approach would also be applicable to a pandas Series.
CodePudding user response:
A solution with pandas only:
import pandas as pd

s = pd.Series(x)
grp = s.eq(False).cumsum()  # every False increments the group label
arr = grp.loc[s.eq(True)] \
         .groupby(grp) \
         .apply(lambda g: [g.index.min(), g.index.max()])
Output:
>>> arr
1 [1, 4]
2 [6, 7]
3 [9, 9]
dtype: object
>>> arr.tolist()
[[1, 4], [6, 7], [9, 9]]
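To see why the cumulative sum works as a grouping key, it can help to look at the intermediate labels. The snippet below is only an illustrative sketch of the same idea; the names `true_pos` and `ranges` are mine, not part of the answer above.

```python
import pandas as pd

x = [False, True, True, True, True, False, True, True, False, True]
s = pd.Series(x)

# Every False increments the counter, so all True values that follow
# the same False share one label.
grp = s.eq(False).cumsum()

# Keep only the labels at True positions, then take the first and last
# index of each label.
true_pos = grp[s]
ranges = [[int(g.index.min()), int(g.index.max())]
          for _, g in true_pos.groupby(true_pos)]
print(ranges)
```

Here `grp` is 1 for positions 0 to 4, 2 for positions 5 to 7, and 3 for positions 8 and 9; dropping the False positions leaves exactly one label per run of True.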
Alternative:
start = s[s.eq(True) & s.shift(1).eq(False)].index
end = s[s.eq(True) & s.shift(-1, fill_value=False).eq(False)].index
print(list(zip(start, end)))
# Output:
[(1, 4), (6, 7), (9, 9)]
Performance:
# Solution 1
>>> %timeit s.eq(False).cumsum().loc[s.eq(True)].groupby(s.eq(False).cumsum()).apply(lambda x: [x.index.min(), x.index.max()])
1.22 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Solution 2
>>> %timeit list(zip(s[s.eq(True) & s.shift(1).eq(False)].index, s[s.eq(True) & s.shift(-1, fill_value=False).eq(False)].index))
603 µs ± 2.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
CodePudding user response:
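Option with numpy. If the previous value is False and the current value is True, the current position is the start of a True sequence. Conversely, if the previous value is True and the current value is False, the True sequence ended at the previous position (hence the end-1 below). Padding both ends with False ensures that sequences touching the first or last element are detected too.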
import numpy as np

z = np.concatenate(([False], x, [False]))
start = np.flatnonzero(~z[:-1] & z[1:])
end = np.flatnonzero(z[:-1] & ~z[1:])
np.column_stack((start, end-1))
array([[1, 4],
[6, 7],
[9, 9]], dtype=int32)
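If you want to reuse this on lists, arrays, or a pandas Series, the same steps can be wrapped in a small function. This is a minimal sketch; the function name `true_ranges` is hypothetical, and it assumes the input can be converted to a 1-D boolean array.

```python
import numpy as np

def true_ranges(x):
    """Return inclusive [start, end] index pairs for each run of True in x."""
    a = np.asarray(x, dtype=bool)
    # Pad with False on both sides so runs touching either edge still
    # produce a rising and a falling edge.
    z = np.concatenate(([False], a, [False]))
    start = np.flatnonzero(~z[:-1] & z[1:])  # False -> True transitions
    end = np.flatnonzero(z[:-1] & ~z[1:])    # True -> False transitions
    return np.column_stack((start, end - 1))

x = [False, True, True, True, True, False, True, True, False, True]
ranges = true_ranges(x)
print(ranges.tolist())
```

For an empty input or an input with no True values, this returns an empty array of shape (0, 2).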
CodePudding user response:
Here's a solution that uses scipy and pandas:
import pandas as pd
from scipy import ndimage

def boolean_vector2ranges(x):
    df1 = pd.DataFrame({'location': range(len(x)),
                        'bool': x})
    # Label each run of consecutive True values with a distinct positive integer (0 marks False).
    df1['group'] = ndimage.label(df1['bool'].astype(int))[0]
    return df1.loc[df1['group'] != 0].groupby('group')['location'].agg(['min', 'max'])

boolean_vector2ranges(x=[False, True, True, True, True, False, True, True, False, True])
returns,