Get occurrences of at least n consecutive rows meeting a specific condition?

I have a dataframe df with a binary column target, and I want to find every run of at least n consecutive rows where df["target"] == 1, reported as (first index, last index) pairs.

I found a lot of answers about computing the occurrences of (exactly) n consecutive rows meeting a given condition on some dataframe columns. But they don't solve my problem.

Since the target column is finite, I can solve the problem with the following loop-based algorithm:

import pandas as pd

target = [0,0,1,1,1,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1]
df = pd.DataFrame(
    {"target" : target}
)
n = 3
groups = []
for i in range(df.size):
    if df["target"].iloc[i] == 0:
        continue
    group_index_min = df.index[i]
    for j in range(i, df.size):
        if df["target"].iloc[j] == 1:
            group_index_max = df.index[j]
        else:
            break
    current_group = (group_index_min, group_index_max)
    is_sub_group = False
    for group in groups:
        a, b = group
        if a <= group_index_min and group_index_max <= b:
            is_sub_group = True
    if (not is_sub_group) and (group_index_max - group_index_min + 1 >= n):
        groups.append(current_group)

groups
# >> [(2, 4), (9, 11), (31, 36)]

However, I would prefer a more Pythonic solution that uses the power of NumPy or pandas.

Can someone help me? Many thanks!

CodePudding user response:

Try:

# return the (first, last) index of a run, or None if it is shorter than n
get_group = lambda x: (x.index[0], x.index[-1]) if len(x) >= n else None

groups = df['target'].eq(0).cumsum()[df['target'].ne(0)].to_frame() \
                     .groupby('target').apply(get_group).dropna().tolist()
print(groups)

# Output:
[(2, 4), (9, 11), (31, 36)]
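
To see why this works: df['target'].eq(0).cumsum() increases by one at every 0, so all rows of a run of consecutive 1s share the same value, which can serve as a group key. Here is a minimal sketch of the same idea, unrolled into a function. The name runs_of_ones and the intermediate variable names are my own, not part of pandas, and the sketch assumes the column only contains 0s and 1s.

```python
import pandas as pd

def runs_of_ones(target: pd.Series, n: int) -> list:
    """Return (first_index, last_index) for each run of at least n consecutive 1s."""
    # Every 0 bumps the counter, so consecutive 1s share one run id.
    run_id = target.eq(0).cumsum()
    ones = target.eq(1)
    # Index labels of the rows equal to 1, grouped by their run id.
    labels = target.index.to_series()[ones]
    stats = labels.groupby(run_id[ones]).agg(["first", "last", "count"])
    long_enough = stats[stats["count"] >= n]
    return list(zip(long_enough["first"].tolist(), long_enough["last"].tolist()))

# Small example: runs at 0..0, 2..3 and 5..7; only the last two have length >= 2.
example = pd.Series([1, 0, 1, 1, 0, 1, 1, 1])
print(runs_of_ones(example, 2))
```

Aggregating with "first", "last" and "count" avoids returning None from apply and dropping it afterwards.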

CodePudding user response:

# find index values where a run of 1s starts (previous row is 0) or ends (next row is 0)
change_points = df[(df.target == 1) & ((df.target.shift(fill_value=0) == 0) | (df.target.shift(-1, fill_value=0) == 0))].index

# group change points into pairs, e.g. [1, 3, 4, 10] -> [(1, 3), (4, 10)]
groups_ = list(zip(change_points[::2], change_points[1::2]))

# keep only groups of minimal length
groups = [(a, b) for a, b in groups_ if b - a + 1 >= n]

The value of groups for the provided input data is [(2, 4), (9, 11), (31, 36)].
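
One caveat: pairing the change points assumes every run of 1s contributes two of them. An isolated 1 (a run of length 1) is both the start and the end of its run but appears only once, so the pairs after it get misaligned. The sample data has no such run, which is why the result is correct here. A possible NumPy variant that avoids this is sketched below; it detects starts and ends separately with np.diff. The function name runs_of_ones_np is made up for illustration, and it returns positions rather than index labels, so it matches the expected output only under the assumption of a default RangeIndex.

```python
import numpy as np

def runs_of_ones_np(target, n):
    """Return (first_pos, last_pos) for each run of at least n consecutive 1s."""
    values = np.asarray(target)
    # Pad with a 0 on both sides so runs touching either end are still bounded.
    padded = np.concatenate(([0], values, [0]))
    steps = np.diff(padded)
    starts = np.flatnonzero(steps == 1)       # position where a run begins
    ends = np.flatnonzero(steps == -1) - 1    # position where a run finishes
    keep = (ends - starts + 1) >= n
    return list(zip(starts[keep].tolist(), ends[keep].tolist()))

# An isolated 1 at position 0 is handled as its own run of length 1.
print(runs_of_ones_np([1, 0, 1, 1], 1))
```

With a non-default index, the positions can be mapped back to labels with df.index[starts] and df.index[ends].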