Home > OS >  Search N consecutive rows with same value in one dataframe
Search N consecutive rows with same value in one dataframe

Time:01-21

I need to create a python code to search "N" as variable, consecutive rows in a column dataframe with the same value and different that NaN like this. I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea that how can do it?

Fruit 2 matches 5 matches
Apple No No
NaN No No
Pear No No
Pear Yes No
Pear Yes No
Pear Yes No
Pear Yes Yes
NaN No No
NaN No No
NaN No No
NaN No No
NaN No No
Banana No No
Banana Yes No

Update: testing solutions by @Corralien

counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
            .transform('cumcount').add(1)  # cumulative counter
            .where(df['Fruit'].notna(), other=0))  # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')

VSCode return me the 'Frame skipped from debugging during step-in.' message when execute the last line and generate an exception in the previous for loop.

CodePudding user response:

Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:

counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
            .transform('cumcount').add(1)  # cumulative counter
            .where(df['Fruit'].notna(), other=0))  # set NaN to 0

N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})

N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})

Output:

>>> df
     Fruit 2 matches 5 matches
0    Apple        No        No
1      NaN        No        No
2     Pear        No        No
3     Pear       Yes        No
4     Pear       Yes        No
5     Pear       Yes        No
6     Pear       Yes       Yes
7      NaN        No        No
8      NaN        No        No
9      NaN        No        No
10     NaN        No        No
11     NaN        No        No
12  Banana        No        No
13  Banana       Yes        No

>>> counts
0     1
1     0
2     1
3     2
4     3
5     4
6     5
7     0
8     0
9     0
10    0
11    0
12    1
13    2
dtype: int64

Update

if I need to change "Yes" for the fruit name for example

N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)

# Output
     Fruit 2 matches
0    Apple        No
1      NaN        No
2     Pear        No
3     Pear      Pear
4     Pear      Pear
5     Pear      Pear
6     Pear      Pear
7      NaN        No
8      NaN        No
9      NaN        No
10     NaN        No
11     NaN        No
12  Banana        No
13  Banana    Banana
  • Related