pandas listing same indexes

If a table has the same index value three times in a row, I want to get those rows back as a dataframe.

example

index  var1
1        a     
2        b    
2        c
2        d
3        e
2        f
5        g
2        f

Expected output after running the code:

index  var1
2       b
2       c
2       d

CodePudding user response：

One option is to split the data frame wherever the index value changes, check the size of each chunk, filter out the chunks that are smaller than the threshold, and then recombine the rest:

import pandas as pd
import numpy as np
diff_indices = np.flatnonzero(df['index'].diff().ne(0))

diff_indices
# array([0, 1, 4, 5, 6, 7], dtype=int32)

pd.concat([chunk for chunk in np.split(df, diff_indices) if len(chunk) >= 3])
   index var1
1      2    b
2      2    c
3      2    d
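
If you would rather not pass the frame to np.split (as far as I can tell, newer pandas releases warn about this because np.split goes through DataFrame.swapaxes), the same idea can be sketched with plain iloc slices. This is only a sketch: the helper name runs_at_least is made up for illustration, and "index" is assumed to be an ordinary numeric column, as in the question.

```python
import numpy as np
import pandas as pd

# Example frame from the question; "index" is assumed to be a regular column.
df = pd.DataFrame({
    "index": [1, 2, 2, 2, 3, 2, 5, 2],
    "var1": list("abcdefgf"),
})

def runs_at_least(frame, column, min_len):
    """Hypothetical helper: keep rows that sit in a run of at least
    min_len equal consecutive values of a numeric column."""
    # Positions where a new run starts. Row 0 always counts as a start,
    # because diff() is NaN there and NaN != 0.
    starts = np.flatnonzero(frame[column].diff().ne(0))
    # Each run ends where the next one starts; the last run ends at len(frame).
    ends = np.append(starts[1:], len(frame))
    chunks = [frame.iloc[s:e] for s, e in zip(starts, ends) if e - s >= min_len]
    # Return an empty frame with the same columns when nothing qualifies.
    return pd.concat(chunks) if chunks else frame.iloc[0:0]

result = runs_at_least(df, "index", 3)
print(result)
```

With the question's data this keeps rows 1 to 3, the same rows as the np.split version.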

CodePudding user response：

Let us identify the blocks of consecutive indices using cumsum, then group and transform with count to find the size of each block, and finally select the rows where the block size is greater than 2:

b = df['index'].diff().ne(0).cumsum()
df[b.groupby(b).transform('count') > 2]

   index var1
1      2    b
2      2    c
3      2    d
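
To see what the two lines are doing, it can help to put the intermediate values next to the data. A small sketch, rebuilding the question's frame; the names block_id, block_size and steps are only for illustration:

```python
import pandas as pd

# Example frame from the question; "index" is assumed to be a regular column.
df = pd.DataFrame({
    "index": [1, 2, 2, 2, 3, 2, 5, 2],
    "var1": list("abcdefgf"),
})

# True wherever the value differs from the previous row (the first row is
# True too, since its diff is NaN); cumsum turns that into a block label.
block_id = df["index"].diff().ne(0).cumsum()

# Size of the block each row belongs to, aligned row by row with df.
block_size = block_id.groupby(block_id).transform("count")

# Side-by-side view of the intermediate columns.
steps = df.assign(block_id=block_id, block_size=block_size)
print(steps)

selected = df[block_size > 2]
print(selected)
```

Here block_id comes out as 1, 2, 2, 2, 3, 4, 5, 6 and block_size as 1, 3, 3, 3, 1, 1, 1, 1, so only the three rows of block 2 pass the filter.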

CodePudding user response：

You can give consecutive equal rows the same label by comparing each row with the previous one and taking the cumsum. Then group by that label and keep the groups that have exactly 3 rows:

m = df['index'].ne(df['index'].shift()).cumsum()
out = df.groupby(m).filter(lambda g: len(g) == 3)  # g is the sub-frame of one group

print(out)

   index var1
1      2    b
2      2    c
3      2    d
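
Note that len(...) == 3 keeps runs of exactly three rows, so a run of four is dropped. Whether that is what you want depends on how you read "3 times in a row". A sketch of the difference, using a made-up frame (not the question's data) that has one run of three and one run of four:

```python
import pandas as pd

# Hypothetical data: a run of three 2s, a run of four 7s, then a single 1.
df = pd.DataFrame({
    "index": [2, 2, 2, 7, 7, 7, 7, 1],
    "var1": list("abcdefgh"),
})

# Label runs of equal consecutive values by comparing with the previous row.
m = df["index"].ne(df["index"].shift()).cumsum()

exactly_three = df.groupby(m).filter(lambda g: len(g) == 3)
at_least_three = df.groupby(m).filter(lambda g: len(g) >= 3)

print(exactly_three)   # only the run of 2s
print(at_least_three)  # the run of 2s and the run of 7s
```

Comparing with shift() instead of diff() also works when the column is not numeric.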

CodePudding user response：

Here's one more solution on top of the ones above (this one is more generalizable, since it selects ALL slices that meet the given criterion):

import pandas as pd

df['diff_index'] = df['index'].diff(-1) # calcs the index diff
df = df.fillna(999) # get rid of NaNs
df['diff_index'] = df['diff_index'].astype(int) # convert the diff to int

df_selected = [] # create a list of all dfs we're going to slice

l = list(df['diff_index'])

for i in range(len(l)-1):
    if l[i] == 0 and l[i + 1] == 0: # if 2 consecutive 0s are found, get the slice
        df_temp = df[df.index.isin([i, i + 1, i + 2])] # assumes the default 0..n-1 row labels
        del df_temp['diff_index']
        df_selected.append(df_temp) # append the slice to our list 

print(df_selected) # list all identified data frames (in your example, there will be only one)

[   index var1
 1      2    b
 2      2    c
 3      2    d]
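
One thing to be aware of: the loop emits one 3-row slice per window, so a run of four equal values produces two overlapping slices. If one frame per run is preferred, a possible sketch is to label the runs and collect the groups that are long enough. The data below is the question's frame with a made-up run of four 4s appended, and the names run_id and runs are only for illustration:

```python
import pandas as pd

# Question's data plus a hypothetical run of four 4s at the end.
df = pd.DataFrame({
    "index": [1, 2, 2, 2, 3, 2, 5, 2, 4, 4, 4, 4],
    "var1": list("abcdefgfwxyz"),
})

# Label runs of equal consecutive values by comparing with the previous row.
run_id = df["index"].ne(df["index"].shift()).cumsum()

# One data frame per run of at least 3 rows, in order of appearance.
runs = [group for _, group in df.groupby(run_id) if len(group) >= 3]

for run in runs:
    print(run)
```

This yields two frames here: rows 1 to 3 (the 2s) and rows 8 to 11 (the 4s).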