If a table has the same index value three times in a row, I want to extract those rows as a dataframe.
example
index var1
1 a
2 b
2 c
2 d
3 e
2 f
5 g
2 f
Expected output after running the code:
index var1
2 b
2 c
2 d
CodePudding user response:
One option is to split the data frame wherever the index value changes, check the size of each chunk, filter out the chunks that are smaller than the threshold, and then recombine the rest:
import pandas as pd
import numpy as np
diff_indices = np.flatnonzero(df['index'].diff().ne(0))
diff_indices
# array([0, 1, 4, 5, 6, 7], dtype=int32)
pd.concat([chunk for chunk in np.split(df, diff_indices) if len(chunk) >= 3])
index var1
1 2 b
2 2 c
3 2 d
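For anyone who wants to run this end to end, here is a minimal sketch of the same chunking idea. It builds the sample frame from the question and slices the runs with iloc instead of np.split, because np.split on a DataFrame may emit a deprecation warning on recent pandas versions. The names starts, ends, chunks and result are my own and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "index": [1, 2, 2, 2, 3, 2, 5, 2],
    "var1": list("abcdefgf"),
})

# Positions where a new run of equal values starts.
starts = np.flatnonzero(df["index"].diff().ne(0).to_numpy())
# Each run ends where the next one starts; the last run ends at len(df).
ends = np.append(starts[1:], len(df))

# Slice each run by position and keep only the runs with at least 3 rows.
chunks = [df.iloc[s:e] for s, e in zip(starts, ends) if e - s >= 3]
# Fall back to an empty frame with the same columns if nothing qualifies.
result = pd.concat(chunks) if chunks else df.iloc[0:0]
print(result)
```

The threshold 3 is hard-coded here only to match the question; it could just as well be a variable.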
CodePudding user response:
Let us identify the blocks of consecutive indices using cumsum, then group and transform with count to find the size of each block, then select the rows where the block size > 2:
b = df['index'].diff().ne(0).cumsum()
df[b.groupby(b).transform('count') > 2]
index var1
1 2 b
2 2 c
3 2 d
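To see what the two lines are doing, it can help to print the intermediate values. The sketch below rebuilds the sample frame from the question and shows the block labels next to the block sizes. It uses 'size' instead of 'count', which gives the same result here because the labels contain no NaN; the column names block and size are only for display.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "index": [1, 2, 2, 2, 3, 2, 5, 2],
    "var1": list("abcdefgf"),
})

# A new block starts wherever the value differs from the previous row;
# the cumulative sum turns those start flags into block labels.
b = df["index"].diff().ne(0).cumsum()

# Broadcast each block's row count back onto its rows.
sizes = b.groupby(b).transform("size")

# Side-by-side view of the intermediate values.
print(pd.DataFrame({"index": df["index"], "block": b, "size": sizes}))

# Keep only the rows that sit in a block of 3 or more.
out = df[sizes > 2]
print(out)
```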
CodePudding user response:
You can give consecutive rows with the same value a shared label by comparing each row with the previous one (shift) and taking the cumsum. Then group by that label and keep only the groups that have exactly 3 rows:
m = df['index'].ne(df['index'].shift()).cumsum()
out = df.groupby(m).filter(lambda col: len(col) == 3)
print(out)
index var1
1 2 b
2 2 c
3 2 d
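Note that == 3 keeps only runs of exactly three rows, while the > 2 test in the previous answer also keeps longer runs. A small sketch that makes the difference explicit is below. The helper rows_in_runs and its parameters are hypothetical names of mine, and the sample frame is different from the one in the question so that it contains a run of four.

```python
import pandas as pd

def rows_in_runs(df, column, n, exact=True):
    """Return the rows of df that sit in runs of equal consecutive values.

    exact=True keeps runs of exactly n rows, exact=False keeps runs of
    n rows or more. (Hypothetical helper, not from the answers above.)
    """
    # Label each run: a new label starts whenever the value changes.
    run_id = df[column].ne(df[column].shift()).cumsum()
    # Map every row to the length of the run it belongs to.
    run_len = run_id.map(run_id.value_counts())
    mask = run_len.eq(n) if exact else run_len.ge(n)
    return df[mask]

# Made-up data: a run of four 2s followed by a run of three 3s.
df = pd.DataFrame({
    "index": [1, 2, 2, 2, 2, 3, 3, 3],
    "var1": list("abcdefgh"),
})

exact3 = rows_in_runs(df, "index", 3)                 # only the run of 3s
atleast3 = rows_in_runs(df, "index", 3, exact=False)  # both runs
print(exact3)
print(atleast3)
```

Which of the two you want depends on whether "3 times in a row" should also match 4 or more in a row.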
CodePudding user response:
Here's one more solution on top of the ones above (this one is more generalizable, since it collects ALL slices that meet the given criterion into a list):
import pandas as pd
df['diff_index'] = df['index'].diff(-1)  # difference between each row's index and the next row's
df = df.fillna(999)  # the last row has no next row, so replace its NaN with a sentinel
df['diff_index'] = df['diff_index'].astype(int)  # convert the diff to int
df_selected = []  # list of all the slices we are going to collect
l = list(df['diff_index'])
for i in range(len(l)-1):
    if l[i] == 0 and l[i+1] == 0:  # 2 consecutive 0s mean 3 equal indices in a row
        # assumes the default 0..n-1 row labels, so labels and positions match
        df_temp = df[df.index.isin([i, i+1, i+2])]
        del df_temp['diff_index']
        df_selected.append(df_temp)  # append the slice to our list
print(df_selected)  # list all identified data frames (in your example, there will be only one)
[ index var1
1 2 b
2 2 c
3 2 d]
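One thing to be aware of with this window-based approach: a run longer than three rows produces several overlapping three-row slices rather than one block. The sketch below shows that on made-up data with a run of four. It compares neighbouring values directly instead of going through diff, so it is a simplified variant of the loop above and not the same code; the names values and windows are mine.

```python
import pandas as pd

# Made-up data with a run of four equal index values.
df = pd.DataFrame({
    "index": [1, 2, 2, 2, 2, 3],
    "var1": list("abcdef"),
})

values = df["index"].tolist()
windows = []
# Slide a window of three rows over the frame.
for i in range(len(values) - 2):
    # Keep the window if all three values are equal.
    if values[i] == values[i + 1] == values[i + 2]:
        windows.append(df.iloc[i:i + 3])

# Two overlapping windows come out of the single run of four.
for w in windows:
    print(w, end="\n\n")
```

If you want each run only once, the cumsum-based answers above are the better fit.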