Home > Software engineering >  Dataframe find rows out of sequence based on condition in column
Dataframe find rows out of sequence based on condition in column

Time:06-12

I got a a column of type object (df.columntype). I have selected rows with types 'A' and 'B' of interest:

subset = df[df.columntype.isin(['A', 'B'])]

For each 'A', there should be a 'B' after, so the sequence should include the same number of As and Bs in the order A-B-A-B etc. However, the count of df[df.columntype == 'B'] is higher than df[df.columntype == 'A']. What I want to do is to spot the rows where B comes without its parent A. How can I do this?

Here is a sample of the data subset (see above)

    timeframe  columntype   
2   2021-05-13  A
3   2021-05-13  B   
26  2021-05-14  A
27  2021-05-14  B
85  2021-05-15  A

CodePudding user response:

I can think of two ways off the top of my head to achieve this:

Sample df:

df = pd.DataFrame({'type':['B','A','B','A','B','C','B','D','B']})
df

  type
0    B
1    A
2    B
3    A
4    B
5    C
6    B
7    D
8    B

First method:

set_B = set(np.where(df.type == 'B')[0])
set_A = set(np.where(df.type.shift(1) == 'A')[0])

orphans = sorted(set(set_B).difference(set_A))

# or:
# orphans = list(set(set_B).difference(set_A))
# orphans.sort()

df.iloc[orphans,:]

  type
0    B
8    B
6    B

Second method:

subset = df[df.type.isin(['A', 'B'])]

arr_bool = np.ediff1d(subset.index) != 1
arr_bool = np.insert(arr_bool,0,df.type[0]=='B')

non_consec = subset[arr_bool]
orphans = non_consec[non_consec.type == 'B']
orphans

  type
0    B
6    B
8    B
  • Related