I got a a column of type object (df.columntype
).
I have selected rows with types 'A'
and 'B'
of interest:
subset = df[df.columntype.isin(['A', 'B'])]
For each 'A'
, there should be a 'B'
after, so the sequence should include the same number of As and Bs in the order A-B-A-B etc. However, the count of df[df.columntype == 'B']
is higher than df[df.columntype == 'A']
. What I want to do is to spot the rows where B comes without its parent A. How can I do this?
Here is a sample of the data subset (see above)
timeframe columntype
2 2021-05-13 A
3 2021-05-13 B
26 2021-05-14 A
27 2021-05-14 B
85 2021-05-15 A
CodePudding user response:
I can think of two ways off the top of my head to achieve this:
Sample df:
df = pd.DataFrame({'type':['B','A','B','A','B','C','B','D','B']})
df
type
0 B
1 A
2 B
3 A
4 B
5 C
6 B
7 D
8 B
First method:
set_B = set(np.where(df.type == 'B')[0])
set_A = set(np.where(df.type.shift(1) == 'A')[0])
orphans = sorted(set(set_B).difference(set_A))
# or:
# orphans = list(set(set_B).difference(set_A))
# orphans.sort()
df.iloc[orphans,:]
type
0 B
8 B
6 B
Second method:
subset = df[df.type.isin(['A', 'B'])]
arr_bool = np.ediff1d(subset.index) != 1
arr_bool = np.insert(arr_bool,0,df.type[0]=='B')
non_consec = subset[arr_bool]
orphans = non_consec[non_consec.type == 'B']
orphans
type
0 B
6 B
8 B