I have a dataframe like as shown below
person_id source_system r_diff
1 O NULL
1 O 0
1 O 9
1 O NULL
2 O 574
2 I 20
2 O 135
2 O 0
2 I 21
2 O 2
2 O 0
2 O 0
2 I 12
I would like to select rows based on the criteria below
criteria 1 - pick all rows where source-system = I
criteria 2 - pick prior row (n-1) only when source-system of (n-1)th is O
and diff is zero.
This criteria 2 should be applied only when nth row has source-system = I
. If source-system of (n-1)th is I
, we don't have to do anything because criteria 1 will handle that.
We have to apply both the criteria each person
I tried the below based on SO suggestion but not sure how to make it work
m1 = df['visit_source_value'] == 'I'
m2 = df['diff'] <= 0
m3 = df.groupby('person_id')['diff'].shift(-1) <= 0
df = df1[m1 | m2 | m3]
I expect my output to be like as shown below
2 I 20
2 O 0
2 I 21
2 O 0
2 I 12
CodePudding user response:
I prefer not one line solution, because hard readable if more complicated code, so better is use:
m1 = df['visit_source_value'] == 'I'
m2 = df['r_diff'] <= 0
m3 = df.groupby('person_id')['visit_source_value'].shift(-1) == 'I'
df = df[m1 | (m2 & m3)]
print (df)
person_id visit_source_value r_diff
5 2 I 20.0
7 2 O 0.0
8 2 I 21.0
11 2 O 0.0
12 2 I 12.0
CodePudding user response:
You could do it in one-line:
>>> df[df['source_system'].eq('I') | (df['source_system'].shift(-1).eq('I') & df['r_diff'].le(0))]
person_id source_system r_diff
5 2 I 20.0
7 2 O 0.0
8 2 I 21.0
11 2 O 0.0
12 2 I 12.0
>>>