I've a dataset
id | ref | name | conditionCol |
---|---|---|---|
1 | 123 | a | no_error |
1 | 456 | b | error |
1 | 789 | c | no_error |
2 | 231 | d | no_error |
2 | 312 | e | no_error |
2 | 546 | f | no_error |
3 | 645 | g | error |
3 | 879 | h | error |
4 | 789 | i | no_error |
4 | 978 | j | no_error |
I'm trying to create a custom error_flag
, condition being:
- for each unique
id
column elements - if any row in the
conditionCol
has the keyworderror
, then - for each row should be flagged as
yes
in theerror_flag
- if for any element in
id
column - not even a single row has the keyword
error
inconditionCol
column, then - for each row should be flagged as
no
in theerror_flag
E.g. For id
:1, all the values of error_flag
is yes, as for id value 1, row #2 of conditionCol
has error
id | ref | name | conditionCol | error_flag |
---|---|---|---|---|
1 | 123 | a | no_error | yes |
1 | 456 | b | error | yes |
1 | 789 | c | no_error | yes |
But, for id
:2, all the values of error_flag
is no, as for id value 2, no row of conditionCol
has error
id | ref | name | conditionCol | error_flag |
---|---|---|---|---|
2 | 231 | d | no_error | no |
2 | 312 | e | no_error | no |
2 | 546 | f | no_error | no |
Similarly for id
value 3 & 4:
id | ref | name | conditionCol | error_flag |
---|---|---|---|---|
3 | 645 | g | no_error | no |
3 | 879 | h | no_error | no |
4 | 789 | i | error | yes |
4 | 978 | j | error | yes |
And final output being:
id | ref | name | conditionCol | error_flag |
---|---|---|---|---|
1 | 123 | a | no_error | yes |
1 | 456 | b | error | yes |
1 | 789 | c | no_error | yes |
2 | 231 | d | no_error | no |
2 | 312 | e | no_error | no |
2 | 546 | f | no_error | no |
3 | 645 | g | no_error | no |
3 | 879 | h | no_error | no |
4 | 789 | i | error | yes |
4 | 978 | j | error | yes |
Update:
If you wish to play around with the dataset:
import pandas as pd
import numpy as np
id_col = [1,1,1,2,2,2,3,3,4,4]
ref_col = [123,456, 789, 231, 312, 546, 645, 879, 789, 978]
name_col = ['a','b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
conditionCol = ['no_error', 'error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'no_error', 'error', 'error']
df = pd.DataFrame(zip(id_col, ref_col, name_col, conditionCol), columns=['id','ref','name','conditionCol'])
df
update2: Is there a way to work with thresholds, i.e.:
- current question: atleast one occurrence of keyword
error
in conditionCol column for each individual uniqueid
s, then the value inerror_flag
would beyes
for all the rows in that id value - atleast 4 or atleast 5 occurrence of keyword
error
in conditionCol column for uniqueid
s, then only the value inerror_flag
would beyes
for all the rows in that id value
CodePudding user response:
Use numpy.where
with test if at least one value error
per groups by id
:
m = df['id'].isin(df.loc[df['conditionCol'].eq('error'), 'id'])
#alternative
#m = df['conditionCol'].eq('error').groupby(df['id']).transform('any')
df['error_flag'] = np.where(m, 'yes', 'no')
print (df)
id ref name conditionCol error_flag
0 1 123 a no_error yes
1 1 456 b error yes
2 1 789 c no_error yes
3 2 231 d no_error no
4 2 312 e no_error no
5 2 546 f no_error no
6 3 645 g no_error no
7 3 879 h no_error no
8 4 789 i error yes
9 4 978 j error yes