I have 2 dataframes, the first is a small dataframe (df1) with information to use to fill a field (named Flag) of the second dataframe (df2). I need to write a function that uses each row of df1 as parameters to fill each row of df2 with a certain value (Y or N).
df1 =
type | q25 | q75 |
---|---|---|
A | 13 | 98 |
B | 381 | 500 |
C | 34 | 103 |
df2 =
field1 | field2 | ... | TYPE | delta | Flag |
---|---|---|---|---|---|
field1 | field2 | ... | A | 379 | Y |
field1 | field2 | ... | C | 90 | N |
field1 | field2 | ... | A | 50 | N |
field1 | field2 | ... | B | 2000 | Y |
I tried this code, but unfortunately it doesn't work because it overwrites the lines in df2 related to the cases of df1 with type equal to A or B, and it works for the last record of df2 (type equal to C)
def filling(x, row):
IQR = row['q75'] - row['q25']
if (x['Flag'] != 'Y'):
if row['type'] == x['TYPE'] and (x['delta'] < row['q25'] - 1.5*IQR or x['delta'] > row['q75'] 1.5*IQR):
return 'Y'
else:
return 'N'
for index, row in df2.iterrows():
df1['Flag'] = df1.apply(lambda x : filling(x, row), axis=1)
How could I fix it?
CodePudding user response:
If I got your question correctly, you missed two parenthesis:
def filling(x, row):
IQR = row['q75'] - row['q25']
if (x['Flag'] != 'Y'):
if (row['type'] == x['TYPE'] and (x['delta'] < row['q25'] - 1.5*IQR) or (x['delta'] > row['q75'] 1.5*IQR)): #here, you just needed to divide the expression
return 'Y'
else:
return 'N'
for index, row in df2.iterrows():
df1['Flag'] = df1.apply(lambda x : filling(x, row), axis=1)
field1 field2 TYPE delta Flag
0 field1 field2 A 379 Y
1 field1 field2 C 90 N
2 field1 field2 A 50 N
3 field1 field2 B 2000 Y
CodePudding user response:
From what I understand you would like to have a flag column which tell you whether the particular row is an outlier or not. Here is a vectorized and concise way to achieve that:
# Merge the dataframes on type column
s = df2.merge(df1, left_on='TYPE', right_on='type', how='left')
# calculate IQR and condition to check for outlier
s['IQR'] = s['q75'] - s['q25']
is_outlier = ~s['delta'].between(s['q25'] - 1.5 * s['IQR'], s['q25'] 1.5 * s['IQR'])
# Use np.where to select Y/N based on the outlier condition
s['Flag'] = np.where(s['Flag'].ne('Y') & is_outlier, 'Y', s['Flag'])
# drop the columns from df1
s = s.drop(columns=df1.columns)
Result
print(s)
field1 field2 ... TYPE delta Flag IQR
0 field1 field2 ... A 379 Y 85
1 field1 field2 ... C 90 N 69
2 field1 field2 ... A 50 N 85
3 field1 field2 ... B 2000 Y 119