i tried the following code to replace outlier values from each row to NaN, but this return the entire dataframe with NaN, what am i doing wrong?
anyone can help?
data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0,0, 0],
['ANJSHD15',10, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0],
['ANJSHD17',19, 18, 22, 19, 25, 18, 23, 22, 22, 17,16, 19]]
df = pd.DataFrame(data, columns=['MATRÍCULA','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12'])
df
range = list(df.columns.values)[1:13]
q1 = df[range].quantile(0.25, axis=1)
q3 = df[range].quantile(0.75, axis=1)
iqr = q3-q1 #interquartile range
min = df.min(axis=1)
max = q3 3*iqr
df_filtered = df[(df[range] > min) & (df[range] < max)]
df_filtered
CodePudding user response:
data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0,0, 0],
['ANJSHD15',10, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0],
['ANJSHD17',19, 18, 22, 19, 25, 18, 23, 22, 22, 17,16, 19]]
columns = ['MATRÍCULA','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12']
df = pd.DataFrame(data, columns=columns).set_index('MATRÍCULA')
df
Here I am setting the 'MATRÍCULA'
column as an index with .set_index('MATRÍCULA')
.
This way, you won't have to select all the other columns every time. Alternatively, you could create a view and use it: only_values_df = df.iloc[:, 1:]
.
Here, convert min_vals
and max_vals
to numpy arrays for comparison with the dataframe.
If I left them as Series, I would get this Warning:
FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version.
and, actually, it would produce a wrong result, where all the values of the DataFrame would be False
! This was the source of your problem.
q1 = df[cols_range].quantile(0.25, axis=1)
q3 = df[cols_range].quantile(0.75, axis=1)
iqr = q3 - q1 # interquartile range
min_vals = df.min(axis=1).to_numpy().reshape(-1,1)
max_vals = (q3 3*iqr).to_numpy().reshape(-1,1)
Use >=
and <=
instead of >
and <
:
df_filtered = df[(df >= min_vals) & (df <= max_vals)]
df_filtered
CodePudding user response:
That works for me
df_filtered = df[range][df[range].ge(min, axis=0) & df[range].le(max, axis=0)]
If you run df[range] > min
or df[range] < max
you will see that the output is only false dataframe. false & false = false
, so no element will be taken from df
and you get dataframe with Nan values only. Not sure why it's the case (comparing DataFrame and Series must work like that, I'm not an expert in pandas).
Instead, use methods le
and ge
(you need non-strict inequality, because you want to keep values like 0).
BTW. try not to use variable names like range
, min
, max
or other build-in functions or keywords
CodePudding user response:
Use apply
and where
.
Your df:
>>> df
MATRÍCULA V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0 ANJSHD12 140 8 99992 0 0 0 0 1 99999 0 0 0
1 ANJSHD15 10 0 0 0 0 0 0 0 0 0 0 0
2 ANJSHD17 19 18 22 19 25 18 23 22 22 17 16 19
Using where
within apply
by column and reassigning your range. (renamed with a trailing underscore to avoid overwriting the builtin)
df[range_] = df[range_].apply(lambda col: col.where((col>=min_)&(col<=max_)))
output
>>> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0 140.0 8 NaN 0 0 0 0 1 NaN 0 0 0
1 NaN 0 0.0 0 0 0 0 0 0.0 0 0 0
2 19.0 18 22.0 19 25 18 23 22 22.0 17 16 19
CodePudding user response:
I think you'll benefit from choosing a different Interpolation method for your quantile
calculation.
# You overcomplicate how this works:
cols = df.columns[1:13]
# Note how I put `0.75` in `[]`, this changes how the output is returned.
q3 = df[cols].quantile([0.75], axis=1, interpolation='higher')
# Less than or equal to...
mask = (df[cols].le(q3.T.values)
# There's no reason to have this next line if you're just going to use the min...
& df[cols].ge(df[cols].min(axis=1).to_frame().values))
# Overwrite outlier values:
df[cols] = df[cols][mask]
print(df)
Output:
MATRÍCULA V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
0 ANJSHD12 140.0 8 NaN 0 0.0 0 0.0 1 NaN 0 0 0
1 ANJSHD15 NaN 0 0.0 0 0.0 0 0.0 0 0.0 0 0 0
2 ANJSHD17 19.0 18 22.0 19 NaN 18 NaN 22 22.0 17 16 19
Assuming 'MATRÍCULA'
are unique, We can simplify this all with a groupby:
df[cols] = (df.groupby('MATRÍCULA')[cols]
.apply(lambda x: x[x.le(x.quantile([0.75], axis=1, interpolation='higher').values)]))
print(df)
# Output:
# Same as Above!