how can i replace outliers values of each row in dataframe with NaN?-CodePudding

i tried the following code to replace outlier values from each row to NaN, but this return the entire dataframe with NaN, what am i doing wrong?

the code i'm trying

the result should be this

anyone can help?

data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0,0, 0],
    ['ANJSHD15',10, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0], 
    ['ANJSHD17',19, 18, 22, 19, 25, 18, 23, 22, 22, 17,16, 19]]
df = pd.DataFrame(data, columns=['MATRÍCULA','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12'])
df

range = list(df.columns.values)[1:13]

q1 = df[range].quantile(0.25, axis=1)
q3 = df[range].quantile(0.75, axis=1)
iqr = q3-q1 #interquartile range
min  = df.min(axis=1)
max = q3 3*iqr
df_filtered = df[(df[range] > min) & (df[range] < max)]
df_filtered

CodePudding user response：

data = [['ANJSHD12', 140, 8, 99992, 0, 0, 0, 0, 1, 99999, 0,0, 0],
        ['ANJSHD15',10, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0], 
        ['ANJSHD17',19, 18, 22, 19, 25, 18, 23, 22, 22, 17,16, 19]]
columns = ['MATRÍCULA','V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10','V11', 'V12']
df = pd.DataFrame(data, columns=columns).set_index('MATRÍCULA')
df

Here I am setting the 'MATRÍCULA' column as an index with .set_index('MATRÍCULA').

This way, you won't have to select all the other columns every time. Alternatively, you could create a view and use it: only_values_df = df.iloc[:, 1:].

Here, convert min_vals and max_vals to numpy arrays for comparison with the dataframe.

If I left them as Series, I would get this Warning:

FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version.

and, actually, it would produce a wrong result, where all the values of the DataFrame would be False! This was the source of your problem.

q1 = df[cols_range].quantile(0.25, axis=1)
q3 = df[cols_range].quantile(0.75, axis=1)
iqr = q3 - q1 # interquartile range

min_vals = df.min(axis=1).to_numpy().reshape(-1,1)
max_vals = (q3   3*iqr).to_numpy().reshape(-1,1)

Use >= and <= instead of > and <:

df_filtered = df[(df >= min_vals) & (df <= max_vals)]
df_filtered

CodePudding user response：

That works for me

df_filtered = df[range][df[range].ge(min, axis=0) & df[range].le(max, axis=0)]

If you run df[range] > min or df[range] < max you will see that the output is only false dataframe. false & false = false, so no element will be taken from df and you get dataframe with Nan values only. Not sure why it's the case (comparing DataFrame and Series must work like that, I'm not an expert in pandas). Instead, use methods le and ge (you need non-strict inequality, because you want to keep values like 0).

BTW. try not to use variable names like range, min, max or other build-in functions or keywords

CodePudding user response：

Use apply and where.

Your df:

>>> df
  MATRÍCULA   V1  V2     V3  V4  V5  V6  V7  V8     V9  V10  V11  V12
0  ANJSHD12  140   8  99992   0   0   0   0   1  99999    0    0    0
1  ANJSHD15   10   0      0   0   0   0   0   0      0    0    0    0
2  ANJSHD17   19  18     22  19  25  18  23  22     22   17   16   19

Using where within apply by column and reassigning your range. (renamed with a trailing underscore to avoid overwriting the builtin)

df[range_] = df[range_].apply(lambda col: col.where((col>=min_)&(col<=max_)))

output

>>> df
      V1  V2    V3  V4  V5  V6  V7  V8    V9  V10  V11  V12
0  140.0   8   NaN   0   0   0   0   1   NaN    0    0    0
1    NaN   0   0.0   0   0   0   0   0   0.0    0    0    0
2   19.0  18  22.0  19  25  18  23  22  22.0   17   16   19

CodePudding user response：

I think you'll benefit from choosing a different Interpolation method for your quantile calculation.

# You overcomplicate how this works:
cols = df.columns[1:13]

# Note how I put `0.75` in `[]`, this changes how the output is returned.
q3 = df[cols].quantile([0.75], axis=1, interpolation='higher')

# Less than or equal to...
mask = (df[cols].le(q3.T.values)
   # There's no reason to have this next line if you're just going to use the min... 
      & df[cols].ge(df[cols].min(axis=1).to_frame().values))

# Overwrite outlier values:
df[cols] = df[cols][mask]
print(df)

Output:

  MATRÍCULA     V1  V2    V3  V4   V5  V6   V7  V8    V9  V10  V11  V12
0  ANJSHD12  140.0   8   NaN   0  0.0   0  0.0   1   NaN    0    0    0
1  ANJSHD15    NaN   0   0.0   0  0.0   0  0.0   0   0.0    0    0    0
2  ANJSHD17   19.0  18  22.0  19  NaN  18  NaN  22  22.0   17   16   19

Assuming 'MATRÍCULA' are unique, We can simplify this all with a groupby:

df[cols] = (df.groupby('MATRÍCULA')[cols]
              .apply(lambda x: x[x.le(x.quantile([0.75], axis=1, interpolation='higher').values)]))
print(df)

# Output:
# Same as Above!