Home > OS >  create dataframe with outliers and then replace with nan
create dataframe with outliers and then replace with nan

Time:09-18

I am trying to make a function to spot the columns with "100" in the header and replace the values in these columns with NaN depending on multiple criteria. I also want in the function the value of the column "first_column" corresponding to the outlier.

For instance let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values :

I start with this dataframe:
enter image description here

  import pandas as pd
        
    data = {'first_column':  [product_name', 'product_name2', 'product_name3'],
                'second_column': ['first_value', 'second_value', 'third_value'],
                 'third_100':['89', '9', '589'],
                'fourth_100':['25', '1568200', '5''],
                }
        
    df = pd.DataFrame(data)
    
    print (df)

expected output: enter image description here

CodePudding user response:

IIUC, you can use filter and boolean indexing:

# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)

# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))

# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))

# get rows with at least one match
out2 = df.loc[mask.any(1), ['first_column'] list(df.filter(like='100'))]

output 1:

    first_column second_column third_100 fourth_100
0   product_name   first_value        89         25
1  product_name2  second_value         9        NaN
2  product_name3   third_value       NaN          5

output 2:

    first_column third_100 fourth_100
1  product_name2         9    1568200
2  product_name3       589          5
  • Related