Home > Net >  Pandas: replacing outlier values based with median values grouped by a timeframe
Pandas: replacing outlier values based with median values grouped by a timeframe

Time:08-09

I have a pandas DataFrame of hourly financial valuations with some outlier values. The outliers have already been calculated and flagged in one of the dataframe's columns.

So the dataframe looks something like this, with each asset having 24 unique entries per day (the integer following the date in the "hour" column is the hour of the day the "value" was calculated):

asset value hour outlier?
AAAA 5.1 1/1/21 1 no
BBBB 7.2 1/1/21 1 no
CCCC 9999 1/1/21 1 yes
AAAA 5.1 1/1/21 2 no
BBBB 7.2 1/1/21 2 no
CCCC 15.0 1/1/21 2 no
.... ... ... ...
AAAA 5.3 1/2/21 1 no
BBBB 9999 1/2/21 1 yes
CCCC 15.6 1/2/21 1 no
... ... ... ...
AAAA 8.1 9/1/21 20 no
BBBB 9.5 9/1/21 20 no
CCCC 9999 9/1/21 20 yes

I want to replace the "value" of the outlier rows with the median values of the same financial asset from the same day.

Essentially, I want to overwrite the outlier values with the median values grouped by the asset and the hour.

I've tried doing this a bunch of ways and have only gotten errors. I appreciate the help!

CodePudding user response:

This approach first creates a new column of dates by pulling the 1/1/21 out of the 1/1/21 1 for example. Then using the new day and the asset column, it calculates the fill_vals which are the day/asset median values of the non-outliers. Finally it uses these values to fill in the outliers

import pandas as pd

#Reading in part of your table
df = pd.DataFrame({
    'asset': ['AAAA', 'BBBB', 'CCCC', 'AAAA', 'BBBB', 'CCCC'],
    'value': [5.1, 7.2, 9999.0, 5.1, 7.2, 15.0],
    'hour': ['1/1/21 1','1/1/21 1','1/1/21 1','1/1/21 2','1/1/21 2','1/1/21 2'],
    'outlier?': ['no', 'no', 'yes', 'no', 'no', 'no']
})

#Create a new column of the day
df['day'] = df['hour'].str.split(' ').str[0]

fill_vals = df[df['outlier?'].eq('no')].groupby(['asset','day'])['value'].median()

#Reindex the table by asset/day, fill in the outliers, then reset the index
df = df.set_index(['asset','day'])
df.loc[df['outlier?'].eq('yes'),'value'] = fill_vals
df = df.reset_index()

df

CodePudding user response:

# If you want a boolean, use a boolean.
df['outlier?'] = df['outlier?'].replace(['yes', 'no'], [True, False])

# NaN your outliers:
df.loc[df['outlier?'], 'value'] = np.nan

# Fix your Date Column:
df[['date', 'hour']] = df.hour.str.split(expand=True)
df.date = pd.to_datetime(df.date)
df.hour = pd.to_timedelta(df.hour.astype(int), unit='h')
df.date  = df.hour
df = df.drop('hour', axis=1)

# Fill NaN values by asset and date:
df['value'] = df.groupby(['asset', df.date.dt.date])['value'].apply(lambda x: x.fillna(x.median()))

print(df)

Output:

  asset  value  outlier?                date
0  AAAA    5.1     False 2021-01-01 01:00:00
1  BBBB    7.2     False 2021-01-01 01:00:00
2  CCCC   15.0      True 2021-01-01 01:00:00
3  AAAA    5.1     False 2021-01-01 02:00:00
4  BBBB    7.2     False 2021-01-01 02:00:00
5  CCCC   15.0     False 2021-01-01 02:00:00
  • Related