I have a pandas DataFrame of hourly financial valuations with some outlier values. The outliers have already been calculated and flagged in one of the dataframe's columns.
So the dataframe looks something like this, with each asset having 24 unique entries per day (the integer following the date in the "hour" column is the hour of the day the "value" was calculated):
asset | value | hour | outlier? |
---|---|---|---|
AAAA | 5.1 | 1/1/21 1 | no |
BBBB | 7.2 | 1/1/21 1 | no |
CCCC | 9999 | 1/1/21 1 | yes |
AAAA | 5.1 | 1/1/21 2 | no |
BBBB | 7.2 | 1/1/21 2 | no |
CCCC | 15.0 | 1/1/21 2 | no |
.... | ... | ... | ... |
AAAA | 5.3 | 1/2/21 1 | no |
BBBB | 9999 | 1/2/21 1 | yes |
CCCC | 15.6 | 1/2/21 1 | no |
... | ... | ... | ... |
AAAA | 8.1 | 9/1/21 20 | no |
BBBB | 9.5 | 9/1/21 20 | no |
CCCC | 9999 | 9/1/21 20 | yes |
I want to replace the "value" of the outlier rows with the median values of the same financial asset from the same day.
Essentially, I want to overwrite the outlier values with the median values grouped by the asset and the hour.
I've tried doing this a bunch of ways and have only gotten errors. I appreciate the help!
CodePudding user response:
This approach first creates a new column of dates by pulling the 1/1/21
out of the 1/1/21 1
for example. Then using the new day
and the asset
column, it calculates the fill_vals
which are the day/asset median values of the non-outliers. Finally it uses these values to fill in the outliers
import pandas as pd
#Reading in part of your table
df = pd.DataFrame({
'asset': ['AAAA', 'BBBB', 'CCCC', 'AAAA', 'BBBB', 'CCCC'],
'value': [5.1, 7.2, 9999.0, 5.1, 7.2, 15.0],
'hour': ['1/1/21 1','1/1/21 1','1/1/21 1','1/1/21 2','1/1/21 2','1/1/21 2'],
'outlier?': ['no', 'no', 'yes', 'no', 'no', 'no']
})
#Create a new column of the day
df['day'] = df['hour'].str.split(' ').str[0]
fill_vals = df[df['outlier?'].eq('no')].groupby(['asset','day'])['value'].median()
#Reindex the table by asset/day, fill in the outliers, then reset the index
df = df.set_index(['asset','day'])
df.loc[df['outlier?'].eq('yes'),'value'] = fill_vals
df = df.reset_index()
df
CodePudding user response:
# If you want a boolean, use a boolean.
df['outlier?'] = df['outlier?'].replace(['yes', 'no'], [True, False])
# NaN your outliers:
df.loc[df['outlier?'], 'value'] = np.nan
# Fix your Date Column:
df[['date', 'hour']] = df.hour.str.split(expand=True)
df.date = pd.to_datetime(df.date)
df.hour = pd.to_timedelta(df.hour.astype(int), unit='h')
df.date = df.hour
df = df.drop('hour', axis=1)
# Fill NaN values by asset and date:
df['value'] = df.groupby(['asset', df.date.dt.date])['value'].apply(lambda x: x.fillna(x.median()))
print(df)
Output:
asset value outlier? date
0 AAAA 5.1 False 2021-01-01 01:00:00
1 BBBB 7.2 False 2021-01-01 01:00:00
2 CCCC 15.0 True 2021-01-01 01:00:00
3 AAAA 5.1 False 2021-01-01 02:00:00
4 BBBB 7.2 False 2021-01-01 02:00:00
5 CCCC 15.0 False 2021-01-01 02:00:00