I'm pretty new to Pandas and am kind of stucked with a problem to replace Nan-Values with median values from a different dataframe. The median dataframe has a different form, because I had to group the original df to get the medians.
My main dataframe df1 looks something like this:
permno yyyymm BookLeverage Cash RoE ShareIss1Y ShareIss5Y SP date industry_id STreversal Price Size ret
541 10006 197101 -1.907577 NaN 0.114616 0.000000 0.051689 1.197606 1971-01-29 37 -4.383562 -3.863358 -12.496377 0.043836
542 10006 197102 -1.907577 NaN 0.114616 0.000000 0.051689 1.220021 1971-02-26 37 0.577428 -3.844814 -12.477833 -0.005774
543 10006 197103 -1.907577 NaN 0.114616 0.000000 0.051689 1.118353 1971-03-31 37 -9.090909 -3.931826 -12.564844 0.090909
544 10006 197104 -1.907577 NaN 0.114616 0.000000 0.051689 NaN 1971-04-30 37 -16.176471 -4.081766 -12.714785 0.161765
545 10006 197105 -1.907577 NaN 0.114616 0.000000 0.051689 1.025366 1971-05-28 37 5.105485 -4.018633 -12.651651 -0.051055
Then I created a new dataframe df2 in which I grouped the former df by the yyyymm
and industry_id
column, and got the median for each time-industry panel.
The median df2 looks something like this:
permno BookLeverage Cash RoE ShareIss1Y \
yyyymm industry_id
197101 01 40957.5 -2.451327 NaN 0.015212 -0.306936
10 19254.0 -1.300565 NaN 0.123353 -0.002747
12 33081.5 -2.102402 NaN -0.001043 -0.255756
13 26470.0 -2.028418 NaN 0.116907 -0.005262
14 17830.0 -1.266574 NaN 0.110059 -0.000193
... ... ... ... ... ...
202112 80 78633.0 -3.037694 NaN 0.195342 NaN
82 52123.0 -3.093551 NaN 0.017580 NaN
83 13739.0 -2.802522 NaN 0.021025 NaN
87 78667.5 -3.103168 NaN 0.104524 NaN
97 91547.0 -3.054443 NaN 0.162610 NaN
ShareIss5Y SP STreversal Price Size \
yyyymm industry_id
197101 01 -7.591944 5.439985 -9.998244 -2.684046 -11.483201
10 -1.432833 0.517484 -4.504504 -3.367296 -11.826440
12 -20.622667 2.264890 -22.648810 -2.873900 -11.501783
13 -0.257821 0.752112 -5.429864 -3.607534 -12.362360
14 -0.223948 0.636665 -16.075773 -2.729726 -11.386150
... ... ... ... ... ...
202112 80 NaN NaN -10.960198 -4.539740 -16.024733
82 NaN NaN -1.664319 -2.740474 -13.882130
83 NaN NaN -2.383083 -4.835329 -15.843560
87 NaN NaN -5.109321 -4.585741 -15.844537
97 NaN NaN -1.535659 -4.487512 -16.339328
ret
yyyymm industry_id
197101 01 0.099982
10 0.045045
12 0.226488
13 0.054299
14 0.160758
... ...
202112 80 0.109602
82 0.016643
83 0.023831
87 0.051093
97 0.015357
What I'm now trying to achieve, is to fill the NaN-values in the df1 with the corresponding value from df2. So that for example the SP column in row 544 would get the value which is in df2 at yyyymm
197104 with industry_id
37.
I tried to map over all rows and inside that over all columns and replace the NaN-values, but this broke my dataframe:
def fill_nan_with_median(row):
date = int(row['yyyymm'])
industry = row['industry_id']
for label, column in row.items():
if column == np.nan:
median = df_median.loc[(date, industry), label]
df_1.loc[index, label] = median
for index, row in df_1.iterrows():
fill_nan_with_median(row)
CodePudding user response:
This is all done without data, therefore you may need to change something (hopefully not),
df_grouped_median = df1.groupby(['yyyymm', 'industry_id'], as_index=False).SP.median().rename(
columns={"SP":"median"})
df = df.merge(df_grouped_median, on=['yyyymm', 'industry_id'], how='left')
df['SP'].fillna(df['median'])
CodePudding user response:
This answer takes a table lookup approach. For NaNs
in the SP
column it does a lookup into df2
for the median SP
value. This answer also assumes that yyyymm
and industry_id
are strings and not numeric.
df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)
541 1.197606
542 1.220021
543 1.118353
544 0.636665
545 1.025366
Note that non-NaNs are detected by the weird looking x['SP']==x['SP']
leveraging the fact that NaN != NaN
.
Your df1
was used along with a df2
that I created:
SP
yyyymm industry_id
197104 01 5.439985
10 0.517484
12 2.264890
13 0.752112
37 0.636665
All that you need to do after that is assign that back to the df1
frame:
df1.assign(SP=df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1))
permno yyyymm BookLeverage Cash RoE ShareIss1Y ShareIss5Y \
541 10006 197101 -1.907577 NaN 0.114616 0.0 0.051689
542 10006 197102 -1.907577 NaN 0.114616 0.0 0.051689
543 10006 197103 -1.907577 NaN 0.114616 0.0 0.051689
544 10006 197104 -1.907577 NaN 0.114616 0.0 0.051689
545 10006 197105 -1.907577 NaN 0.114616 0.0 0.051689
SP date industry_id STreversal Price Size \
541 1.197606 1971-01-29 37 -4.383562 -3.863358 -12.496377
542 1.220021 1971-02-26 37 0.577428 -3.844814 -12.477833
543 1.118353 1971-03-31 37 -9.090909 -3.931826 -12.564844
544 0.636665 1971-04-30 37 -16.176471 -4.081766 -12.714785
545 1.025366 1971-05-28 37 5.105485 -4.018633 -12.651651
ret
541 0.043836
542 -0.005774
543 0.090909
544 0.161765
545 -0.051055
Or by:
df1['SP'] = df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)