Pandas replace all NaN values with data from a different dataframe-CodePudding

I'm pretty new to Pandas and am kind of stucked with a problem to replace Nan-Values with median values from a different dataframe. The median dataframe has a different form, because I had to group the original df to get the medians.

My main dataframe df1 looks something like this:

      permno    yyyymm  BookLeverage Cash   RoE        ShareIss1Y   ShareIss5Y   SP         date        industry_id     STreversal  Price         Size      ret
541     10006   197101  -1.907577   NaN     0.114616    0.000000    0.051689    1.197606    1971-01-29  37              -4.383562   -3.863358   -12.496377  0.043836
542     10006   197102  -1.907577   NaN     0.114616    0.000000    0.051689    1.220021    1971-02-26  37              0.577428    -3.844814   -12.477833  -0.005774
543     10006   197103  -1.907577   NaN     0.114616    0.000000    0.051689    1.118353    1971-03-31  37              -9.090909   -3.931826   -12.564844  0.090909
544     10006   197104  -1.907577   NaN     0.114616    0.000000    0.051689    NaN         1971-04-30  37              -16.176471  -4.081766   -12.714785  0.161765
545     10006   197105  -1.907577   NaN     0.114616    0.000000    0.051689    1.025366    1971-05-28  37              5.105485    -4.018633   -12.651651  -0.051055

Then I created a new dataframe df2 in which I grouped the former df by the yyyymm and industry_id column, and got the median for each time-industry panel.

The median df2 looks something like this:

                     permno  BookLeverage  Cash       RoE  ShareIss1Y  \
yyyymm industry_id                                                      
197101 01           40957.5     -2.451327   NaN  0.015212   -0.306936   
       10           19254.0     -1.300565   NaN  0.123353   -0.002747   
       12           33081.5     -2.102402   NaN -0.001043   -0.255756   
       13           26470.0     -2.028418   NaN  0.116907   -0.005262   
       14           17830.0     -1.266574   NaN  0.110059   -0.000193   
...                     ...           ...   ...       ...         ...   
202112 80           78633.0     -3.037694   NaN  0.195342         NaN   
       82           52123.0     -3.093551   NaN  0.017580         NaN   
       83           13739.0     -2.802522   NaN  0.021025         NaN   
       87           78667.5     -3.103168   NaN  0.104524         NaN   
       97           91547.0     -3.054443   NaN  0.162610         NaN   

                    ShareIss5Y        SP  STreversal     Price       Size  \
yyyymm industry_id                                                          
197101 01            -7.591944  5.439985   -9.998244 -2.684046 -11.483201   
       10            -1.432833  0.517484   -4.504504 -3.367296 -11.826440   
       12           -20.622667  2.264890  -22.648810 -2.873900 -11.501783   
       13            -0.257821  0.752112   -5.429864 -3.607534 -12.362360   
       14            -0.223948  0.636665  -16.075773 -2.729726 -11.386150   
...                        ...       ...         ...       ...        ...   
202112 80                  NaN       NaN  -10.960198 -4.539740 -16.024733   
       82                  NaN       NaN   -1.664319 -2.740474 -13.882130   
       83                  NaN       NaN   -2.383083 -4.835329 -15.843560   
       87                  NaN       NaN   -5.109321 -4.585741 -15.844537   
       97                  NaN       NaN   -1.535659 -4.487512 -16.339328   

                         ret  
yyyymm industry_id            
197101 01           0.099982  
       10           0.045045  
       12           0.226488  
       13           0.054299  
       14           0.160758  
...                      ...  
202112 80           0.109602  
       82           0.016643  
       83           0.023831  
       87           0.051093  
       97           0.015357

What I'm now trying to achieve, is to fill the NaN-values in the df1 with the corresponding value from df2. So that for example the SP column in row 544 would get the value which is in df2 at yyyymm 197104 with industry_id 37.

I tried to map over all rows and inside that over all columns and replace the NaN-values, but this broke my dataframe:

def fill_nan_with_median(row):
    date = int(row['yyyymm'])
    industry = row['industry_id']


    for label, column in row.items():
        if column == np.nan:
            median = df_median.loc[(date, industry), label]
            df_1.loc[index, label] = median
    

for index, row in df_1.iterrows():
    fill_nan_with_median(row)

CodePudding user response：

This is all done without data, therefore you may need to change something (hopefully not),

df_grouped_median = df1.groupby(['yyyymm', 'industry_id'], as_index=False).SP.median().rename(
    columns={"SP":"median"})
df = df.merge(df_grouped_median, on=['yyyymm', 'industry_id'], how='left')
df['SP'].fillna(df['median'])

CodePudding user response：

This answer takes a table lookup approach. For NaNs in the SP column it does a lookup into df2 for the median SP value. This answer also assumes that yyyymm and industry_id are strings and not numeric.

df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)

541    1.197606
542    1.220021
543    1.118353
544    0.636665
545    1.025366

Note that non-NaNs are detected by the weird looking x['SP']==x['SP'] leveraging the fact that NaN != NaN.

Your df1 was used along with a df2 that I created:

                          SP
yyyymm industry_id          
197104 01           5.439985
       10           0.517484
       12           2.264890
       13           0.752112
       37           0.636665

All that you need to do after that is assign that back to the df1 frame:

df1.assign(SP=df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1))

     permno  yyyymm  BookLeverage  Cash       RoE  ShareIss1Y  ShareIss5Y  \
541   10006  197101     -1.907577   NaN  0.114616         0.0    0.051689   
542   10006  197102     -1.907577   NaN  0.114616         0.0    0.051689   
543   10006  197103     -1.907577   NaN  0.114616         0.0    0.051689   
544   10006  197104     -1.907577   NaN  0.114616         0.0    0.051689   
545   10006  197105     -1.907577   NaN  0.114616         0.0    0.051689   

           SP        date industry_id  STreversal     Price       Size  \
541  1.197606  1971-01-29          37   -4.383562 -3.863358 -12.496377   
542  1.220021  1971-02-26          37    0.577428 -3.844814 -12.477833   
543  1.118353  1971-03-31          37   -9.090909 -3.931826 -12.564844   
544  0.636665  1971-04-30          37  -16.176471 -4.081766 -12.714785   
545  1.025366  1971-05-28          37    5.105485 -4.018633 -12.651651   

          ret  
541  0.043836  
542 -0.005774  
543  0.090909  
544  0.161765  
545 -0.051055

Or by:

df1['SP'] = df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)