Dataframe fillna with series

Time:02-05

I am filling the missing values in a DataFrame with another column from the same DataFrame, but I can't understand the behaviour.

    0              1        2
0   NaN         0.076733    0.378676
1   NaN         0.223911    NaN
2   NaN         0.173071    0.534397
3   NaN         0.991686    0.381196
4   0.088309    0.237683    0.003508
5   0.751860    0.494204    0.757413
6   0.630420    0.192947    0.538492

I want to fill the NaNs in column 0 and column 2 with the column 1 series.

df.fillna(df[1])

I would expect it to align on the row index, but it fills each column with a single scalar value taken from the series instead of filling with the series itself.

    0           1           2
0   0.076733    0.076733    0.378676
1   0.076733    0.223911    0.173071
2   0.076733    0.173071    0.534397
3   0.076733    0.991686    0.381196
4   0.088309    0.237683    0.003508
5   0.751860    0.494204    0.757413
6   0.630420    0.192947    0.538492

Edit: I would expect it to output like this:

    0           1           2
0   0.076733    0.076733    0.378676
1   0.223911    0.223911    0.223911
2   0.173071    0.173071    0.534397
3   0.991686    0.991686    0.381196
4   0.088309    0.237683    0.003508
5   0.751860    0.494204    0.757413
6   0.630420    0.192947    0.538492

Can somebody please help explain what's going on here?

Second edit: I found a way to make pandas do what I want, by passing a dictionary with one entry per column, which seems quite verbose.

df.fillna({0:df[1],2:df[1]})
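
A less verbose option, offered as a sketch rather than the canonical answer: apply Series.fillna column by column, since Series.fillna with a Series argument aligns on the row index. The frame below reuses the numbers from the question; the name filled is only for illustration.

```python
import numpy as np
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    0: [np.nan, np.nan, np.nan, np.nan, 0.088309, 0.751860, 0.630420],
    1: [0.076733, 0.223911, 0.173071, 0.991686, 0.237683, 0.494204, 0.192947],
    2: [0.378676, np.nan, 0.534397, 0.381196, 0.003508, 0.757413, 0.538492],
})

# apply hands each column over as a Series;
# Series.fillna(other_series) matches values on the row index.
filled = df.apply(lambda col: col.fillna(df[1]))
print(filled)
```

This fills every column from df[1], so restrict it to a subset of columns if only some of them should be touched.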

CodePudding user response:

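Forward-fill across the columns (axis=1). This will do:

import pandas as pd
df=pd.DataFrame({0:[1,2,3,4,5,6],1:[None,None,None,None,5,6],2:[None,None,3,4,5,6]})

# The fill returns a new DataFrame, so assign the result back before printing.
# df.ffill(axis=1) is the same as df.fillna(method='ffill', axis=1) in older pandas.
df = df.ffill(axis=1)
print(df)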

    0   1   2
0   1.0 1.0 1.0
1   2.0 2.0 2.0
2   3.0 3.0 3.0
3   4.0 4.0 4.0
4   5.0 5.0 5.0
5   6.0 6.0 6.0
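
One caveat worth checking against your own data: a forward fill along axis=1 copies the nearest non-missing value to the left within the same row, not the value of a chosen column, and a missing value in the leftmost column has nothing to its left. A sketch on the question's frame, using DataFrame.ffill for the forward fill (filled is an illustrative name):

```python
import numpy as np
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    0: [np.nan, np.nan, np.nan, np.nan, 0.088309, 0.751860, 0.630420],
    1: [0.076733, 0.223911, 0.173071, 0.991686, 0.237683, 0.494204, 0.192947],
    2: [0.378676, np.nan, 0.534397, 0.381196, 0.003508, 0.757413, 0.538492],
})

filled = df.ffill(axis=1)
print(filled)
# Column 0 is still NaN in rows 0-3 (nothing to the left to copy from);
# row 1 of column 2 takes 0.223911 from column 1.
```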

For arbitrary columns:

import numpy as np
# take the value from column 0 wherever the target column is missing
df[1]=np.where(df[1].isnull(),df[0],df[1])
df[2]=np.where(df[2].isnull(),df[0],df[2])

CodePudding user response:

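It fills your NAs by matching the index labels of the series you pass against the column labels of the DataFrame. Take this example, where the series passed is the first column (column 0, a.k.a. df[0]):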

        0           1           2
0   *0.895575    0.522721    0.012833
1   **0.522721    0.522721    0.012833
2   ***0.012833    0.522721    0.558843
3   0.258442    0.522721    0.772859
4   0.900045    0.026117    0.720966
5   0.913345    0.677905    0.501755
6   0.907725    0.080543    0.881279

So if you had NAs in the first column, they would be replaced by the value in the first row of df[0]. In this example, that value would be 0.895575.

For NAs in your second column (df[1]), it uses the second row of the column you specified (df[0], the first column). So all NAs are filled with 0.522721.

For NAs in your third column (df[2]), it uses the third row of the column you specified (df[0], the first column). So all NAs are filled with 0.012833.
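
The label matching described above can be checked with a small sketch. The frame reuses the numbers from the question and passes df[1], as the question's code does; s, filled and explicit are illustrative names.

```python
import numpy as np
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    0: [np.nan, np.nan, np.nan, np.nan, 0.088309, 0.751860, 0.630420],
    1: [0.076733, 0.223911, 0.173071, 0.991686, 0.237683, 0.494204, 0.192947],
    2: [0.378676, np.nan, 0.534397, 0.381196, 0.003508, 0.757413, 0.538492],
})

s = df[1]
# The series index (0..6) is matched against the column labels (0, 1, 2):
# column 0 is filled with s[0], column 2 with s[2]; s[3]..s[6] match no column.
filled = df.fillna(s)

# The same fill written out as one scalar per column label
explicit = df.fillna({0: s[0], 1: s[1], 2: s[2]})
print(filled.equals(explicit))
```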

Hope this helps!

Edit: I suspect geekay's solution will accomplish what you had intended:

df.ffill(axis=1)  # spelled df.fillna(method='ffill', axis=1) in older pandas
