I am filling the missing values in a DataFrame with another column from the same DataFrame, but I can't understand the behaviour.
0 1 2
0 NaN 0.076733 0.378676
1 NaN 0.223911 NaN
2 NaN 0.173071 0.534397
3 NaN 0.991686 0.381196
4 0.088309 0.237683 0.003508
5 0.751860 0.494204 0.757413
6 0.630420 0.192947 0.538492
I am filling column 0 and column 2 with the column 1 series.
df.fillna(df[1])
I would expect it to align on the row index, but it fills each column with a single scalar value from the series instead of with the series itself.
0 1 2
0 0.076733 0.076733 0.378676
1 0.076733 0.223911 0.173071
2 0.076733 0.173071 0.534397
3 0.076733 0.991686 0.381196
4 0.088309 0.237683 0.003508
5 0.751860 0.494204 0.757413
6 0.630420 0.192947 0.538492
Edit: I would expect it to output like this:
0 1 2
0 0.076733 0.076733 0.378676
1 0.223911 0.223911 0.223911
2 0.173071 0.173071 0.534397
3 0.991686 0.991686 0.381196
4 0.088309 0.237683 0.003508
5 0.751860 0.494204 0.757413
6 0.630420 0.192947 0.538492
Can somebody please help explain what's going on here?
Re-edit: I found a way to make pandas do what I want, which is passing a dictionary with an entry for each column. That seems quite verbose.
df.fillna({0:df[1],2:df[1]})
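A possibly less verbose alternative to the dictionary above is to apply Series.fillna to each column, since Series.fillna aligns the filler on the row index. This is only a sketch: the data is made up to resemble the frame in the question, and `filled` is a name chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame shaped like the one in the question: gaps in columns 0 and 2.
df = pd.DataFrame({
    0: [np.nan, np.nan, 0.5, 0.6],
    1: [0.1, 0.2, 0.3, 0.4],
    2: [0.7, np.nan, 0.8, 0.9],
})

# Series.fillna matches the filler to each column by row label, so every gap
# is filled with the value of column 1 from the same row.
filled = df.apply(lambda col: col.fillna(df[1]))
print(filled)
```

Column 1 is passed through unchanged because it has no gaps to fill.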
CodePudding user response:
Specify the method and the axis. This will do:
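import pandas as pd
df = pd.DataFrame({0: [1, 2, 3, 4, 5, 6], 1: [None, None, None, None, 5, 6], 2: [None, None, 3, 4, 5, 6]})
# fillna returns a new frame, so assign the result before printing.
# Newer pandas versions deprecate method= here in favour of df.ffill(axis=1).
df = df.fillna(method='ffill', axis=1)
print(df)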
0 1 2
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 5.0 5.0
5 6.0 6.0 6.0
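For arbitrary columns:
import numpy as np
# keep the existing value where present, otherwise take it from column 0
df[1] = np.where(df[1].isnull(), df[0], df[1])
df[2] = np.where(df[2].isnull(), df[0], df[2])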
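The two approaches in this answer do not always give the same result: a forward fill along axis=1 copies from the nearest non-missing value to the left in the same row, while the np.where form copies from whichever column you name. The sketch below uses made-up data with a gap in the leftmost column to show the difference. The names `row_ffill` and `per_column` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up data: a gap in the leftmost column and one in the last column.
df = pd.DataFrame({
    0: [np.nan, 2.0, 3.0],
    1: [10.0, 20.0, 30.0],
    2: [5.0, np.nan, 7.0],
})

# Forward fill across each row: column 0 has nothing to its left,
# so its NaN is left in place.
row_ffill = df.ffill(axis=1)

# Per-column fill from a chosen source column (column 1 here).
per_column = df.copy()
for col in [0, 2]:
    per_column[col] = np.where(df[col].isnull(), df[1], df[col])

print(row_ffill)
print(per_column)
```

So if the source column sits to the right of the gaps, as in the question, the per-column form is the one that fills them.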
CodePudding user response:
It's filling your NAs according to the first column (column 0, a.k.a. df[0]).
0 1 2
0 *0.895575 0.522721 0.012833
1 **0.522721 0.522721 0.012833
2 ***0.012833 0.522721 0.558843
3 0.258442 0.522721 0.772859
4 0.900045 0.026117 0.720966
5 0.913345 0.677905 0.501755
6 0.907725 0.080543 0.881279
So if you had NAs in the first column, they would be replaced by the value in the first row of df[0]. In your example, that value would be 0.895575.
For NAs in your second column (df[1]), it's using the second row of the column you specified (df[0], the first column). So all NAs are filled with 0.522721.
For NAs in your third column (df[2]), it's using the third row of the column you specified (df[0], the first column). So all NAs are filled with 0.012833.
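The behaviour described above comes from DataFrame.fillna reading the index of a Series as column labels rather than row labels. Here is a minimal sketch with made-up numbers. It works out this way only because the row labels 0, 1, 2 happen to coincide with the column labels 0, 1, 2.

```python
import numpy as np
import pandas as pd

# Made-up frame: integer column labels 0, 1, 2 and default row labels 0, 1, 2.
df = pd.DataFrame({
    0: [0.9, 0.5, 0.1],
    1: [np.nan, np.nan, 0.3],
    2: [np.nan, 0.7, 0.8],
})

# df[0] has index 0, 1, 2. DataFrame.fillna treats that index as column
# labels, so column 1 is filled with df[0][1] and column 2 with df[0][2].
filled = df.fillna(df[0])

# Spelling the same mapping out as a dict of scalars gives the same frame.
as_dict = df.fillna({0: 0.9, 1: 0.5, 2: 0.1})

print(filled)
print(filled.equals(as_dict))
```

Every NA in column 1 receives 0.5 and every NA in column 2 receives 0.1, regardless of the row it sits in.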
Hope this helps!
Edit: I suspect geekay's solution will accomplish what you had intended:
df.fillna(method='ffill', axis=1)