I am trying to fill in empty cells in a DataFrame by propagating known values to the other rows within the same group. This works fine when all the columns grouped on are fully populated, but if a grouping column contains empty cells, pandas deletes values. In the example below, the DataFrame has NaN in column "B" at indices 6 and 8. After applying ffill and bfill to the groups, the numbers in rows 6 and 8 have been replaced with NaN. How do I avoid this unintended side effect?
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                   "B": ['a', 'a', 'b', 'b', 'b', 'c', np.nan, 'd',
                         np.nan, 'e', 'e', 'g', 'h', 'h'],
                   "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan,
                         np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, 3.0],
                   "D": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan,
                         np.nan, 3.0, 2.0, np.nan, 9.0, np.nan, 3.0],
                   "E": [np.nan, 6.0, np.nan, 3.0, np.nan, np.nan, 7.0,
                         np.nan, 7.0, 2.0, np.nan, np.nan, np.nan, 0.0]})
cols_to_groupby = ["A", "B"]
cols_to_fill = ["C", "D", "E"]
original_indxs = df.index
df[cols_to_fill] = (
    df.sort_values(cols_to_groupby)[cols_to_fill + cols_to_groupby]
    .groupby(cols_to_groupby)
    .transform(lambda x: x.ffill().bfill())
)
Sample Dataset:
A B C D E
0 1 a 5.0 1.0 NaN
1 1 a NaN 1.0 6.0
2 2 b 4.0 NaN NaN
3 2 b 4.0 2.0 3.0
4 2 b NaN NaN NaN
5 3 c 9.0 NaN NaN
6 3 NaN NaN NaN 7.0
7 3 d NaN NaN NaN
8 3 NaN 9.0 3.0 7.0
9 4 e 8.0 2.0 2.0
10 4 e NaN NaN NaN
11 5 g 2.0 9.0 NaN
12 6 h NaN NaN NaN
13 6 h 3.0 3.0 0.0
Expected Result:
A B C D E
0 1 a 5.0 1.0 6.0
1 1 a 5.0 1.0 6.0
2 2 b 4.0 2.0 3.0
3 2 b 4.0 2.0 3.0
4 2 b 4.0 2.0 3.0
5 3 c 9.0 NaN NaN
6 3 NaN NaN NaN 7.0
7 3 d NaN NaN NaN
8 3 NaN 9.0 3.0 7.0
9 4 e 8.0 2.0 2.0
10 4 e 8.0 2.0 2.0
11 5 g 2.0 9.0 NaN
12 6 h 3.0 3.0 0.0
13 6 h 3.0 3.0 0.0
Actual Result:
A B C D E
0 1 a 5.0 1.0 6.0
1 1 a 5.0 1.0 6.0
2 2 b 4.0 2.0 3.0
3 2 b 4.0 2.0 3.0
4 2 b 4.0 2.0 3.0
5 3 c 9.0 NaN NaN
6 3 NaN NaN NaN NaN
7 3 d NaN NaN NaN
8 3 NaN NaN NaN NaN
9 4 e 8.0 2.0 2.0
10 4 e 8.0 2.0 2.0
11 5 g 2.0 9.0 NaN
12 6 h 3.0 3.0 0.0
13 6 h 3.0 3.0 0.0
CodePudding user response:
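By default, groupby drops the rows where the group key columns contain NaN values, so those rows never get filled values in the result. You therefore cannot rely on index alignment (df[cols_to_fill] = ...) to update the DataFrame, because that assignment writes NaN into every row that has no value in the groupby result. Instead, explicitly update the original DataFrame with the result from groupby. df.update only overwrites with non-NA values, so the rows that were dropped keep their original contents: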
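# group_keys=False keeps the original row index on the result; recent pandas
# versions would otherwise prepend the group keys and break the alignment
cols = df.groupby(cols_to_groupby, group_keys=False)[cols_to_fill].apply(lambda x: x.ffill().bfill())
df.update(cols)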
A B C D E
0 1 a 5.0 1.0 6.0
1 1 a 5.0 1.0 6.0
2 2 b 4.0 2.0 3.0
3 2 b 4.0 2.0 3.0
4 2 b 4.0 2.0 3.0
5 3 c 9.0 NaN NaN
6 3 NaN NaN NaN 7.0
7 3 d NaN NaN NaN
8 3 NaN 9.0 3.0 7.0
9 4 e 8.0 2.0 2.0
10 4 e 8.0 2.0 2.0
11 5 g 2.0 9.0 NaN
12 6 h 3.0 3.0 0.0
13 6 h 3.0 3.0 0.0
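To make the explanation above concrete, here is a minimal sketch on a cut-down frame (the data and the names keys, cols, kept_rows and filled are illustrative assumptions, not part of the question). It first shows that groupby only sees the rows whose keys are complete, then combines transform with df.update, which I believe behaves the same way across pandas versions because update skips NaN values in its argument.

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical version of the question's data
df = pd.DataFrame({
    "A": [1, 1, 3, 3, 3, 3],
    "B": ["a", "a", "c", np.nan, "d", np.nan],
    "C": [5.0, np.nan, 9.0, np.nan, np.nan, 9.0],
    "E": [np.nan, 6.0, np.nan, 7.0, np.nan, 7.0],
})
keys = ["A", "B"]
cols = ["C", "E"]

# Rows with NaN in a key column belong to no group under the default dropna=True
kept_rows = int(df.groupby(keys).size().sum())

# Fill within each group; rows with NaN keys carry no filled values here
filled = df.groupby(keys)[cols].transform(lambda x: x.ffill().bfill())

# update() writes only the non-NA values of `filled`, so rows 3 and 5 are untouched
df.update(filled)
print(df)
```

Because update works in place and ignores NaN, it does not matter whether a given pandas version returns the NaN-keyed rows as all-NaN or leaves them out of the transform result.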
CodePudding user response:
One way is to set aside the rows with NaN in column "B", fill only the rows where "B" is present, and then concatenate the set-aside rows back:
notna_df = df[df['B'].notna()]
# group_keys=False keeps the original row index so that join aligns row by row
filled = (notna_df.groupby(cols_to_groupby, group_keys=False)[cols_to_fill]
          .apply(lambda x: x.bfill().ffill()))
out = (pd.concat((notna_df.drop(columns=cols_to_fill).join(filled),
                  df[df['B'].isna()]))
       .sort_index())
Output:
A B C D E
0 1 a 5.0 1.0 6.0
1 1 a 5.0 1.0 6.0
2 2 b 4.0 2.0 3.0
3 2 b 4.0 2.0 3.0
4 2 b 4.0 2.0 3.0
5 3 c 9.0 NaN NaN
6 3 NaN NaN NaN 7.0
7 3 d NaN NaN NaN
8 3 NaN 9.0 3.0 7.0
9 4 e 8.0 2.0 2.0
10 4 e 8.0 2.0 2.0
11 5 g 2.0 9.0 NaN
12 6 h 3.0 3.0 0.0
13 6 h 3.0 3.0 0.0
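The same filter-then-fill idea can also be written without concat and join, by assigning through a boolean mask. The following is only a sketch under assumptions: the small frame and the names keys, cols and has_keys are made up for illustration, and the mask checks every key column rather than just "B".

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical version of the question's data
df = pd.DataFrame({
    "A": [1, 1, 3, 3, 3, 3],
    "B": ["a", "a", "c", np.nan, "d", np.nan],
    "C": [5.0, np.nan, 9.0, np.nan, np.nan, 9.0],
    "E": [np.nan, 6.0, np.nan, 7.0, np.nan, 7.0],
})
keys = ["A", "B"]
cols = ["C", "E"]

# True for rows where every grouping column has a value
has_keys = df[keys].notna().all(axis=1)

# Group and fill only the rows with complete keys
filled = (
    df[has_keys]
    .groupby(keys)[cols]
    .transform(lambda x: x.ffill().bfill())
)

# Write the filled values back; rows outside the mask are never assigned to
df.loc[has_keys, cols] = filled
print(df)
```

This modifies df in place and keeps the original row order, so no sort_index step is needed afterwards.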