Pandas deleting cells when filling groups with NaN in groupby column

Time:03-16

I am trying to fill empty cells in a DataFrame by propagating known values to the other rows of the same group. This works when every column being grouped on is fully populated, but if a group-key column contains empty cells, pandas deletes values. In the example below, column "B" is NaN at indices 6 and 8. After applying ffill and bfill to the groups, the numbers in rows 6 and 8 have been replaced with NaN. How do I avoid this unintended side effect?

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                    "B": ['a', 'a', 'b', 'b', 'b', 'c', np.nan, 'd', 
                          np.nan, 'e', 'e', 'g', 'h', 'h'],
                    "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, 
                          np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, 3.0],
                    "D": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan, 
                          np.nan, 3.0, 2.0, np.nan, 9.0, np.nan, 3.0],
                    "E": [np.nan, 6.0, np.nan, 3.0, np.nan, np.nan, 7.0, 
                          np.nan, 7.0, 2.0, np.nan, np.nan, np.nan, 0.0]})

cols_to_groupby = ["A", "B"]
cols_to_fill = ["C", "D", "E"]

original_indxs = df.index
df[cols_to_fill] = (
    df.sort_values(cols_to_groupby)[cols_to_fill + cols_to_groupby]
    .groupby(cols_to_groupby)
    .transform(lambda x: x.ffill().bfill())
)

Sample Dataset:

    A    B    C    D    E
0   1    a  5.0  1.0  NaN
1   1    a  NaN  1.0  6.0
2   2    b  4.0  NaN  NaN
3   2    b  4.0  2.0  3.0
4   2    b  NaN  NaN  NaN
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  7.0
7   3    d  NaN  NaN  NaN
8   3  NaN  9.0  3.0  7.0
9   4    e  8.0  2.0  2.0
10  4    e  NaN  NaN  NaN
11  5    g  2.0  9.0  NaN
12  6    h  NaN  NaN  NaN
13  6    h  3.0  3.0  0.0

Expected Result:

    A    B    C    D    E
0   1    a  5.0  1.0  6.0
1   1    a  5.0  1.0  6.0
2   2    b  4.0  2.0  3.0
3   2    b  4.0  2.0  3.0
4   2    b  4.0  2.0  3.0
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  7.0
7   3    d  NaN  NaN  NaN
8   3  NaN  9.0  3.0  7.0
9   4    e  8.0  2.0  2.0
10  4    e  8.0  2.0  2.0
11  5    g  2.0  9.0  NaN
12  6    h  3.0  3.0  0.0
13  6    h  3.0  3.0  0.0

Actual Result:

    A    B    C    D    E
0   1    a  5.0  1.0  6.0
1   1    a  5.0  1.0  6.0
2   2    b  4.0  2.0  3.0
3   2    b  4.0  2.0  3.0
4   2    b  4.0  2.0  3.0
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  NaN
7   3    d  NaN  NaN  NaN
8   3  NaN  NaN  NaN  NaN
9   4    e  8.0  2.0  2.0
10  4    e  8.0  2.0  2.0
11  5    g  2.0  9.0  NaN
12  6    h  3.0  3.0  0.0
13  6    h  3.0  3.0  0.0

CodePudding user response:

By default, groupby ignores rows where any group-key column contains NaN. The rows at indices 6 and 8 therefore belong to no group, so assigning the result back through index alignment (df[cols_to_fill] = ...) leaves them as NaN. Instead, explicitly update the original DataFrame with the result from groupby:

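# group_keys=False keeps the original row index; recent pandas versions otherwise add the group keys to it.
cols = df.groupby(cols_to_groupby, group_keys=False)[cols_to_fill].apply(lambda x: x.ffill().bfill())
# update() writes only non-NaN values, so rows 6 and 8 keep their data.
df.update(cols)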

    A    B    C    D    E
0   1    a  5.0  1.0  6.0
1   1    a  5.0  1.0  6.0
2   2    b  4.0  2.0  3.0
3   2    b  4.0  2.0  3.0
4   2    b  4.0  2.0  3.0
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  7.0
7   3    d  NaN  NaN  NaN
8   3  NaN  9.0  3.0  7.0
9   4    e  8.0  2.0  2.0
10  4    e  8.0  2.0  2.0
11  5    g  2.0  9.0  NaN
12  6    h  3.0  3.0  0.0
13  6    h  3.0  3.0  0.0
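
To confirm that the NaN-keyed rows are the ones that go missing, you can compare how many rows land in a group with and without dropna. This is only a sketch on a small made-up frame (demo, kept and all_rows are illustrative names), and the dropna argument assumes pandas 1.1 or newer:

```python
import numpy as np
import pandas as pd

# Made-up data mirroring rows 5-8 of the question (A == 3).
demo = pd.DataFrame({
    "A": [3, 3, 3, 3],
    "B": ["c", np.nan, "d", np.nan],
    "C": [9.0, np.nan, np.nan, 9.0],
})

# Default (dropna=True): rows whose key contains NaN belong to no group.
kept = demo.groupby(["A", "B"]).size().sum()

# dropna=False keeps them, but all NaN keys share ONE group.
all_rows = demo.groupby(["A", "B"], dropna=False).size().sum()

print(kept, all_rows)  # 2 4
```

Note that dropna=False is not a drop-in fix for the question: it would put rows 6 and 8 into one shared (3, NaN) group, so the fill would copy values between them, which is not the expected result shown above.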

CodePudding user response:

One way is to filter out the rows with NaN and fill only those without NaN in column "B"; then concatenate the remainder back:

notna_df = df[df['B'].notna()]
# group_keys=False keeps the original index so that join() can align on it.
out = (pd.concat((notna_df.drop(columns=cols_to_fill)
                 .join(notna_df.groupby(cols_to_groupby, group_keys=False)[cols_to_fill]
                       .apply(lambda x: x.bfill().ffill())),
                 df[df['B'].isna()]))
       .sort_index())

Output:

    A    B    C    D    E
0   1    a  5.0  1.0  6.0
1   1    a  5.0  1.0  6.0
2   2    b  4.0  2.0  3.0
3   2    b  4.0  2.0  3.0
4   2    b  4.0  2.0  3.0
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  7.0
7   3    d  NaN  NaN  NaN
8   3  NaN  9.0  3.0  7.0
9   4    e  8.0  2.0  2.0
10  4    e  8.0  2.0  2.0
11  5    g  2.0  9.0  NaN
12  6    h  3.0  3.0  0.0
13  6    h  3.0  3.0  0.0
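
A possible variation on the same idea, without drop/join/concat: build a boolean mask of the rows whose keys are complete, fill only those, and write the result back with .loc. This is a minimal sketch on made-up data; the names keys, fill, mask and filled are illustrative and not from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 3, 3],
    "B": ["a", "a", np.nan, np.nan],
    "C": [5.0, np.nan, np.nan, 9.0],
    "D": [np.nan, 1.0, 7.0, np.nan],
})
keys = ["A", "B"]   # columns to group on
fill = ["C", "D"]   # columns to fill within each group

# True for rows where every key column is present.
mask = df[keys].notna().all(axis=1)

# Fill only the rows with complete keys; transform keeps their index.
filled = (
    df.loc[mask]
    .groupby(keys)[fill]
    .transform(lambda s: s.ffill().bfill())
)

# Write back by index/column alignment; rows with NaN keys are never touched.
df.loc[mask, fill] = filled
print(df)
```

Because the rows with NaN keys are excluded before grouping, nothing depends on how groupby treats missing keys.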