Home > other >  Back/Forward fill DataFrameGroup by based on column condition
Back/Forward fill DataFrameGroup by based on column condition

Time:09-02

I am conducting a back and forward fill on a column on a DataFrame based on an ID.

Attempted solution:

def my_func(x):
    if merged['email_conf'].isnull().all():
        return x.bfill().ffill()
    else:
        return merged['email_conf'].bfill().ffill()

merged['email_unconf'] = merged.groupby('id')['email_unconf'].apply(my_func) 

There are some instances where the each id has more than one email for example. I have a third column called email_conf and if within the group by that column contains an email, I use that one to back and forward fill (if it appears in email_conf the email will also be present in email_unconf. Otherwise I just use the email_unconf column to b/f fill.

Starting dataframe:

            id          email_unconf      email_conf
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    [email protected]   [email protected]
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    5827553206045509    NaN               NaN
    .
    . 
    .
    5827553206045510    [email protected]    NaN
    5827553206045510    NaN               NaN
    5827553206045510    NaN               NaN
    5827553206045510    NaN               NaN
    .
    .
    .
    

Resulting Df:

            vid         email_unconf      email_conf
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]   [email protected]
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    5827553206045509    [email protected]    NaN
    .
    . 
    .
    5827553206045510    [email protected]    NaN
    5827553206045510    [email protected]    NaN
    5827553206045510    [email protected]    NaN
    5827553206045510    [email protected]    NaN
    .
    .
    .

CodePudding user response:

Make a dictionary to map the values from the 'email_conf', then forward fill and back fill the rest.

import pandas as pd
import numpy as np


cols = ['id','email_unconf','email_conf']
data = [
    ['5827553206045509',    '[email protected]',    np.nan],
    ['5827553206045509',    '[email protected]',    np.nan],
    ['5827553206045509',    np.nan,               np.nan],
    ['5827553206045509',    np.nan,              np.nan],
    ['5827553206045509',    '[email protected]',   '[email protected]'],
    ['5827553206045509',    np.nan,               np.nan],
    ['5827553206045509',    np.nan,                np.nan],
    ['5827553206045509',    np.nan,               np.nan],
    ['5827553206045509',    np.nan,                np.nan],
    ['5827553206045509',    np.nan,              np.nan],
    ['5827553206045509',    '[email protected]',    np.nan],
    ['5827553206045509',    np.nan,               np.nan],
    ['5827553206045509',    np.nan,                np.nan],
    ['5827553206045509',    np.nan,               np.nan],
    ['5827553206045510',    np.nan,    np.nan],
    ['5827553206045510',    '[email protected]',                np.nan],
    ['5827553206045510',    np.nan,               np.nan],
    ['5827553206045510',    np.nan,               np.nan]]

df = pd.DataFrame(data, columns=cols)

not_null = df[~df['email_conf'].isna()]
email_dict = dict(zip(list(not_null['id']), list(not_null['email_conf'])))

df['email_unconf_fill'] = df['id'].map(email_dict).fillna(df['email_unconf'])
df['email_unconf_fill'] = df.groupby('id')['email_unconf_fill'].ffill()
df['email_unconf_fill'] = df.groupby('id')['email_unconf_fill'].bfill()


df['email_unconf'] = df['email_unconf_fill']
df = df.drop('email_unconf_fill', axis=1)

Output:

print(df)
                  id     email_unconf       email_conf
0   5827553206045509  [email protected]              NaN
1   5827553206045509  [email protected]              NaN
2   5827553206045509  [email protected]              NaN
3   5827553206045509  [email protected]              NaN
4   5827553206045509  [email protected]  [email protected]
5   5827553206045509  [email protected]              NaN
6   5827553206045509  [email protected]              NaN
7   5827553206045509  [email protected]              NaN
8   5827553206045509  [email protected]              NaN
9   5827553206045509  [email protected]              NaN
10  5827553206045509  [email protected]              NaN
11  5827553206045509  [email protected]              NaN
12  5827553206045509  [email protected]              NaN
13  5827553206045509  [email protected]              NaN
14  5827553206045510   [email protected]              NaN
15  5827553206045510   [email protected]              NaN
16  5827553206045510   [email protected]              NaN
17  5827553206045510   [email protected]              NaN
  • Related