I am conducting a back and forward fill on a column on a DataFrame based on an ID.
Attempted solution:
def my_func(x):
if merged['email_conf'].isnull().all():
return x.bfill().ffill()
else:
return merged['email_conf'].bfill().ffill()
merged['email_unconf'] = merged.groupby('id')['email_unconf'].apply(my_func)
There are some instances where the each id has more than one email for example. I have a third column called email_conf
and if within the group by
that column contains an email, I use that one to back and forward fill (if it appears in email_conf
the email will also be present in email_unconf
. Otherwise I just use the email_unconf
column to b/f fill.
Starting dataframe:
id email_unconf email_conf
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 [email protected] [email protected]
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 [email protected] NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
5827553206045509 NaN NaN
.
.
.
5827553206045510 [email protected] NaN
5827553206045510 NaN NaN
5827553206045510 NaN NaN
5827553206045510 NaN NaN
.
.
.
Resulting Df:
vid email_unconf email_conf
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] [email protected]
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
5827553206045509 [email protected] NaN
.
.
.
5827553206045510 [email protected] NaN
5827553206045510 [email protected] NaN
5827553206045510 [email protected] NaN
5827553206045510 [email protected] NaN
.
.
.
CodePudding user response:
Make a dictionary to map the values from the 'email_conf'
, then forward fill and back fill the rest.
import pandas as pd
import numpy as np
cols = ['id','email_unconf','email_conf']
data = [
['5827553206045509', '[email protected]', np.nan],
['5827553206045509', '[email protected]', np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', '[email protected]', '[email protected]'],
['5827553206045509', np.nan, np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', '[email protected]', np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045509', np.nan, np.nan],
['5827553206045510', np.nan, np.nan],
['5827553206045510', '[email protected]', np.nan],
['5827553206045510', np.nan, np.nan],
['5827553206045510', np.nan, np.nan]]
df = pd.DataFrame(data, columns=cols)
not_null = df[~df['email_conf'].isna()]
email_dict = dict(zip(list(not_null['id']), list(not_null['email_conf'])))
df['email_unconf_fill'] = df['id'].map(email_dict).fillna(df['email_unconf'])
df['email_unconf_fill'] = df.groupby('id')['email_unconf_fill'].ffill()
df['email_unconf_fill'] = df.groupby('id')['email_unconf_fill'].bfill()
df['email_unconf'] = df['email_unconf_fill']
df = df.drop('email_unconf_fill', axis=1)
Output:
print(df)
id email_unconf email_conf
0 5827553206045509 [email protected] NaN
1 5827553206045509 [email protected] NaN
2 5827553206045509 [email protected] NaN
3 5827553206045509 [email protected] NaN
4 5827553206045509 [email protected] [email protected]
5 5827553206045509 [email protected] NaN
6 5827553206045509 [email protected] NaN
7 5827553206045509 [email protected] NaN
8 5827553206045509 [email protected] NaN
9 5827553206045509 [email protected] NaN
10 5827553206045509 [email protected] NaN
11 5827553206045509 [email protected] NaN
12 5827553206045509 [email protected] NaN
13 5827553206045509 [email protected] NaN
14 5827553206045510 [email protected] NaN
15 5827553206045510 [email protected] NaN
16 5827553206045510 [email protected] NaN
17 5827553206045510 [email protected] NaN