I have a dataframe like below:
dummy_df_dict = {'Email':['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
'Country_name': ['Canada', 'No Country Listed', 'No Country Listed', 'Germany'],
'Continent':['North America', 'No Contient listed', 'No Contient listed', 'Europe']}
Email Transaction_Country Country_name Continent
0 [email protected] CA Canada North America
1 [email protected] No Country Listed No Country Listed No Contient listed
2 [email protected] No Country Listed No Country Listed No Contient listed
3 [email protected] DE Germany Europe
I'm trying to replace the "No XYZ listed" placeholder text with the actual country / continent, taken from the row above or below that has the same email.
So, for example, the 2nd row [1] would show 'CA' as its 'Transaction_Country' instead of 'No Country Listed'. On the next row, for johnsmith, the value would be replaced with 'DE'.
Many thanks!
CodePudding user response:
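import numpy as np
# Turn both placeholder strings into NaN
df = df.replace(['No Country Listed', 'No Contient listed'], np.nan)
# NaN sorts last within each email, so ffill copies the valid row's values down
df = df.sort_values(['Email', 'Transaction_Country']).groupby('Email')[df.columns].ffill()
print(df)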
Output:
Email Transaction_Country Country_name Continent
0 [email protected] CA Canada North America
1 [email protected] CA Canada North America
3 [email protected] DE Germany Europe
2 [email protected] DE Germany Europe
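Note that the rows come back in sorted order (index 0, 1, 3, 2). If the original row order matters, calling sort_index() afterwards should restore it. A minimal sketch, assuming hypothetical addresses (a@example.com, b@example.com) and only one data column for brevity:

```python
import numpy as np
import pandas as pd

# Hypothetical addresses; rows 0-1 share one email, rows 2-3 the other.
df = pd.DataFrame({
    'Email': ['a@example.com', 'a@example.com', 'b@example.com', 'b@example.com'],
    'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
})

df = df.replace('No Country Listed', np.nan)
cols = ['Transaction_Country']

# NaN sorts last within each email, so ffill can copy the valid value down.
filled = df.sort_values(['Email', 'Transaction_Country'])
filled[cols] = filled.groupby('Email')[cols].ffill()

# Put the rows back in their original positions.
filled = filled.sort_index()
print(filled)
```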
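Looking at it again, this shorter version also works on the sample data. Without the groupby, though, an email with no valid row at all would pick up the previous email's values:
df = df.replace(['No Country Listed', 'No Contient listed'], np.nan)
df = df.sort_values(['Email', 'Transaction_Country']).ffill()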
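To see where the ungrouped version can differ from the grouped one, here is a small sketch on made-up data (the addresses and the three-row frame are assumptions, not from the question) in which the second email never has a known country:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second email has no row with a known country.
df = pd.DataFrame({
    'Email': ['a@example.com', 'a@example.com', 'b@example.com'],
    'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed'],
})
df = df.replace('No Country Listed', np.nan)
df = df.sort_values(['Email', 'Transaction_Country'])

# Plain ffill ignores email boundaries, so b@example.com inherits 'CA'.
plain = df['Transaction_Country'].ffill()

# Grouped ffill stays inside each email, so b@example.com remains NaN.
grouped = df.groupby('Email')['Transaction_Country'].ffill()

print(plain.tolist())
print(grouped.tolist())
```

If every email is guaranteed to have at least one valid row, both versions should give the same result.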
CodePudding user response:
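You can use pandas.Series.ffill to propagate the last country forward. Filling within each email, and back-filling as well, also covers rows whose valid value sits below them:
mask = df["Transaction_Country"].eq("No Country Listed")
df.loc[mask, "Transaction_Country"] = np.nan
# Assign the result back instead of using inplace=True on a column selection
df["Transaction_Country"] = df.groupby("Email")["Transaction_Country"].transform(lambda s: s.ffill().bfill())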
CodePudding user response:
You can replace the 'No Country Listed' and 'No Contient listed' values with NaN using DataFrame.mask
and then fill the missing values with the first valid country and continent for each e-mail.
masked_missing = df.mask(df.isin(['No Country Listed', 'No Contient listed']))
res = masked_missing.fillna(masked_missing.groupby('Email').transform('first'))
Output:
>>> res
Email Transaction_Country Country_name Continent
0 [email protected] CA Canada North America
1 [email protected] CA Canada North America
2 [email protected] DE Germany Europe
3 [email protected] DE Germany Europe
Setup:
import pandas as pd
df = pd.DataFrame({
'Email':['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
'Country_name': ['Canada', 'No Country Listed', 'No Country Listed', 'Germany'],
'Continent':['North America', 'No Contient listed', 'No Contient listed', 'Europe']
})
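To check this approach end to end, the sketch below reruns it on the same columns with hypothetical addresses (a@example.com and b@example.com are placeholders, not the addresses from the question). Since no sorting is involved, the rows should keep their original order:

```python
import pandas as pd

# Hypothetical addresses; rows 0-1 share one e-mail, rows 2-3 the other.
df = pd.DataFrame({
    'Email': ['a@example.com', 'a@example.com', 'b@example.com', 'b@example.com'],
    'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
    'Country_name': ['Canada', 'No Country Listed', 'No Country Listed', 'Germany'],
    'Continent': ['North America', 'No Contient listed', 'No Contient listed', 'Europe'],
})

placeholders = ['No Country Listed', 'No Contient listed']
masked = df.mask(df.isin(placeholders))

# 'first' returns the first non-null value per e-mail, broadcast to every row.
first_valid = masked.groupby('Email').transform('first')
res = masked.fillna(first_valid)
print(res)
```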