Pandas replace values in columns based on condition

I have a dataframe like below:

dummy_df_dict = {
    'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
    'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
    'Country_name': ['Canada', 'No Country Listed', 'No Country Listed', 'Germany'],
    'Continent': ['North America', 'No Contient listed', 'No Contient listed', 'Europe'],
}

               Email  Transaction_Country       Country_name           Continent
0  [email protected]                   CA             Canada       North America
1  [email protected]    No Country Listed  No Country Listed  No Contient listed
2  [email protected]    No Country Listed  No Country Listed  No Contient listed
3  [email protected]                   DE            Germany              Europe

I'm trying to replace the "No XYZ listed" text with the actual country / continent, taken from the row above or below that has the same email.

For example, in the second row (index 1), 'Transaction_Country' should show 'CA' instead of 'No Country Listed'. In the next row, for johnsmith, it should show 'DE'.

Many thanks!

CodePudding user response：

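import numpy as np
# Treat both placeholder strings as missing values
df = df.replace('No Country Listed', np.nan).replace('No Contient listed', np.nan)
# NaN sorts last, so each email's valid row comes first and can be forward-filled
df = df.sort_values(['Email', 'Transaction_Country']).groupby('Email')[df.columns].ffill()
print(df)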

Output:

                 Email Transaction_Country Country_name      Continent
0    [email protected]                  CA       Canada  North America
1    [email protected]                  CA       Canada  North America
3  [email protected]                  DE      Germany         Europe
2  [email protected]                  DE      Germany         Europe

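Looking at it again, I think this simpler version works as well, as long as every email has at least one row with a valid value (otherwise the previous email's values carry over):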

df = df.replace('No Country Listed', np.nan).replace('No Contient listed', np.nan)
df = df.sort_values(['Email', 'Transaction_Country']).ffill()
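As a rough illustration of how the grouped and ungrouped forward fills differ, here is a minimal sketch. The emails and the two-column frame are hypothetical, chosen so that one email never has a listed country:

```python
import numpy as np
import pandas as pd

# Hypothetical data: b@example.com never has a listed country.
df = pd.DataFrame({
    "Email": ["a@example.com", "a@example.com", "b@example.com"],
    "Transaction_Country": ["CA", np.nan, np.nan],
})

df = df.sort_values(["Email", "Transaction_Country"])

# Plain forward fill: the value from a@example.com spills into b@example.com.
plain = df["Transaction_Country"].ffill()

# Grouped forward fill: each email is filled only from its own rows.
grouped = df.groupby("Email")["Transaction_Country"].ffill()

print(plain.tolist())    # ['CA', 'CA', 'CA']
print(grouped.tolist())  # ['CA', 'CA', nan]
```

With the data from the question both versions give the same result, because each email has one valid row.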

CodePudding user response：

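You can use pandas.Series.ffill to propagate the last valid country forward. This does not take Email into account, so a missing row is filled from whatever row precedes it:

mask = df["Transaction_Country"].eq("No Country Listed")
df.loc[mask, "Transaction_Country"] = np.nan
# Assign the result back; ffill(inplace=True) on a selected column may not update df in recent pandas
df["Transaction_Country"] = df["Transaction_Country"].ffill()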
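Since the question asks to fill from the same email "above / below", one possible variant fills forward and then backward within each email, without sorting. This is only a sketch: the emails are hypothetical stand-ins, and only the 'Transaction_Country' column is handled.

```python
import pandas as pd

# Hypothetical emails; the layout mirrors the question
# (valid row first for one email, last for the other).
df = pd.DataFrame({
    "Email": ["a@example.com", "a@example.com", "b@example.com", "b@example.com"],
    "Transaction_Country": ["CA", "No Country Listed", "No Country Listed", "DE"],
})

# Treat the placeholder as missing (Series.mask puts NaN where the condition is True).
country = df["Transaction_Country"].mask(
    df["Transaction_Country"].eq("No Country Listed")
)

# Fill forward, then backward, but only within each email.
df["Transaction_Country"] = country.groupby(df["Email"]).transform(
    lambda s: s.ffill().bfill()
)

print(df["Transaction_Country"].tolist())  # ['CA', 'CA', 'DE', 'DE']
```

The backward fill is what covers the second email, where the valid row comes after the missing one.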

CodePudding user response：

You can replace the 'No Country Listed' and 'No Contient listed' values with NaN using DataFrame.mask, and then fill the missing values with the first valid country and continent for each email.

masked_missing = df.mask(df.isin(['No Country Listed', 'No Contient listed']))

res = masked_missing.fillna(masked_missing.groupby('Email').transform('first'))

Output:

>>> res 

                 Email Transaction_Country Country_name      Continent
0    [email protected]                  CA       Canada  North America
1    [email protected]                  CA       Canada  North America
2  [email protected]                  DE      Germany         Europe
3  [email protected]                  DE      Germany         Europe

Setup:

import pandas as pd

df = pd.DataFrame({
    'Email':['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
    'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
    'Country_name': ['Canada', 'No Country Listed', 'No Country Listed', 'Germany'],
    'Continent':['North America', 'No Contient listed', 'No Contient listed', 'Europe']
})
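The approach above relies on the groupby 'first' aggregation returning the first non-null value in each group, which transform then broadcasts back to every row. A small sketch of that behaviour, using hypothetical emails and a single column:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the valid value for b@example.com comes last.
s = pd.Series(["CA", np.nan, np.nan, "DE"], name="Transaction_Country")
email = pd.Series(
    ["a@example.com", "a@example.com", "b@example.com", "b@example.com"]
)

# One value per group: the first non-null entry.
first_per_email = s.groupby(email).first()
print(first_per_email.tolist())  # ['CA', 'DE']

# The same values broadcast back to the original row positions.
broadcast = s.groupby(email).transform("first")
print(broadcast.tolist())  # ['CA', 'CA', 'DE', 'DE']
```

Because the lookup is per group rather than positional, the row order within an email does not matter and no sorting is needed.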