Pandas replace values in columns based on condition

Time:06-16

I have a dataframe like below:

dummy_df_dict = {'Email': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
                 'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
                 'Country_name': ['Canada', 'No Country Listed', 'No Country Listed', 'Germany'],
                 'Continent': ['North America', 'No Contient listed', 'No Contient listed', 'Europe']}

   Email              Transaction_Country  Country_name       Continent
0  [email protected]  CA                   Canada             North America
1  [email protected]  No Country Listed    No Country Listed  No Contient listed
2  [email protected]  No Country Listed    No Country Listed  No Contient listed
3  [email protected]  DE                   Germany            Europe

I'm just trying to replace the "No XYZ listed" text with the actual country / continent, taken from the row with the same email above or below.

So for example, in the 2nd row (index 1), 'Transaction_Country' shows 'No Country Listed' and should instead be replaced with the value 'CA'. Then in the next row, for johnsmith, it should be replaced with 'DE'.

Many thanks!

CodePudding user response:

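import numpy as np
# Turn both placeholder strings into NaN so they can be filled.
df = df.replace('No Country Listed', np.nan).replace('No Contient listed', np.nan)
# NaN sorts last, so each email's valid row comes first and ffill copies it down.
df = df.sort_values(['Email', 'Transaction_Country']).groupby('Email')[df.columns].ffill()
print(df)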

Output:

                 Email Transaction_Country Country_name      Continent
0    [email protected]                  CA       Canada  North America
1    [email protected]                  CA       Canada  North America
3  [email protected]                  DE      Germany         Europe
2  [email protected]                  DE      Germany         Europe

Looking at it again, I think just this works as well:

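df = df.replace('No Country Listed', np.nan).replace('No Contient listed', np.nan)
# Without groupby, ffill copies from whichever row is above, so this relies on
# every email having at least one valid row.
df = df.sort_values(['Email', 'Transaction_Country']).ffill()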
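To see where the shorter version and the grouped version differ, here is a minimal sketch. The addresses are made-up placeholders (the real ones are hidden in the post), and the second user deliberately has no valid row at all, which is an assumption for illustration and not a case from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical addresses; b@example.com has no valid country anywhere.
df = pd.DataFrame({
    "Email": ["a@example.com", "a@example.com", "b@example.com"],
    "Transaction_Country": ["CA", "No Country Listed", "No Country Listed"],
})

df = df.replace("No Country Listed", np.nan)
df = df.sort_values(["Email", "Transaction_Country"])  # NaN sorts last

# Plain ffill copies from the row above, whoever it belongs to.
plain = df["Transaction_Country"].ffill()

# Grouped ffill only copies within the same email.
grouped = df.groupby("Email")["Transaction_Country"].ffill()

print(plain.tolist())
print(grouped.tolist())
```

With plain ffill the last row picks up 'CA' from a different user, while the grouped version leaves it as NaN. If every email is guaranteed to have one valid row, both give the same result.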

CodePudding user response:

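You can use pandas.Series.ffill to propagate the last valid country forward. Note that this fills by row position and does not look at the Email column:

mask = df["Transaction_Country"].eq("No Country Listed")
df.loc[mask, "Transaction_Country"] = np.nan
# Assign the result back; ffill(inplace=True) on a selected column may not update df in newer pandas versions.
df["Transaction_Country"] = df["Transaction_Country"].ffill()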
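Since the question asks for the value from the row above or below with the same email, one possible variant is to fill within each email group in both directions. This is only a sketch: the addresses are hypothetical stand-ins for the hidden ones, and the ffill-then-bfill lambda is my own choice, not something from the answer above.

```python
import pandas as pd

# Hypothetical addresses standing in for the hidden ones in the question.
df = pd.DataFrame({
    "Email": ["jane@example.com", "jane@example.com",
              "johnsmith@example.com", "johnsmith@example.com"],
    "Transaction_Country": ["CA", "No Country Listed", "No Country Listed", "DE"],
})

# Blank out the placeholder so it counts as missing.
placeholder = df["Transaction_Country"].eq("No Country Listed")
country = df["Transaction_Country"].mask(placeholder)

# Within each email, fill from the row above first, then from the row below.
df["Transaction_Country"] = country.groupby(df["Email"]).transform(
    lambda s: s.ffill().bfill()
)

print(df)
```

This keeps the original row order, so no sort is needed, and row 2 gets 'DE' from the row below it.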

CodePudding user response:

You can replace the 'No Country Listed' and 'No Contient listed' values with NaN using DataFrame.mask, and then fill the missing values with the first valid country and continent for each e-mail.

masked_missing = df.mask(df.isin(['No Country Listed', 'No Contient listed']))

res = masked_missing.fillna(masked_missing.groupby('Email').transform('first')) 

Output:

>>> res 

                 Email Transaction_Country Country_name      Continent
0    [email protected]                  CA       Canada  North America
1    [email protected]                  CA       Canada  North America
2  [email protected]                  DE      Germany         Europe
3  [email protected]                  DE      Germany         Europe
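The reason transform('first') works here is that the groupby 'first' aggregation takes the first non-null value in each group, so the position of the valid row inside the group does not matter. A small sketch on a single column, with hypothetical group keys in place of the hidden e-mails:

```python
import numpy as np
import pandas as pd

# Hypothetical keys; the valid value is second for john and first for jane.
keys = pd.Series(["john", "john", "jane", "jane"], name="Email")
country = pd.Series([np.nan, "DE", "CA", np.nan], name="Transaction_Country")

# 'first' skips NaN, and transform broadcasts the result back to every row.
first_valid = country.groupby(keys).transform("first")

print(first_valid.tolist())
```

Both rows of each group end up with that group's single valid value, which is then what fillna uses to fill the gaps.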

Setup:

import pandas as pd

df = pd.DataFrame({
    'Email':['[email protected]', '[email protected]', '[email protected]', '[email protected]'],
    'Transaction_Country': ['CA', 'No Country Listed', 'No Country Listed', 'DE'],
    'Country_name': ['Canada', 'No Country Listed', 'No Country Listed', 'Germany'],
    'Continent':['North America', 'No Contient listed', 'No Contient listed', 'Europe']
})