Python: Removing value from cell if it's not always the successor of other column-values-CodePudding

I've been fighting with a problem for days now. I have a list of 15k article names and separated the first two words of the article name to be able to identify the brand. My logic here is that for every fabricator, check if the second word always follows the same first word. If this is true, this indicates that the split is the full brand name and everything is ok. If not, the second word shall be removed, as it indicates that not only the brand name, but part of the product name is also included. Checking the fabricator is important as 'split1' might be the same for different fabricators, indicating a different brand.

TL;DR: Set 'split2' to NaN if it not always the succesor of 'split1' & 'fabricator'.

This is a simplified version of my dataframe:

d = {'fabricator':['coca cola corp.','coca cola corp.','coca cola corp.','haribo ltd','haribo ltd','haribo ltd'],'product name': ['coca cola light', 'coca cola zero', 'fanta', 'haribo gold bears', 'haribo gold bears soft','haribo berries'], 'split1': ['coca', 'coca','fanta', 'haribo', 'haribo','haribo'], 'split2': ['cola', 'cola',np.nan, 'gold','gold', 'berries']}
df = pd.DataFrame(data=d)
print(df)

        fabricator            product name  split1   split2
0  coca cola corp.         coca cola light    coca     cola
1  coca cola corp.          coca cola zero    coca     cola
2  coca cola corp.                   fanta   fanta      NaN
3       haribo ltd       haribo gold bears  haribo     gold
4       haribo ltd  haribo gold bears soft  haribo     gold
5       haribo ltd          haribo berries  haribo  berries

This is how it shall look like in the end:

        fabricator            product name  split1 split2
0  coca cola corp.         coca cola light    coca   cola
1  coca cola corp.          coca cola zero    coca   cola
2  coca cola corp.                   fanta   fanta    NaN
3       haribo ltd       haribo gold bears  haribo    NaN
4       haribo ltd  haribo gold bears soft  haribo    NaN
5       haribo ltd          haribo berries  haribo    NaN

This code works as expected for a small df, but is extremely slow for a large dataset & not "pythonesque":

for h in df['fabricator']:
     for s1 in df['split1']:
         if df.loc[(df['fabricator] == f) & (df['split1'] == s1), 'split2'].nunique() > 1:
             df.loc[(df['fabricator] == f) & (df[split1'] == s1), 'split2'] = str('')

I tried combining groupby(['fabricator','split1']) with .duplicated() and .transform() with no success. Any help would be much appreciated! Thank you!

CodePudding user response：

A straight forward way could be to create a filter based on number of values in split2 for each value of split1:

filter_ = df.groupby('split1').agg({'split2':'nunique'}).rename(columns = {'split2':'split2_filter'}).lt(2).reset_index()
filter_

This would look like:

	split1	split2_filter
0	coca	True
1	fanta	True
2	haribo	False

Now you can simply merge your original dataframe with this:

df = df.merge(filter_, on='split1')

And your dataframe now has a column on which you want to filter:

	fabricator	product name	split1	split2	split2_filter
0	coca cola corp.	coca cola light	coca	cola	True
1	coca cola corp.	coca cola zero	coca	cola	True
2	coca cola corp.	fanta	fanta	nan	True
3	haribo ltd	haribo gold bears	haribo	gold	False
4	haribo ltd	haribo gold bears soft	haribo	gold	False
5	haribo ltd	haribo berries	haribo	berries	False

Now you can filter as simply as:

df.loc[~df.split2_filter, 'split2'] = np.nan

And your dataframe looks like:

	fabricator	product name	split1	split2	split2_filter
0	coca cola corp.	coca cola light	coca	cola	True
1	coca cola corp.	coca cola zero	coca	cola	True
2	coca cola corp.	fanta	fanta	nan	True
3	haribo ltd	haribo gold bears	haribo	nan	False
4	haribo ltd	haribo gold bears soft	haribo	nan	False
5	haribo ltd	haribo berries	haribo	nan	False

You can also optionally drop the filter column:

df.drop(columns=['split2_filter'])

And your dataframe will look like:

	fabricator	product name	split1	split2
0	coca cola corp.	coca cola light	coca	cola
1	coca cola corp.	coca cola zero	coca	cola
2	coca cola corp.	fanta	fanta	nan
3	haribo ltd	haribo gold bears	haribo	nan
4	haribo ltd	haribo gold bears soft	haribo	nan
5	haribo ltd	haribo berries	haribo	nan

CodePudding user response：

This should do the trick (it has been substantially edited from when I first posted it):

pair_dict = {}
nanmapset = set()
for split1, split2 in df[['split1', 'split2']].values:
    if split1 not in pair_dict:
        pair_dict[split1] = split2
    elif split2 != pair_dict[split1]:
        nanmapset.add((split1, split2))
        nanmapset.add((split1, pair_dict[split1]))
df['split2'] = df.apply(lambda row: row['split2'] if (row['split1'], row['split2']) not in nanmapset else np.nan, axis=1)