I've been fighting with a problem for days now. I have a list of 15k article names and separated the first two words of the article name to be able to identify the brand. My logic here is that for every fabricator, check if the second word always follows the same first word. If this is true, this indicates that the split is the full brand name and everything is ok. If not, the second word shall be removed, as it indicates that not only the brand name, but part of the product name is also included. Checking the fabricator is important as 'split1' might be the same for different fabricators, indicating a different brand.
TL;DR: Set 'split2' to NaN if it not always the succesor of 'split1' & 'fabricator'.
This is a simplified version of my dataframe:
d = {'fabricator':['coca cola corp.','coca cola corp.','coca cola corp.','haribo ltd','haribo ltd','haribo ltd'],'product name': ['coca cola light', 'coca cola zero', 'fanta', 'haribo gold bears', 'haribo gold bears soft','haribo berries'], 'split1': ['coca', 'coca','fanta', 'haribo', 'haribo','haribo'], 'split2': ['cola', 'cola',np.nan, 'gold','gold', 'berries']}
df = pd.DataFrame(data=d)
print(df)
fabricator product name split1 split2
0 coca cola corp. coca cola light coca cola
1 coca cola corp. coca cola zero coca cola
2 coca cola corp. fanta fanta NaN
3 haribo ltd haribo gold bears haribo gold
4 haribo ltd haribo gold bears soft haribo gold
5 haribo ltd haribo berries haribo berries
This is how it shall look like in the end:
fabricator product name split1 split2
0 coca cola corp. coca cola light coca cola
1 coca cola corp. coca cola zero coca cola
2 coca cola corp. fanta fanta NaN
3 haribo ltd haribo gold bears haribo NaN
4 haribo ltd haribo gold bears soft haribo NaN
5 haribo ltd haribo berries haribo NaN
This code works as expected for a small df, but is extremely slow for a large dataset & not "pythonesque":
for h in df['fabricator']:
for s1 in df['split1']:
if df.loc[(df['fabricator] == f) & (df['split1'] == s1), 'split2'].nunique() > 1:
df.loc[(df['fabricator] == f) & (df[split1'] == s1), 'split2'] = str('')
I tried combining groupby(['fabricator','split1']) with .duplicated() and .transform() with no success. Any help would be much appreciated! Thank you!
CodePudding user response:
A straight forward way could be to create a filter based on number of values in split2
for each value of split1
:
filter_ = df.groupby('split1').agg({'split2':'nunique'}).rename(columns = {'split2':'split2_filter'}).lt(2).reset_index()
filter_
This would look like:
split1 | split2_filter | |
---|---|---|
0 | coca | True |
1 | fanta | True |
2 | haribo | False |
Now you can simply merge your original dataframe with this:
df = df.merge(filter_, on='split1')
And your dataframe now has a column on which you want to filter:
fabricator | product name | split1 | split2 | split2_filter | |
---|---|---|---|---|---|
0 | coca cola corp. | coca cola light | coca | cola | True |
1 | coca cola corp. | coca cola zero | coca | cola | True |
2 | coca cola corp. | fanta | fanta | nan | True |
3 | haribo ltd | haribo gold bears | haribo | gold | False |
4 | haribo ltd | haribo gold bears soft | haribo | gold | False |
5 | haribo ltd | haribo berries | haribo | berries | False |
Now you can filter as simply as:
df.loc[~df.split2_filter, 'split2'] = np.nan
And your dataframe looks like:
fabricator | product name | split1 | split2 | split2_filter | |
---|---|---|---|---|---|
0 | coca cola corp. | coca cola light | coca | cola | True |
1 | coca cola corp. | coca cola zero | coca | cola | True |
2 | coca cola corp. | fanta | fanta | nan | True |
3 | haribo ltd | haribo gold bears | haribo | nan | False |
4 | haribo ltd | haribo gold bears soft | haribo | nan | False |
5 | haribo ltd | haribo berries | haribo | nan | False |
You can also optionally drop the filter column:
df.drop(columns=['split2_filter'])
And your dataframe will look like:
fabricator | product name | split1 | split2 | |
---|---|---|---|---|
0 | coca cola corp. | coca cola light | coca | cola |
1 | coca cola corp. | coca cola zero | coca | cola |
2 | coca cola corp. | fanta | fanta | nan |
3 | haribo ltd | haribo gold bears | haribo | nan |
4 | haribo ltd | haribo gold bears soft | haribo | nan |
5 | haribo ltd | haribo berries | haribo | nan |
CodePudding user response:
This should do the trick (it has been substantially edited from when I first posted it):
pair_dict = {}
nanmapset = set()
for split1, split2 in df[['split1', 'split2']].values:
if split1 not in pair_dict:
pair_dict[split1] = split2
elif split2 != pair_dict[split1]:
nanmapset.add((split1, split2))
nanmapset.add((split1, pair_dict[split1]))
df['split2'] = df.apply(lambda row: row['split2'] if (row['split1'], row['split2']) not in nanmapset else np.nan, axis=1)