Home > database >  Python: Removing value from cell if it's not always the successor of other column-values
Python: Removing value from cell if it's not always the successor of other column-values

Time:05-25

I've been fighting with a problem for days now. I have a list of 15k article names and separated the first two words of the article name to be able to identify the brand. My logic here is that for every fabricator, check if the second word always follows the same first word. If this is true, this indicates that the split is the full brand name and everything is ok. If not, the second word shall be removed, as it indicates that not only the brand name, but part of the product name is also included. Checking the fabricator is important as 'split1' might be the same for different fabricators, indicating a different brand.

TL;DR: Set 'split2' to NaN if it not always the succesor of 'split1' & 'fabricator'.

This is a simplified version of my dataframe:

d = {'fabricator':['coca cola corp.','coca cola corp.','coca cola corp.','haribo ltd','haribo ltd','haribo ltd'],'product name': ['coca cola light', 'coca cola zero', 'fanta', 'haribo gold bears', 'haribo gold bears soft','haribo berries'], 'split1': ['coca', 'coca','fanta', 'haribo', 'haribo','haribo'], 'split2': ['cola', 'cola',np.nan, 'gold','gold', 'berries']}
df = pd.DataFrame(data=d)
print(df)
        fabricator            product name  split1   split2
0  coca cola corp.         coca cola light    coca     cola
1  coca cola corp.          coca cola zero    coca     cola
2  coca cola corp.                   fanta   fanta      NaN
3       haribo ltd       haribo gold bears  haribo     gold
4       haribo ltd  haribo gold bears soft  haribo     gold
5       haribo ltd          haribo berries  haribo  berries

This is how it shall look like in the end:

        fabricator            product name  split1 split2
0  coca cola corp.         coca cola light    coca   cola
1  coca cola corp.          coca cola zero    coca   cola
2  coca cola corp.                   fanta   fanta    NaN
3       haribo ltd       haribo gold bears  haribo    NaN
4       haribo ltd  haribo gold bears soft  haribo    NaN
5       haribo ltd          haribo berries  haribo    NaN 

This code works as expected for a small df, but is extremely slow for a large dataset & not "pythonesque":

for h in df['fabricator']:
     for s1 in df['split1']:
         if df.loc[(df['fabricator] == f) & (df['split1'] == s1), 'split2'].nunique() > 1:
             df.loc[(df['fabricator] == f) & (df[split1'] == s1), 'split2'] = str('')

I tried combining groupby(['fabricator','split1']) with .duplicated() and .transform() with no success. Any help would be much appreciated! Thank you!

CodePudding user response:

A straight forward way could be to create a filter based on number of values in split2 for each value of split1:

filter_ = df.groupby('split1').agg({'split2':'nunique'}).rename(columns = {'split2':'split2_filter'}).lt(2).reset_index()
filter_

This would look like:

split1 split2_filter
0 coca True
1 fanta True
2 haribo False

Now you can simply merge your original dataframe with this:

df = df.merge(filter_, on='split1')

And your dataframe now has a column on which you want to filter:

fabricator product name split1 split2 split2_filter
0 coca cola corp. coca cola light coca cola True
1 coca cola corp. coca cola zero coca cola True
2 coca cola corp. fanta fanta nan True
3 haribo ltd haribo gold bears haribo gold False
4 haribo ltd haribo gold bears soft haribo gold False
5 haribo ltd haribo berries haribo berries False

Now you can filter as simply as:

df.loc[~df.split2_filter, 'split2'] = np.nan

And your dataframe looks like:

fabricator product name split1 split2 split2_filter
0 coca cola corp. coca cola light coca cola True
1 coca cola corp. coca cola zero coca cola True
2 coca cola corp. fanta fanta nan True
3 haribo ltd haribo gold bears haribo nan False
4 haribo ltd haribo gold bears soft haribo nan False
5 haribo ltd haribo berries haribo nan False

You can also optionally drop the filter column:

df.drop(columns=['split2_filter'])

And your dataframe will look like:

fabricator product name split1 split2
0 coca cola corp. coca cola light coca cola
1 coca cola corp. coca cola zero coca cola
2 coca cola corp. fanta fanta nan
3 haribo ltd haribo gold bears haribo nan
4 haribo ltd haribo gold bears soft haribo nan
5 haribo ltd haribo berries haribo nan

CodePudding user response:

This should do the trick (it has been substantially edited from when I first posted it):

pair_dict = {}
nanmapset = set()
for split1, split2 in df[['split1', 'split2']].values:
    if split1 not in pair_dict:
        pair_dict[split1] = split2
    elif split2 != pair_dict[split1]:
        nanmapset.add((split1, split2))
        nanmapset.add((split1, pair_dict[split1]))
df['split2'] = df.apply(lambda row: row['split2'] if (row['split1'], row['split2']) not in nanmapset else np.nan, axis=1)
  • Related