Suppose a pandas Series of company names:
namelist = ['canadian pacific railway',
'nestlé canada',
'nestlé',
'chicken farmers of canada',
'cenovus energy',
'merck frosst canada',
'food banks canada',
'canadian fertilizer institute',
'balanceco',
'bell textron canada',
'safran landing systems canada',
'airbus canada',
'airbus',
'investment counsel association of canada',
'teck resources',
'fertilizer canada',
'engineers canada',
'google',
'google canada']
import pandas as pd
s = pd.Series(namelist)
I would like to isolate all rows of company names that are duplicated, whether or not they contain the word "canada". In this example, the filtered column should contain:
'nestlé canada'
'nestlé'
'airbus canada'
'airbus'
'google'
'google canada'
The goal is to standardize the names to a single form.
I can't wholesale remove the word, because there are other company names containing that word that I want to preserve, like "fertilizer canada" and "engineers canada".
Is there a regex pattern or another clever way to get this?
CodePudding user response:
You can strip the trailing "canada" and then use duplicated to build a Boolean mask for selecting rows from the original Series. (With keep=False, every occurrence of a duplicated value is marked True, not just the later ones.) Note that regex=True is needed for the $ anchor to be treated as a regex:
s[s.str.replace(r' canada$', '', regex=True).duplicated(keep=False)]
CodePudding user response:
s[s.str.replace(r'\s*\bcanada\b\s*', '', regex=True).duplicated(keep=False)]
Output:
1     nestlé canada
2            nestlé
11    airbus canada
12           airbus
17           google
18    google canada
dtype: object
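Since the stated goal is to standardize each duplicated pair to a single form, one possible follow-up sketch is to collapse only the rows flagged by the mask to their stripped base name, leaving unique names like "fertilizer canada" untouched. The trailing-anchored, word-bounded pattern here is my own choice (it avoids matching "canada" inside "canadian"); adjust it to your data:

```python
import pandas as pd

namelist = ['nestlé canada', 'nestlé', 'airbus canada', 'airbus',
            'fertilizer canada', 'google', 'google canada']
s = pd.Series(namelist)

# Candidate base name: drop a trailing, word-bounded " canada"
base = s.str.replace(r'\s*\bcanada\b\s*$', '', regex=True)

# Mark every row whose base form is shared by at least one other row
mask = base.duplicated(keep=False)

# Collapse only those rows to the base name; keep unique names as-is
standardized = s.where(~mask, base)
print(standardized.tolist())
# ['nestlé', 'nestlé', 'airbus', 'airbus', 'fertilizer canada', 'google', 'google']
```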