Suppose a pandas Series of company names:
namelist = ['canadian pacific railway',
'nestlé canada',
'nestlé',
'chicken farmers of canada',
'cenovus energy',
'merck frosst canada',
'food banks canada',
'canadian fertilizer institute',
'balanceco',
'bell textron canada',
'safran landing systems canada',
'airbus canada',
'airbus',
'investment counsel association of canada',
'teck resources',
'fertilizer canada',
'engineers canada',
'google',
'google canada']
import pandas as pd
s = pd.Series(namelist)
I would like to isolate all rows of company names that are duplicated, whether or not they contain the word "canada". In this example, the filtered column should contain:
'nestlé canada'
'nestlé'
'airbus canada'
'airbus'
'google'
'google canada'
The goal is to standardize the names to a single form.
I can't wholesale remove the word, because there are other company names containing that word that I want to preserve, like "fertilizer canada" and "engineers canada".
Is there a regex pattern or another clever way to get this?
CodePudding user response:
You can strip the trailing "canada" and then use duplicated to build a Boolean mask for selecting rows from the original Series. (With keep=False, every occurrence of a duplicated value is marked True, not just the later ones.) Note that regex=True is needed for the $ anchor to be treated as a regex:
s[s.str.replace(r' canada$', '', regex=True).duplicated(keep=False)]
CodePudding user response:
s[s.str.replace(r'\s*\bcanada\b\s*', '', regex=True).duplicated(keep=False)]
Output:
1     nestlé canada
2            nestlé
11    airbus canada
12           airbus
17           google
18    google canada
dtype: object
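Since the stated goal is to standardize each duplicated pair to a single form, one possible follow-up sketch is to collapse only the rows flagged by the mask to their stripped base name, leaving unique names like "fertilizer canada" untouched. The trailing-anchored, word-bounded pattern here is my own choice (it avoids matching "canada" inside "canadian"); adjust it to your data:

```python
import pandas as pd

namelist = ['nestlé canada', 'nestlé', 'airbus canada', 'airbus',
            'fertilizer canada', 'google', 'google canada']
s = pd.Series(namelist)

# Candidate base name: drop a trailing, word-bounded " canada"
base = s.str.replace(r'\s*\bcanada\b\s*$', '', regex=True)

# Mark every row whose base form is shared by at least one other row
mask = base.duplicated(keep=False)

# Collapse only those rows to the base name; keep unique names as-is
standardized = s.where(~mask, base)
print(standardized.tolist())
# ['nestlé', 'nestlé', 'airbus', 'airbus', 'fertilizer canada', 'google', 'google']
```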