How to replace strings in pandas column that are in a list?

I have scrolled through the posts on this question and was unable to find an answer to my situation. I have a pandas dataframe with a list of company names, some of which are represented as their domain name.

df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])

I want to remove the domain extension from all the strings. The extensions are given in a list:

web_domains = ['.com', '.us']

The following attempt did not yield any results:

df['firm'].str.lower().replace(web_domains, '')

Can someone please help me out, and possibly also explain why my solution does not work?

CodePudding user response：

You need to pass regex=True to Series.replace, because with regex=False (the default) it only replaces values that match the target string exactly.

For example, a is replaced only when the cell value is exactly a, not when it is ab or bab.

web_domains = [r'\.com', r'\.us'] # raw strings; escape `.` for regex=True

df['removed'] = df['firm'].str.lower().replace(web_domains, '', regex=True)

print(df)

          firm    removed
0    amazon.us     amazon
1        pepsi      pepsi
2  YOUTUBE.COM    youtube
3    apple.inc  apple.inc
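
The difference between the two modes can be seen on a small Series. This is only a minimal sketch: the sample values (including the bare '.us' entry) and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the second value is exactly equal to one of the targets.
s = pd.Series(['amazon.us', '.us', 'youtube.com'])
targets = ['.com', '.us']

# regex=False (the default): only cells equal to a target are replaced.
exact = s.replace(targets, '')
print(exact.tolist())  # ['amazon.us', '', 'youtube.com']

# regex=True: the targets are treated as patterns and matched inside each string.
patterns = [r'\.com', r'\.us']
partial = s.replace(patterns, '', regex=True)
print(partial.tolist())  # ['amazon', '', 'youtube']
```

Note that these patterns are not anchored, so they would also remove a `.com` or `.us` that appears in the middle of a value.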

CodePudding user response：

Form a regex alternation from the domain endings, then use str.replace to remove them:

web_domains = ['com', 'us']
regex = r'\.(?:' + r'|'.join(web_domains) + r')$'
df["firm"] = df["firm"].str.replace(regex, '', regex=True)
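
As written, this pattern is case-sensitive, so an upper-case ending such as .COM would be left alone. One possible way to cover it, sketched here on the assumption that case-insensitive matching is what you want, is the case=False argument of str.replace (the name cleaned is just for illustration):

```python
import pandas as pd

df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
web_domains = ['com', 'us']

# Non-capturing group of endings, anchored at the end of the string.
regex = r'\.(?:' + '|'.join(web_domains) + r')$'

# case=False makes the match case-insensitive, so '.COM' is removed as well.
cleaned = df['firm'].str.replace(regex, '', case=False, regex=True)
print(cleaned.tolist())  # ['amazon', 'pepsi', 'YOUTUBE', 'apple.inc']
```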

CodePudding user response：

Use str.replace with a regular expression:

import re
import pandas as pd

web_domains = ['.com', '.us']
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])

regex_pattern = f"({ '|'.join(re.escape(wb) for wb in web_domains)})$"
res = df["firm"].str.replace(regex_pattern, "", regex=True, flags=re.IGNORECASE)
print(res)

Output

0       amazon
1        pepsi
2      YOUTUBE
3    apple.inc
Name: firm, dtype: object

The variable regex_pattern points to the string:

(\.com|\.us)$

It matches either .com or .us, but only at the end of the string.
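
To check the pattern outside pandas, a small sketch with the standard re module may help. The extra name 'tech.company' is a hypothetical test value, added only to show that a .com in the middle of a string is not touched.

```python
import re

web_domains = ['.com', '.us']

# Same construction as above: escape each ending, join with |, anchor with $.
pattern = re.compile(
    '(' + '|'.join(re.escape(d) for d in web_domains) + ')$',
    flags=re.IGNORECASE,
)
print(pattern.pattern)  # (\.com|\.us)$

# 'tech.company' contains '.com', but not at the end, so it stays unchanged.
for name in ['amazon.us', 'YOUTUBE.COM', 'apple.inc', 'tech.company']:
    print(name, '->', pattern.sub('', name))
```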

CodePudding user response：

explain why my solution does not work?

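The first argument of pandas.Series.str.replace should be

str or compiled regex

so passing a list there is not allowed. Note, however, that the call in the question is actually Series.replace, because .str.lower() returns a plain Series. Series.replace does accept a list, but without regex=True it only replaces values that are exactly equal to one of the list elements, which is why nothing changed.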