How to replace strings in pandas column that are in a list?

Time:07-22

I have scrolled through the posts on this question and was unable to find an answer to my situation. I have a pandas dataframe with a list of company names, some of which are represented as their domain name.

df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])

I want to remove the domain extension from all the strings. The extensions are given in a list:

web_domains = ['.com', '.us']

The following attempt did not yield any results:

df['firm'].str.lower().replace(web_domains, '')

Can someone please help me out, and possibly also explain why my solution does not work?

CodePudding user response:

You need to pass regex=True to Series.replace. With the default regex=False, it only replaces values that match the given string exactly and in full.

For example, a is replaced only when the value is exactly a, not when it is ab or bab.

web_domains = [r'\.com', r'\.us'] # raw strings; escape `.` for regex=True

df['removed'] = df['firm'].str.lower().replace(web_domains, '', regex=True)
print(df)

          firm    removed
0    amazon.us     amazon
1        pepsi      pepsi
2  YOUTUBE.COM    youtube
3    apple.inc  apple.inc
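
To see the whole-value matching described in this answer in isolation, here is a minimal sketch. The sample values a, ab and bab are taken from the sentence above; the replacement 'X' and the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'ab', 'bab'])  # illustrative sample values

# Default (regex=False): only values that are exactly 'a' are replaced.
exact = s.replace('a', 'X')

# regex=True: 'a' is treated as a pattern and replaced wherever it occurs.
partial = s.replace('a', 'X', regex=True)

print(exact.tolist())    # ['X', 'ab', 'bab']
print(partial.tolist())  # ['X', 'Xb', 'bXb']
```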

CodePudding user response:

Form a regex alternation from the domain endings, then use str.replace to remove them:

web_domains = ['com', 'us']
regex = r'\.(?:' + r'|'.join(web_domains) + r')$'
df["firm"] = df["firm"].str.replace(regex, '', regex=True)
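
One caveat: this pattern is case-sensitive, so it would probably leave YOUTUBE.COM untouched. A possible variant, sketched here on the frame from the question, passes case=False to str.replace. The re.escape call is an extra precaution that the answer itself does not use, and the name cleaned is only illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
web_domains = ['com', 'us']

# Build \.(?:com|us)$ ; re.escape guards against endings that contain regex metacharacters.
regex = r'\.(?:' + '|'.join(re.escape(d) for d in web_domains) + r')$'

# case=False makes the match case-insensitive; the original casing of the name is kept.
cleaned = df['firm'].str.replace(regex, '', case=False, regex=True)
print(cleaned.tolist())  # ['amazon', 'pepsi', 'YOUTUBE', 'apple.inc']
```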

CodePudding user response:

Use str.replace with a regular expression:

import re
import pandas as pd

web_domains = ['.com', '.us']
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])

regex_pattern = f"({ '|'.join(re.escape(wb) for wb in web_domains)})$"
res = df["firm"].str.replace(regex_pattern, "", regex=True, flags=re.IGNORECASE)
print(res)

Output

0       amazon
1        pepsi
2      YOUTUBE
3    apple.inc
Name: firm, dtype: object

The variable regex_pattern points to the string:

(\.com|\.us)$

It matches either .com or .us at the end of the string; because of flags=re.IGNORECASE, upper-case endings such as .COM match as well.
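
The pattern can also be checked without pandas, using only the standard re module. In this sketch the names us.steel and dotcom are made-up test values, added to show that the pattern is anchored to the end of the string and that the dot is matched literally.

```python
import re

web_domains = ['.com', '.us']
pattern = f"({'|'.join(re.escape(d) for d in web_domains)})$"
compiled = re.compile(pattern, flags=re.IGNORECASE)

# 'us.steel' and 'dotcom' are hypothetical values that should stay unchanged.
names = ['amazon.us', 'YOUTUBE.COM', 'apple.inc', 'us.steel', 'dotcom']
cleaned = [compiled.sub('', name) for name in names]

print(pattern)  # (\.com|\.us)$
print(cleaned)  # ['amazon', 'YOUTUBE', 'apple.inc', 'us.steel', 'dotcom']
```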

CodePudding user response:

explain why my solution does not work?

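The first argument of pandas.Series.str.replace should be

str or compiled regex

so passing a list there is not allowed. Note, however, that .str.lower() returns a new Series, so the .replace in your code is actually pandas.Series.replace. That method does accept a list, but with the default regex=False it only replaces values that equal one of the list elements in full, and none of your values is exactly .com or .us.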
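
As a quick check of the call from the question: .str.lower() returns a Series, so the chained .replace is Series.replace, which runs without error but only swaps out values that are equal to a list element. This is a sketch under that reading; the second Series (probe) is a made-up example.

```python
import pandas as pd

df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
web_domains = ['.com', '.us']

# The call from the question: no value equals '.com' or '.us', so only the lowercasing shows.
result = df['firm'].str.lower().replace(web_domains, '')
print(result.tolist())  # ['amazon.us', 'pepsi', 'youtube.com', 'apple.inc']

# Hypothetical probe: a value that is exactly '.com' does get replaced.
probe = pd.Series(['.com', 'x.com']).replace(web_domains, '')
print(probe.tolist())  # ['', 'x.com']
```

Also note that the expression in the question returns a new Series; its result is not assigned back to df['firm'].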
