I have scrolled through the posts on this question and was unable to find an answer to my situation. I have a pandas dataframe with a list of company names, some of which are represented as their domain name.
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
I want to remove the domain extension from all the strings. The extensions are given in a list:
web_domains = ['.com', '.us']
The following attempt did not yield any results:
df['firm'].str.lower().replace(web_domains, '')
Can someone please help me out, and possibly also explain why my solution does not work?
CodePudding user response:
You need to use regex=True for Series.replace, since it only matches whole values under regex=False. For example, a will only be replaced when the target is exactly a, not ab or bab.
web_domains = [r'\.com', r'\.us']  # raw strings; escape `.` so it is a literal dot under regex=True
df['removed'] = df['firm'].str.lower().replace(web_domains, '', regex=True)
print(df)
          firm    removed
0    amazon.us     amazon
1        pepsi      pepsi
2  YOUTUBE.COM    youtube
3    apple.inc  apple.inc
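To make the whole-value versus substring distinction concrete, here is a minimal sketch using the a / ab / bab example from above. The series `s` and the replacement value `'X'` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy series, mirroring the a / ab / bab example
s = pd.Series(['a', 'ab', 'bab'])

# regex=False (the default): only values exactly equal to 'a' are replaced
exact = s.replace('a', 'X')

# regex=True: every occurrence of the pattern inside each string is replaced
partial = s.replace('a', 'X', regex=True)

print(exact.tolist())    # ['X', 'ab', 'bab']
print(partial.tolist())  # ['X', 'Xb', 'bXb']
```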
CodePudding user response:
Form a regex alternation from the domain endings, then use str.replace to remove them:
web_domains = ['com', 'us']
regex = r'\.(?:' + r'|'.join(web_domains) + r')$'  # yields \.(?:com|us)$
df["firm"] = df["firm"].str.replace(regex, '', regex=True)
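Note that an alternation like this is case-sensitive by default, so an upper-case value such as YOUTUBE.COM would be left untouched. One possible way around that, sketched here on the question's data, is the `case=False` argument of str.replace. The variable name `cleaned` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
web_domains = ['com', 'us']

# Pattern: a literal dot, one of the endings, then end of string
regex = r'\.(?:' + '|'.join(web_domains) + r')$'

# case=False makes the match case-insensitive, so 'YOUTUBE.COM' is handled too
cleaned = df['firm'].str.replace(regex, '', case=False, regex=True)

print(cleaned.tolist())  # ['amazon', 'pepsi', 'YOUTUBE', 'apple.inc']
```

If the endings could contain regex metacharacters, they would need escaping (for example with re.escape) before being joined.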
CodePudding user response:
Use str.replace with a regular expression:
import re
import pandas as pd
web_domains = ['.com', '.us']
df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
regex_pattern = f"({ '|'.join(re.escape(wb) for wb in web_domains)})$"
res = df["firm"].str.replace(regex_pattern, "", regex=True, flags=re.IGNORECASE)
print(res)
Output
0       amazon
1        pepsi
2      YOUTUBE
3    apple.inc
Name: firm, dtype: object
The variable regex_pattern points to the string (\.com|\.us)$, which matches either .com or .us at the end of the string.
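The following sketch shows why the trailing `$` matters, using plain `re` on a single made-up name (`'my.company.com'` is not from the question). Without the anchor, the `.com` at the start of `.company` is removed as well.

```python
import re

web_domains = ['.com', '.us']

# re.escape turns '.' into '\.', so the dot is matched literally
alternation = '|'.join(re.escape(wb) for wb in web_domains)
anchored_pattern = f"({alternation})$"   # only at the end of the string
unanchored_pattern = f"({alternation})"  # anywhere in the string

name = 'my.company.com'  # hypothetical value for illustration
anchored = re.sub(anchored_pattern, '', name, flags=re.IGNORECASE)
unanchored = re.sub(unanchored_pattern, '', name, flags=re.IGNORECASE)

print(anchored_pattern)  # (\.com|\.us)$
print(anchored)          # my.company
print(unanchored)        # mypany
```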
CodePudding user response:
explain why my solution does not work?
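The first argument of pandas.Series.str.replace should be a str or a compiled regex; passing a list there is not allowed. Note, however, that the code in the question calls .replace on the result of .str.lower(), which is Series.replace. That method does accept a list, but without regex=True it only replaces values that match an element of the list exactly.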
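As a quick check on the question's own data: `.str.lower()` returns a plain Series, so the chained call resolves to Series.replace. The sketch below shows that the original call leaves every value unchanged, because no value is exactly '.com' or '.us'. The variable names `lowered` and `attempt` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(['amazon.us', 'pepsi', 'YOUTUBE.COM', 'apple.inc'], columns=['firm'])
web_domains = ['.com', '.us']

lowered = df['firm'].str.lower()

# Series.replace without regex=True compares whole values against the list,
# and no company name equals '.com' or '.us', so nothing is replaced
attempt = lowered.replace(web_domains, '')

print(attempt.equals(lowered))  # True
```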