I have a dataframe with a column named Website that consists of string values. Here is a sample:
Website
www.trend-setter.com
gmail.com
78388383
www.yahoo.com
wis.pr
mail.yahoo.com
www.mail.yahoo.com
www.google.com
I want to parse the domain names out like the following (keeping the suffix), but default to the original value if the string is not a website or if the field is already parsed appropriately:
Website
trend-setter.com
gmail.com
78388383
yahoo.com
wis.pr
mail.yahoo.com
mail.yahoo.com
google.com
I have tried the following, but can't figure out how to make it default to the above:
import re
df['Website'].apply(lambda x: re.findall(r'www.([\w\-\.]+)', x))
CodePudding user response:
If the aim is just to remove the www. prefix, you can use:
df['Website'].str.replace(r'^www\.', '', regex=True)
Output:
0 trend-setter.com
1 gmail.com
2 78388383
3 yahoo.com
4 wis.pr
5 mail.yahoo.com
6 mail.yahoo.com
7 google.com
Name: Website, dtype: object
CodePudding user response:
If you want a 100% regex solution, this works for me:
(?:www\.)?(?P<url>[\w\-]+\.([\w\-]+\.?)+)
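To get the "default to the original value" behaviour the question asks for, one option is to wrap the pattern in a small helper. This is only a sketch: it assumes each character class in the pattern is quantified by `+`, it requires the whole string to match (via `re.fullmatch`), and the names `DOMAIN_RE` and `strip_www` are made up for illustration.

```python
import re

# The pattern from this answer, with each character class quantified by `+`.
# DOMAIN_RE and strip_www are hypothetical names used only for this sketch.
DOMAIN_RE = re.compile(r"(?:www\.)?(?P<url>[\w\-]+\.([\w\-]+\.?)+)")


def strip_www(value):
    """Return the domain without a leading 'www.', or the input unchanged."""
    match = DOMAIN_RE.fullmatch(value)
    # Fall back to the original string when the value does not look like a domain.
    return match.group("url") if match else value


# Sample values taken from the question.
samples = [
    "www.trend-setter.com",
    "gmail.com",
    "78388383",
    "www.yahoo.com",
    "wis.pr",
    "mail.yahoo.com",
    "www.mail.yahoo.com",
    "www.google.com",
]

cleaned = [strip_www(s) for s in samples]
print(cleaned)
```

On the dataframe, assuming the column holds only strings, the same helper could be applied with df['Website'] = df['Website'].apply(strip_www).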