Parse domain names from different URL formats

Time:10-27

I have a dataframe with a column named Website that consists of string values. Here is a sample:

Website

www.trend-setter.com
gmail.com
78388383
www.yahoo.com
wis.pr
mail.yahoo.com
www.mail.yahoo.com
www.google.com

I want to parse the domain names out like the following (keeping the suffix), but default to the original value if the string is not a website or if the field is already parsed appropriately:

Website

trend-setter.com
gmail.com
78388383
yahoo.com
wis.pr
mail.yahoo.com
mail.yahoo.com
google.com

I have tried the following, but I can't figure out how to make it fall back to the original value when there is no match:

import re
df['Website'].apply(lambda x: re.findall(r'www\.([\w\-.]+)', x))

CodePudding user response:

If the aim is just to remove the www. prefix, you can use:

df['Website'].str.replace(r'^www\.', '', regex=True)

Output:

0    trend-setter.com
1           gmail.com
2            78388383
3           yahoo.com
4              wis.pr
5      mail.yahoo.com
6      mail.yahoo.com
7          google.com
Name: Website, dtype: object
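
To see why this already "defaults" to the original value, here is a rough plain-Python sketch of the same substitution applied to one string at a time. The helper name strip_www and the samples list are made up for illustration; only the re module is used. A substitution that finds no match simply returns its input unchanged, so non-website values such as 78388383 pass straight through.

```python
import re

# Same pattern as the str.replace call above: a literal "www." at the start only.
WWW_PREFIX = re.compile(r'^www\.')

def strip_www(value):
    """Hypothetical helper: drop a leading 'www.', leave anything else as-is."""
    return WWW_PREFIX.sub('', value)

# Sample values taken from the question.
samples = [
    'www.trend-setter.com', 'gmail.com', '78388383', 'www.yahoo.com',
    'wis.pr', 'mail.yahoo.com', 'www.mail.yahoo.com', 'www.google.com',
]
cleaned = [strip_www(s) for s in samples]
print(cleaned)
```

If you prefer a function over the vectorized call, something like df['Website'].map(strip_www) should give the same column, assuming every value is a string.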

CodePudding user response:

If you want a 100% regex solution, this works for me:

(?:www\.)?(?P<url>[\w\-]+\.([\w\-]+\.?)+)
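
A minimal sketch of how that pattern might be used with a fallback, written here with + quantifiers on the repeated parts. The helper name parse_domain is hypothetical, and using re.fullmatch (so the whole string must look like a domain) is my assumption rather than part of the answer. When there is no match, the original value is returned.

```python
import re

# Optional "www." prefix, then a named group "url" holding the rest of the domain.
DOMAIN = re.compile(r'(?:www\.)?(?P<url>[\w\-]+\.([\w\-]+\.?)+)')

def parse_domain(value):
    """Hypothetical helper: return the 'url' group, or the input if it is not a domain."""
    match = DOMAIN.fullmatch(value)
    return match.group('url') if match else value

# Sample values taken from the question.
samples = [
    'www.trend-setter.com', 'gmail.com', '78388383', 'www.yahoo.com',
    'wis.pr', 'mail.yahoo.com', 'www.mail.yahoo.com', 'www.google.com',
]
parsed = [parse_domain(s) for s in samples]
print(parsed)
```

On a dataframe this could be applied with df['Website'].map(parse_domain), again assuming the column holds only strings.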