I have a pandas column of hyperlinks and I want to extract only the domain name, excluding parts such as "http://", "www." and ".com".
The following code works for most of my cases, but there is one where it does not return the desired string:
docs['link_title'] = docs['hyperlink'].str.extract(r'(?<=\.)(.*?)(?=\.)')
Below are examples of hyperlinks and the results:
http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/ -> "traveldailymedia"
https://www.instagram.com/p/BKDJcO-htRs/ -> "instagram"
But this is an example where I don't get the title of the domain:
http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html -> "vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes"
Because there is no dot before the domain name in this URL (it has no "www." prefix), the regex starts matching after the first dot it finds and misses the name, which is "dtinews".
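To make the failure concrete, here is a minimal reproduction that uses Python's re module directly instead of pandas (a simplification on my part; the pattern is the same one passed to str.extract above). The names pattern, urls and results are only illustrative.

```python
import re

# Same pattern as in the question: capture the text between the
# first dot and the next dot in the string.
pattern = re.compile(r'(?<=\.)(.*?)(?=\.)')

urls = [
    "http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/",
    "https://www.instagram.com/p/BKDJcO-htRs/",
    "http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html",
]

# The first match is what str.extract would return for each row.
results = [pattern.search(url).group(1) for url in urls]

for url, result in zip(urls, results):
    print(url, "->", result)
```

For the third URL the first dot is the one after "dtinews", so the capture runs from "vn" up to the dot before "html".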
I would appreciate help with the regex here or some alternative to my approach.
CodePudding user response:
You can use tldextract:
import tldextract
import pandas as pd
docs = pd.DataFrame({'hyperlink':["http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/","https://www.instagram.com/p/BKDJcO-htRs/","http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html"]})
docs['link_title'] = docs['hyperlink'].apply(lambda x: tldextract.extract(x).domain)
Output:
>>> docs['link_title']
0 traveldailymedia
1 instagram
2 dtinews
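If adding a dependency is not an option, a rough standard-library sketch is possible with urllib.parse. Note that this is not what tldextract does: the helper below (domain_label is a hypothetical name) simply strips an optional leading "www." and returns the first label of the hostname, assuming each URL includes a scheme such as "http://".

```python
import re
from urllib.parse import urlparse


def domain_label(url):
    """Return the first hostname label after an optional 'www.' prefix.

    Hypothetical helper for illustration; it does not consult the
    public suffix list the way tldextract does.
    """
    # hostname is None when the URL has no scheme/netloc part.
    host = urlparse(url).hostname or ""
    # Drop a leading "www." only; other subdomains are left in place.
    host = re.sub(r"^www\.", "", host)
    return host.split(".")[0]


urls = [
    "http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/",
    "https://www.instagram.com/p/BKDJcO-htRs/",
    "http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html",
]
labels = [domain_label(url) for url in urls]
print(labels)
```

It can be applied to the column in the same way as above, with docs['hyperlink'].apply(domain_label). The trade-off is that other subdomains are not handled: a host like "news.bbc.co.uk" would give "news" here, so tldextract remains the safer choice for mixed data.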