Extract only title from hyperlink in pandas column

Time:04-22

I have a pandas column of hyperlinks and I want to extract only the domain name, excluding the scheme ("http://"), any "www." prefix, and the suffix (".com").

The following code works for most of my cases but there is one where it does not return the desired string:

docs['link_title'] = docs['hyperlink'].str.extract(r'(?<=\.)(.*?)(?=\.)')

Below are examples of hyperlinks and the results:

http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/
-> "traveldailymedia"

https://www.instagram.com/p/BKDJcO-htRs/ -> "instagram"

But this is an example where I don't get the title of the domain:

http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html
-> "vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes"

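Because this hostname has no "www." prefix, there is no dot before "dtinews". The regex therefore starts matching after the first dot it finds (the one before "vn") and runs to the next dot (the one before "html"), so it never captures the name I want, which is "dtinews".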

I would appreciate help with the regex here or some alternative to my approach.

CodePudding user response:

You can use tldextract:

import tldextract
import pandas as pd
docs = pd.DataFrame({'hyperlink': [
    "http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/",
    "https://www.instagram.com/p/BKDJcO-htRs/",
    "http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html",
]})
docs['link_title'] = docs['hyperlink'].apply(lambda x: tldextract.extract(x).domain)

Output:

>>> docs['link_title']
0    traveldailymedia
1           instagram
2             dtinews
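
If installing tldextract is not an option, a rough standard-library alternative is to parse the hostname and take its first label after an optional "www." prefix. This is only a sketch: first_label is a made-up helper name, and it assumes every URL includes a scheme such as http://. Unlike tldextract it does not consult the public suffix list, so a hostname like blog.example.co.uk would give "blog" rather than "example".

```python
from urllib.parse import urlsplit


def first_label(url):
    """Return the first hostname label, ignoring a leading 'www.'.

    Hypothetical helper for illustration; assumes the URL has a scheme,
    otherwise urlsplit finds no hostname and an empty string is returned.
    """
    host = urlsplit(url).hostname or ""  # e.g. 'www.instagram.com'
    if host.startswith("www."):
        host = host[len("www."):]
    return host.split(".")[0]


links = [
    "http://www.traveldailymedia.com/240881/qantas-launches-uk-agent-incentive/",
    "https://www.instagram.com/p/BKDJcO-htRs/",
    "http://dtinews.vn/en/news/018/46981/vietnam-to-buy-40-airbus-planes.html",
]
titles = [first_label(u) for u in links]
print(titles)  # ['traveldailymedia', 'instagram', 'dtinews']
```

To fill the column, something like docs['link_title'] = docs['hyperlink'].apply(first_label) should work.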