Problem finding link in text using regular expression

Time:03-09

I'm parsing a website and have reached the stage of building the JSON output. The text contains links that I would like to keep as well. I use a regular expression to extract them, but it does not work quite correctly.

Text:

  \r\n<li><a href="http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations documents</a></li>\r\n</ul>\r\n'

Regex:

link = re.findall(r"(?P<url>https?://\S+)", str(region))
print(link)

Result:

['http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations']

How can I change the regular expression so that it drops the extra word at the end?

CodePudding user response:

If the link is always inside an href attribute in the format you showed, delimited by " or ', you could use:

import re

region = "  \r\n<li><a href=\'http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf\'>Regulations documents</a></li>\r\n</ul>\r\n'"
# Match from the scheme up to, but not including, the first quote character.
link = re.findall("https?://[^\"']*", region)
print(link)
> ['http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf']

This matches any substring that starts with http:// or https:// and continues up to the first " or ' character. It should work because a literal " has to be percent-encoded inside a URL, and an unencoded ' is rare in practice, so neither is likely to occur in the middle of the URL.
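To see why the original pattern overshoots, here is a small side-by-side sketch. It assumes the double-quoted input from the question; the variable names greedy and bounded are only for illustration.

```python
import re

# Sample input modeled on the text from the question (double-quoted href).
region = '\r\n<li><a href="http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations documents</a></li>\r\n</ul>\r\n'

# \S+ only stops at whitespace, so it runs past the closing quote
# and swallows everything up to the first space.
greedy = re.findall(r"https?://\S+", region)

# The negated character class stops at the first quote character instead.
bounded = re.findall(r"https?://[^\"']+", region)

print(greedy)
print(bounded)
```

The first list still contains the trailing ">Regulations, the second contains only the URL.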

Generally, parsing HTML with regex yourself is a bad idea and makes life harder than it needs to be. Use a parser and access the attributes properly, for example:

from bs4 import BeautifulSoup

soup = BeautifulSoup(region, "lxml")
link = soup.find('li').find('a')['href']  # finds the first <a> in the first <li> and returns the value of the attribute 'href'
print(link)
> http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf

Or if you have a document with many of those <li> tags:

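for li in soup.find_all('li'):  # find_all is the current name; findAll is a legacy alias
    a = li.find('a')
    if a is not None and a.has_attr('href'):  # skip <li> items that have no link
        print(a['href'])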
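If installing BeautifulSoup is not an option, a similar result can probably be had with the standard library's html.parser module. This is a different technique from the one above, sketched under the assumption that every link of interest sits in the href of an <a> tag. The class name HrefCollector is made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Sample input modeled on the text from the question.
region = '\r\n<li><a href="http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations documents</a></li>\r\n</ul>\r\n'

collector = HrefCollector()
collector.feed(region)
collector.close()
print(collector.links)
```

The quotes around the attribute value are handled by the parser, so it does not matter whether the page uses " or '.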