I'm parsing a website and have reached the stage of building the JSON output. The text contains links that I would like to keep as well. I'm using a regular expression to extract them, but it doesn't work quite correctly.
Text:
\r\n<li><a href="http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations documents</a></li>\r\n</ul>\r\n'
Regex:
link = re.findall("(?P<url>https?://\S+)", str(region))
print(link)
Result:
['http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations']
How can I change the regular expression so that it drops the extra text at the end?
Answer:
If the URL is always inside an href attribute delimited by " or ', as in your example, you could use:
import re
region = "\r\n<li><a href='http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf'>Regulations documents</a></li>\r\n</ul>\r\n"
link = re.findall("https?://[^\"']*", region)
print(link)
> ['http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf']
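This matches any substring starting with https:// or http:// and everything after it up to the first " or ' character. It should work because " has to be percent-encoded in a URL, so it cannot occur in the middle of one. Note that ' is technically allowed in URLs, so a URL containing a literal apostrophe would be cut short, but that is rare in practice.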
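As a rough illustration of why the original pattern overshoots while the character class does not, the sketch below runs both on the sample string from the question. The names `greedy` and `bounded` are made up for this example.

```python
import re

# Sample markup from the question (double-quoted href).
region = '\r\n<li><a href="http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations documents</a></li>\r\n</ul>\r\n'

# \S+ keeps consuming until the first whitespace character,
# so it runs past the closing quote and into the link text.
greedy = re.findall(r"https?://\S+", region)

# [^"']* stops at the first quote character,
# which is where the attribute value ends.
bounded = re.findall(r"https?://[^\"']*", region)

print(greedy)
print(bounded)
```

The first list contains the URL with `">Regulations` glued to it, the second only the URL itself.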
Generally using regex to parse HTML yourself is a bad idea and makes life harder for you than it needs to be. Just use a parser and properly access the attributes, e.g. like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(region, "lxml")
link = soup.find('li').find('a')['href'] # finds the first <a> in the first <li> and returns the value of the attribute 'href'
print(link)
> http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf
Or if you have a document with many of those <li> tags:
for li in soup.find_all('li'):
    a = li.find('a')  # None if this <li> contains no <a>
    if a is not None and a.has_attr('href'):
        print(a['href'])
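Since the goal in the question is to keep the links in the JSON output, here is a minimal sketch of the same parser-based idea using only the standard library, in case installing bs4 and lxml is not an option. The class name `LinkCollector` and the `{"url", "text"}` record layout are assumptions made for this example, not something taken from the question.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href and visible text of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished {"url", "text"} records
        self._current = None  # record for the <a> that is currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = {"url": href, "text": ""}

    def handle_data(self, data):
        # Only text that appears inside an open <a> is recorded.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


region = '\r\n<li><a href="http://chechnya.gov.ru/wp-content/uploads/documents/226-8.pdf">Regulations documents</a></li>\r\n</ul>\r\n'

collector = LinkCollector()
collector.feed(region)
collector.close()

result = json.dumps(collector.links)
print(result)
```

Each record keeps the URL together with its link text, so the list can be dropped straight into a larger JSON structure.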