Beautifulsoup issue in finding anchor tag

Time:10-24

I am trying to capture a link in my Python script. I have a variable holding the regex pattern.

I want to capture the link below from the page HTML.

<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>

The code is:

parser = "lxml"
next_regex = r'(.*?)NEXT(.*?)'
html_bodySoup = BeautifulSoup(urllib.request.urlopen(url), parser)
links = html_bodySoup.find_all('a', href=re.compile(next_regex))

I can't work out what the problem is, but the code does not return the link I want. I have also tried other, more precise regex patterns.

CodePudding user response:

You do not need a regex here. You can simply check whether NEXT is in the node text.

You can use

links = html_bodySoup.find_all(lambda x: x.name == 'a' and 'NEXT' in x.text)

Here, we search for any tag whose name is a and whose node text contains NEXT.
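
For context on why the original call returned nothing: a keyword argument such as href= is matched against that attribute's value, not against the link text, and "/department/office/pg2" does not contain NEXT. A minimal sketch of the difference follows. It assumes the built-in html.parser (so it runs without lxml) and uses the string= argument, which is a different technique from the lambda above and only matches when the tag has a single text child.

```python
import re
from bs4 import BeautifulSoup

html = '<p><a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a></p>'
# html.parser is an assumption here; the original post uses lxml.
soup = BeautifulSoup(html, "html.parser")

# href=... tests the attribute value "/department/office/pg2", so nothing matches.
by_href = soup.find_all("a", href=re.compile(r"NEXT"))

# string=... tests the tag's text (its single string child), so the link matches.
by_text = soup.find_all("a", string=re.compile(r"NEXT"))

print(len(by_href), len(by_text))
```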

A Python test:

from bs4 import BeautifulSoup
html = '<p><a href="/department/office/pg2"> NEXT >> </a></p>'
parser = "lxml"
html_bodySoup = BeautifulSoup(html, parser)
html_bodySoup.find_all(lambda x: x.name == 'a' and 'NEXT' in x.text)
# => [<a href="/department/office/pg2"> NEXT &gt;&gt; </a>]

If you want to search for an exact word NEXT, then you can use a regex like this:

import re
html_bodySoup.find_all(lambda x: x.name == 'a' and re.search(r'\bNEXT\b', x.text))
# => [<a href="/department/office/pg2"> NEXT &gt;&gt; </a>]

Here, re.search looks for a match anywhere in the string, and the \bNEXT\b pattern ensures that NEXT is matched only as a whole word (thanks to the word boundaries).
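
The difference between the substring check and the word-boundary regex only shows up when NEXT is embedded in a longer word. A small sketch with made-up sample strings (they are not taken from the page in the question):

```python
import re

# Hypothetical link texts used only for illustration.
samples = [" NEXT >> ", "NEXTPAGE", "GO TO NEXT"]

# Plain substring test: matches NEXT anywhere, even inside a longer word.
substring_hits = [s for s in samples if "NEXT" in s]

# Word-boundary test: NEXT must not be glued to other word characters.
whole_word_hits = [s for s in samples if re.search(r"\bNEXT\b", s)]

print(substring_hits)
print(whole_word_hits)
```

With these samples, only the word-boundary version rejects "NEXTPAGE".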

CodePudding user response:

You can also use the :-soup-contains pseudo-class to target that text. That said, you could probably use just the class (one of the multiple class values). Some options are shown below, with the most descriptive one left uncommented:

from bs4 import BeautifulSoup as bs

html = '''<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'''
soup = bs(html, 'lxml')
# soup.select_one('.pg-bton')
soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')
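
Once the element is selected, the link itself is in its href attribute. A minimal sketch of reading it and resolving it against the page URL: base_url is a hypothetical placeholder, and html.parser is used instead of lxml so the snippet has no extra dependency. The :-soup-contains pseudo-class assumes a reasonably recent Soup Sieve, the selector library that Beautiful Soup uses.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/department/office/pg1"  # hypothetical page URL
html = '<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'
soup = BeautifulSoup(html, "html.parser")

next_link = soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')
# select_one returns None when nothing matches, so guard before indexing.
next_url = urljoin(base_url, next_link["href"]) if next_link else None
print(next_url)
```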