BeautifulSoup issue in finding anchor tag

I am trying to capture a link in my Python script. I have a variable holding the regex pattern.

I want to capture the link below from the page HTML.

<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>

The code is:

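import re
import urllib.request
from bs4 import BeautifulSoup

parser = "lxml"
next_regex = r'(.*?)NEXT(.*?)'
# url is the address of the page being scraped (defined elsewhere)
html_bodySoup = BeautifulSoup(urllib.request.urlopen(url), parser)
links = html_bodySoup.find_all('a', href = re.compile(next_regex))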

I can't tell what the problem is, but this does not return the link I want. I have tried other, more specific regex patterns as well.

CodePudding user response:

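You do not need a regex here. The href= argument matches the pattern against the value of the href attribute (/department/office/pg2), which does not contain NEXT; the word appears only in the link text. You can simply check whether NEXT is in the node text.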

You can use

links = html_bodySoup.find_all(lambda x: x.name=='a' and  'NEXT' in x.text)

Here, we search for any tag whose name is a and whose text contains NEXT.
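
To see why the original call found nothing, here is a minimal sketch comparing a pattern applied to the href attribute with the same pattern applied to the link text. It assumes the anchor markup from the question and uses the built-in "html.parser" only to avoid the lxml dependency; the names by_href and by_text are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Markup taken from the question, wrapped in a paragraph
html = '<p><a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a></p>'
soup = BeautifulSoup(html, "html.parser")

# href= tests the pattern against the attribute value "/department/office/pg2"
by_href = soup.find_all("a", href=re.compile(r"NEXT"))

# string= tests the pattern against the tag's text " NEXT >> "
by_text = soup.find_all("a", string=re.compile(r"NEXT"))

print(len(by_href))  # no match: the attribute value has no "NEXT" in it
print(len(by_text))  # one match: the text does
```

Note that string= only matches a tag that has a single text child, so the lambda with x.text is the more general choice when the link contains nested markup.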

A Python test:

import re
from bs4 import BeautifulSoup
html = '<p><a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a></p>'
parser = "lxml"
html_bodySoup = BeautifulSoup(html, parser)
html_bodySoup.find_all(lambda x: x.name=='a' and 'NEXT' in x.text)
# => [<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT &gt;&gt; </a>]

If you want to match NEXT only as a whole word, you can use a regex like this:

html_bodySoup.find_all(lambda x: x.name=='a' and re.search(r'\bNEXT\b', x.text))
# => [<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT &gt;&gt; </a>]

Here, re.search looks for a match anywhere inside the string, and the \bNEXT\b pattern ensures that NEXT is matched only as a whole word (thanks to the word boundaries).
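
Once the tag is found, the link itself is in its href attribute. The following is a minimal sketch of reading it and resolving it to an absolute URL; base_url is a hypothetical page address (not from the question), and "html.parser" is used only to avoid the lxml dependency.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical address of the page that was scraped
base_url = "https://example.com/department/office/pg1"
html = '<p><a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a></p>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches
link = soup.find(lambda tag: tag.name == "a" and "NEXT" in tag.get_text())

# Resolve the site-relative href against the page address
next_url = urljoin(base_url, link["href"]) if link else None
print(next_url)  # https://example.com/department/office/pg2
```

Guarding against None matters on the last page, where there is presumably no NEXT link.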

CodePudding user response:

You can also use the :-soup-contains pseudo-class to target that text. That said, you could probably use just the class (one of its multiple values). Some options are shown below, with the most descriptive one left uncommented:

from bs4 import BeautifulSoup as bs

html = '''<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'''
soup = bs(html, 'lxml')
# soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')
# soup.select_one('.pg-bton')
soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')
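
As a rough check that these selectors pick out the same element, the sketch below runs each one against the anchor from the question and reads the href from the result. It assumes a soupsieve version that supports :-soup-contains and uses "html.parser" to avoid the lxml dependency; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'
soup = BeautifulSoup(html, "html.parser")

# Most descriptive: class, partial href match, and link text together
by_all = soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')

# Class only: each value of a multi-valued class attribute matches on its own
by_class = soup.select_one(".pg-bton")

# select_one() returns None when nothing matches, so guard before indexing
href = by_all["href"] if by_all is not None else None
print(href)  # /department/office/pg2
```

The class-only selector is shorter but depends on no other element on the page sharing that class.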