I am trying to capture a link in my Python script. I have a variable holding the regex pattern.
I want to capture the link below from the page HTML.
<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>
The code is:
parser = "lxml"
nextpg_regex = r'(.*?)NEXT(.*?)'
html_bodySoup = BeautifulSoup(urllib.request.urlopen(url), parser)
links = html_bodySoup.find_all('a', href = re.compile(nextpg_regex))
I can't figure out what the problem is, but it does not give me the link I want. I also tried other, more precise regex patterns.
CodePudding user response:
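You do not need a regex here. The href=re.compile(...) argument matches the pattern against the href attribute value (/department/office/pg2), which does not contain NEXT. You can simply check whether NEXT is in the node text.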
You can use
links = html_bodySoup.find_all(lambda x: x.name=='a' and 'NEXT' in x.text)
Here, we search for any tag with the name a and NEXT in the node text.
A Python test:
from bs4 import BeautifulSoup
html = '<p><a href="/department/office/pg2"> NEXT >> </a></p>'
parser = "lxml"
html_bodySoup = BeautifulSoup(html, parser)
html_bodySoup.find_all(lambda x: x.name=='a' and 'NEXT' in x.text)
# => [<a href="/department/office/pg2"> NEXT >> </a>]
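For comparison, here is a minimal sketch of why the original href= filter returns nothing while the text check finds the link. It uses the built-in "html.parser" instead of lxml so that it runs without extra packages (that swap is an assumption, not part of the question), and the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = '<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'
soup = BeautifulSoup(html, "html.parser")  # assumption: built-in parser instead of lxml

pattern = re.compile(r"NEXT")

# href=... tests the pattern against the attribute value
# "/department/office/pg2", which does not contain NEXT.
by_href = soup.find_all("a", href=pattern)

# A function filter can look at the visible link text instead.
by_text = soup.find_all(lambda tag: tag.name == "a" and "NEXT" in tag.text)

print(by_href)                        # []
print([a["href"] for a in by_text])   # ['/department/office/pg2']
```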
If you want to search for the exact word NEXT, then you can use a regex like this:
html_bodySoup.find_all(lambda x: x.name=='a' and re.search(r'\bNEXT\b', x.text))
# => [<a href="/department/office/pg2"> NEXT >> </a>]
where re.search searches for a match anywhere inside a string, and the \bNEXT\b pattern makes sure the NEXT it finds is a whole word (thanks to word boundaries).
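To illustrate what the word boundaries change, here is a small sketch that only uses the re module. The sample strings are made up for the illustration and do not come from the page in the question.

```python
import re

word = re.compile(r"\bNEXT\b")

# Hypothetical link texts, for illustration only.
samples = [" NEXT >> ", "NEXTPAGE", "GO TO NEXT", "next"]

# Keep only the strings where NEXT appears as a whole, upper-case word.
matches = [s for s in samples if word.search(s)]
print(matches)  # [' NEXT >> ', 'GO TO NEXT']
```

A plain 'NEXT' in x.text check would also accept "NEXTPAGE", which is the case the boundaries rule out.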
CodePudding user response:
You can also use the :-soup-contains pseudo-class to target that text. That said, it looks like you could probably use just the class (one of the multiple class values). Some options are shown below, with the most descriptive one not commented out:
from bs4 import BeautifulSoup as bs
html = '''<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'''
soup = bs(html, 'lxml')
# soup.select_one('a:-soup-contains("NEXT")')
# soup.select_one('.pg-bton')
soup.select_one('.pg-bton[href*=department]:-soup-contains("NEXT")')
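Since the goal is the link itself, a possible follow-up is to read the href and resolve it against the page URL. This is only a sketch: base_url is a hypothetical placeholder (the real URL is not given in the question), and "html.parser" is used so that the snippet runs without lxml.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page URL; the question does not state the real one.
base_url = "https://example.com/department/office/pg1"
html = '<a class="pg-normal pg-bton" href="/department/office/pg2"> NEXT >> </a>'
soup = BeautifulSoup(html, "html.parser")

# select_one returns None when nothing matches, so guard before indexing.
link = soup.select_one('a.pg-bton:-soup-contains("NEXT")')
next_url = urljoin(base_url, link["href"]) if link is not None else None
print(next_url)  # https://example.com/department/office/pg2
```

The None check matters on the last page of a listing, where a NEXT link is presumably absent.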