Using BeautifulSoup, it is possible to do the following:
for heading in soup.find_all('h1'):
    print(heading.text)
Top Rated Movies
However, is there a method to extract the tags themselves, given the text? A way of working backwards from the above example, something like:
soup.find_tag('Top Rated Movies')
h1
CodePudding user response:
There are quite a few ways.
If you know the exact text, you can use the string argument (called text in older versions of Beautiful Soup), like
tags = soup.find_all(True, string="Top Rated Movies")
For more ways to use the string argument, such as when you only know part of the text, you can check out the relevant section of the documentation.
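As a rough sketch of the difference between an exact and a partial string match, here is a small self-contained example. The markup is made up for illustration (it is not the asker's page), and the partial match uses a compiled regular expression, which is one of the filter types the string argument accepts.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original page.
html = "<html><body><h1>Top Rated Movies</h1><p>Top Rated Movies of 2020</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Exact match: only tags whose .string equals the text are returned.
exact_names = [t.name for t in soup.find_all(True, string="Top Rated Movies")]

# Partial match: a compiled pattern is tested against each tag's .string.
partial_names = [t.name for t in soup.find_all(True, string=re.compile("Top Rated"))]

print(exact_names)
print(partial_names)
```

Note that the match is made against each tag's .string, so a tag that wraps several children (like body here) is not returned.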
Also, for partial text, you can use a lambda
tags = soup.find_all(lambda x: x.name and x.text and 'Top Rated Movies' in x.text)
or, using select and the :-soup-contains selector
tags = soup.select(':-soup-contains("Top Rated Movies")')
Of course, with the partial-match methods, you'll end up getting the parent tags as well, but you can filter them out by keeping only the tags that have no other matching tag nested inside them:
tags = [t for t in tags if not [d for d in t.find_all(True) if d in tags]]
or, if you're sure the target tag doesn't have any more tags nested within it,
tags = [t for t in tags if t.find() is None]
although, in that case, you could have used the :-soup-contains-own selector and not needed to filter.
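(Please note that pseudo-classes like :-soup-contains come from the soupsieve package that select relies on, so they need a reasonably recent soupsieve release, 2.1 or later as far as I know; older releases spelled it :contains. They should not require a specific parser such as html5lib.)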
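To make the parent-tag issue concrete, here is a minimal sketch on made-up markup. It assumes a soupsieve version that supports the :-soup-contains and :-soup-contains-own selectors, uses the built-in html.parser, and compares tags by identity when filtering; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original page.
html = "<html><body><div><h1>Top Rated Movies</h1></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains also matches every ancestor, since their text includes the heading's text.
all_matches = soup.select(':-soup-contains("Top Rated Movies")')
all_names = [t.name for t in all_matches]

# Keep only the innermost matches: tags with no matching tag nested inside them.
innermost = [
    t for t in all_matches
    if not any(d is m for d in t.find_all(True) for m in all_matches)
]
innermost_names = [t.name for t in innermost]

# :-soup-contains-own only looks at text sitting directly inside the tag itself.
own_names = [t.name for t in soup.select(':-soup-contains-own("Top Rated Movies")')]

print(all_names)
print(innermost_names)
print(own_names)
```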
If you want the tag name specifically, follow any of the above with:
for t in tags: print(t.name)
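If you want something close to the find_tag call imagined in the question, one option is to wrap the exact-match lookup in a small helper. The name find_tag_names and the sample markup below are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def find_tag_names(soup, text):
    """Return the names of all tags whose .string is exactly `text`."""
    # Hypothetical helper: BeautifulSoup itself has no find_tag method.
    return [tag.name for tag in soup.find_all(True, string=text)]

# Hypothetical sample markup, not taken from the original page.
html = "<div><h1>Top Rated Movies</h1><p>Other text</p></div>"
soup = BeautifulSoup(html, "html.parser")

names = find_tag_names(soup, "Top Rated Movies")
print(names)
```

It returns a list because the same text can appear in more than one tag.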
CodePudding user response:
# find_all and string are the current spellings of the older findAll and text
for tag in soup.find_all(True, string="Top Rated Movies"):
    print(tag.name)