beautifulsoup - obtaining the tag and its parent from the text inside?

Using beautifulsoup it is possible to do the following:

for heading in soup.find_all('h1'):
   print(heading.text)

Top Rated Movies

However, is there a method to extract the tags themselves, given the text? A way of working backwards from the above example, something like:

soup.find_tag('Top Rated Movies')

h1

CodePudding user response：

There are quite a few ways.

If you know the exact text, you can use the string argument (called text in older versions), like

tags = soup.find_all(True, string="Top Rated Movies")

For more ways to use the string argument, such as when you only know part of the text, check the relevant section of the documentation.
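As a rough sketch of the string argument, here is an exact match next to a partial match using a compiled regex. The HTML snippet is made up for illustration, and html.parser is just an assumed parser choice.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, only for demonstration.
html = "<div><h1>Top Rated Movies</h1><p>Top Rated Shows</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Exact match: tags whose .string equals the given text.
exact = soup.find_all(True, string="Top Rated Movies")
print([t.name for t in exact])    # ['h1']

# Partial match: the regex is searched against each tag's .string.
partial = soup.find_all(True, string=re.compile("Top Rated"))
print([t.name for t in partial])  # ['h1', 'p']
```

The div is not matched in either case because it has more than one child, so its .string is None.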

For partial text, you can also pass a lambda to find_all

tags = soup.find_all(lambda x: x.name and x.text and 'Top Rated Movies' in x.text)

or, using select and the :-soup-contains pseudo-class

tags = soup.select(':-soup-contains("Top Rated Movies")')

Of course, with the partial-match methods, you'll end up getting the parent tags as well, but you can filter them out with

tags = [t for t in tags if not any(c in tags for c in t.find_all(True))]
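To make the goal concrete, here is a small sketch of a partial match that also returns the ancestors, followed by one way to keep only the innermost tags. The markup and the variable names (matches, innermost) are assumptions for the example, and the identity comparison is my own choice to avoid treating two look-alike tags as the same one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for demonstration.
html = "<html><body><div><h1>Top Rated Movies</h1></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Partial match on the combined text: every ancestor of the h1 matches too.
matches = soup.find_all(lambda tag: "Top Rated Movies" in tag.get_text())
print([t.name for t in matches])    # ['html', 'body', 'div', 'h1']

# Keep a tag only if none of its descendant tags is itself a match.
innermost = [
    t for t in matches
    if not any(d is m for d in t.find_all(True) for m in matches)
]
print([t.name for t in innermost])  # ['h1']
```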

or, if you're sure the target tag doesn't have any more tags nested within it,

tags = [t for t in tags if t.find() is None]

although, in that case, you could have used the :-soup-contains-own selector and not needed to filter.

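(Please note that pseudo-classes like :-soup-contains come from the soupsieve package that select relies on, and need soupsieve 2.1 or later; older versions only know the :contains spelling. If they still don't work for you, trying another parser such as html5lib may help.)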
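A quick comparison of the two selectors might look like the sketch below. It assumes a soupsieve version that knows the -soup- prefixed names and uses html.parser on made-up markup; swap in another parser if that does not work in your setup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for demonstration.
html = "<html><body><div><h1>Top Rated Movies</h1></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# :-soup-contains looks at the text of the element and all its descendants,
# so the ancestors of the h1 are selected as well.
with_ancestors = soup.select(':-soup-contains("Top Rated Movies")')
print([t.name for t in with_ancestors])

# :-soup-contains-own only looks at text nodes directly inside the element.
own_text_only = soup.select(':-soup-contains-own("Top Rated Movies")')
print([t.name for t in own_text_only])  # ['h1']
```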

If you want the tag name specifically, follow any of the above with:

for t in tags: print(t.name)

CodePudding user response：

for tag in soup.find_all(True, string="Top Rated Movies"):
    print(tag.name)