BeautifulSoup get text from tag searching by Title

Time:11-14

I'm scraping a webpage with Python that provides different documents, and I want to retrieve some information from them. The documents present the information in two ways. In the first, it appears inline as "Company name: Company name", which is solved in this question. In the second, a label such as "Title:" is followed by the text in a separate block. Here's an example of this second HTML layout, starting at the div with class DocumentBody:

<div >
   <div >...</div>
   <div >...</div>
   <div >...</div>
   <div >...</div>
   <div >
      <p >...</p>
      <div >...</div>
      <div >
         <span >...</span>
         <span >Denomination:</span>
         <div >
            <p > </p>
            <p>Information about denomination</p>
            <p></p>
         </div>
      </div>
   </div>
</div>

At first I tried hardcoding the XPath to the text, but the HTML varies from one document to another, so the same path does not always point at the same element.

This is an example of what I made to get the denomination:

from lxml import etree

class LTED:
   def __init__(self, url, soup):
      # get_soup_from_url is my own helper (not shown here)
      if not soup:
         soup = get_soup_from_url(url, "html.parser")
      # build the lxml tree whether or not a soup was passed in
      dom = etree.HTML(str(soup))

      # True when the document is an update (corrigendum) and not a new one
      self.corrigenda = bool(soup.body.findAll(text="Corrigenda"))

      self.denomination = self.get_denomination(dom)

   def get_denomination(self, dom):
      # hardcoded positions: in a corrigendum the target block is div[7], otherwise div[6]
      if self.corrigenda:
         item = dom.xpath("//div[@class='DocumentBody']/div[7]/div[2]/div/p[2]")[0].text
      else:
         item = dom.xpath("//div[@class='DocumentBody']/div[6]/div[2]/div/p[2]")[0].text
      return item

As the XPath is hardcoded, this works most of the time, but in some cases it returns a different text because the HTML structure has changed.

How should I retrieve the text in this case? Is there any way to get Information about denomination searching by Denomination?

In case you want to check the webpage, here's a link to an example I'm trying to scrape

CodePudding user response:

The linked page does not contain a Denomination field, but you can adapt the approach and proceed like this:

for e in soup.select('span:-soup-contains("Title:") + div'):
    print(e.get_text(strip=True))
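To make the selector concrete, here is a minimal sketch that runs it against a small inline snippet shaped like the HTML in the question. The snippet, the `text_after_label` helper name, and the placement of the `DocumentBody` class are assumptions for illustration; the real page may nest things differently. The `+` combinator selects the div that is the next sibling element of the matching span.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the structure shown in the question.
html = """
<div class="DocumentBody">
  <div>
    <span>...</span>
    <span>Denomination:</span>
    <div>
      <p> </p>
      <p>Information about denomination</p>
      <p></p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_after_label(soup, label):
    """Return the text of the <div> right after a <span> containing `label`.

    Illustrative helper (not part of BeautifulSoup); returns None when
    no such span/div pair exists.
    """
    node = soup.select_one(f'span:-soup-contains("{label}") + div')
    return node.get_text(strip=True) if node else None


denomination = text_after_label(soup, "Denomination:")
print(denomination)
```

Because the lookup is keyed on the label text and not on the position of the block, it should keep working when extra divs appear before it, as long as the span is still directly followed by the div.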

In new code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
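As a rough illustration of that advice, the "Corrigenda" check from the question could be written with `find_all(string=...)` or with a CSS selector. The inline HTML here is a made-up stand-in, and the `has_corrigenda` / `via_css` names are only for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a document that is marked as a corrigendum.
html = "<html><body><p>Corrigenda</p><p>Other text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all(string=...) matches text nodes that are exactly "Corrigenda"
matches = soup.body.find_all(string="Corrigenda")
has_corrigenda = bool(matches)

# CSS alternative: any <p> whose text contains "Corrigenda"
via_css = soup.select_one('p:-soup-contains("Corrigenda")') is not None

print(has_corrigenda, via_css)
```

Note that the two checks are not equivalent: `string="Corrigenda"` needs an exact match on a text node, while `:-soup-contains` matches a substring anywhere inside the element.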

Example

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://ted.europa.eu/udl?uri=TED:NOTICE:628602-2022:TEXT:EN:HTML&tabId=0',
                                  headers = {'User-Agent': 'Mozilla/5.0'}).text)

for e in soup.select('div.txtmark:-soup-contains("Official name:")'):
    print(e.next.split(':')[-1].strip())

for e in soup.select('span:-soup-contains("Title:") + div'):
    print(e.get_text(strip=True))

Output

KfW Bankengruppe
Vergabekammer Bund
Management of the PtX-Fund by the Power-to-X D&G GmbH (in formation)
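Since the question already builds an lxml tree, the same label-based lookup can also be sketched in XPath without hardcoding positions. This is only an illustration under assumptions: the inline HTML imitates the structure from the question, and `xpath_text_after_label` is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical snippet mirroring the structure shown in the question.
html = """
<div class="DocumentBody">
  <div>
    <span>...</span>
    <span>Denomination:</span>
    <div>
      <p> </p>
      <p>Information about denomination</p>
      <p></p>
    </div>
  </div>
</div>
"""

dom = etree.HTML(html)


def xpath_text_after_label(dom, label):
    """Join the non-empty <p> texts of the first <div> following the label span.

    Illustrative helper; returns None when the label is not found.
    """
    paragraphs = dom.xpath(
        "//span[contains(normalize-space(.), $label)]"
        "/following-sibling::div[1]//p",
        label=label,  # lxml passes keyword arguments as XPath variables
    )
    texts = [p.text.strip() for p in paragraphs if p.text and p.text.strip()]
    return " ".join(texts) or None


result = xpath_text_after_label(dom, "Denomination:")
print(result)
```

The `following-sibling::div[1]` step plays the same role as the `+ div` combinator in the CSS version, so the path no longer depends on how many divs precede the block.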