BeautifulSoup: extract desired information into a custom-made list from different HTML snippets

Time:11-06

Apologies in advance for such a long and basic question!

Given the following three HTML snippets, which are parts of a larger page:

<html>
 <body>
  <span _ngcontent-ont-c199="" >
   <span _ngcontent-ont-c199="" >
    <span _ngcontent-ont-c199="" translate="">
     nro
    </span>
    4 A.
   </span>
   <!-- -->
   <span _ngcontent-ont-c199="" >
    6.12.1939
   </span>
   <!-- -->
  </span>
  <span _ngcontent-ont-c199="" >
   , JR 10
  </span>
  <!-- -->
  <!-- -->
  <span _ngcontent-ont-c199="" >
   :
   <span _ngcontent-ont-c199="" translate="">
    sivu
   </span>
   1
  </span>
 </body>
</html>

and

<html>
 <body>
  <span _ngcontent-evu-c199="" >
   <!-- -->
   <span _ngcontent-evu-c199="" >
    1905
   </span>
   <!-- -->
  </span>
  <span _ngcontent-evu-c199="" >
   , Aksel Paul
  </span>
  <!-- -->
  <span _ngcontent-evu-c199="" >
   , Helsinki
  </span>
  <!-- -->
  <span _ngcontent-evu-c199="" >
   :
   <span _ngcontent-evu-c199="" translate="">
    page
   </span>
   63
  </span>
 </body>
</html>

and

<html>
 <body>
  <span _ngcontent-ejj-c199="" >
   22
  </span>
  <span _ngcontent-dna-c199="" >
   <span _ngcontent-dna-c199="" >
    <span _ngcontent-dna-c199="" translate="">
     nro
    </span>
    12 ZZ
   </span>
   <span _ngcontent-dna-c199="" >
    10.2016
   </span>
  </span>
  <span _ngcontent-ejj-c199="" >
   , Arbetarförlaget Ab
  </span>
  <!-- -->
  <span _ngcontent-ejj-c199="" >
   , Stockholm
  </span>
  <!-- -->
  <span _ngcontent-ejj-c199="" >
   :
   <span _ngcontent-ejj-c199="" translate="">
    sida
   </span>
   20
  </span>
 </body>
</html>

I would like to extract six pieces of information (each None if unavailable) into a list ordered as follows:

desired_list = ["badge", "issue", "date", "publisher", "city", "page"]

So I have the following code (inefficient, using a for loop):

from bs4 import BeautifulSoup

desired_list = [None]*6 # initialize with [None, None, None, None, None, None]

soup = BeautifulSoup(html, "lxml") # html is one of html_1, html_2, html_3

fwb = soup.find("span", class_="font-weight-bold")
issue_date = fwb.select("span.ng-star-inserted") # a list of up to 2 elements: ['nro XX extension', 'DD.MM.YYYY']

for el in issue_date:
    element = el.text.split()
    if "nro" in element:
        desired_list[1] = " ".join(element) # handling issue: nro XX extension
    else:
        desired_list[2] = " ".join(element) # handling date: DD.MM.YYYY

badge = soup.find("span", class_="badge badge-secondary ng-star-inserted")
if badge:
    desired_list[0] = " ".join(badge.text.split()) # handling badge

Currently, I can only extract the first three components of my desired_list, namely badge, issue, and date:

[None, 'nro 4 A.', '6.12.1939', None, None, None] # html_1
[None, None, '1905', None, None, None]            # html_2
['22', 'nro 12 ZZ', '10.2016', None, None, None]  # html_3

Whereas, my desired list for each html should look like this:

[None, 'nro 4 A.', '6.12.1939', 'JR 10', None, 'sivu 1']                     # html_1
[None, None, '1905', 'Aksel Paul', 'Helsinki', 'page 63']                    # html_2
['22', 'nro 12 ZZ', '10.2016', 'Arbetarförlaget Ab', 'Stockholm', 'sida 20'] # html_3

And I don't know how to adapt my code to retrieve all six fields from the aforementioned HTML snippets, since their structure varies and some information may be missing. I would really appreciate it if someone could recommend a smarter, more efficient way of handling this.

I am aware of soup.find_all("span", class_="ng-star-inserted"). However, the problem is that find_all does not always return a list of length 6 to enumerate!

Cheers,

CodePudding user response:

I am not providing a full solution, but here is a clue. If the structure is fixed but you have no other way to distinguish the elements, I suggest using soup.find_all() to extract all items with class "ng-star-inserted" and then adding them to the list in the desired order.

You can print them all like this:

    result = soup.find_all("span", "ng-star-inserted")
    for position, element in enumerate(result):
        print(position, element)

That will iterate over tuples of (position, element) in the list. Then you need to figure out the location of the desired element, and put it into the appropriate position, for example:

    if position == you_need_to_figure_this_out:
        desired_list[desired_position] = element

If you can't map positions to elements, you need some heuristics to guess them.

"page" is easy if el.text.isdigit() then it is a "page"

"publisher" and "city" is harder, but it looks like there is always a "publisher", so the first string (without elements that are already processed in your question's code) will be a "publisher", and the second non-processed (if it exists) is a "city"

CodePudding user response:

You might just define a list of selectors corresponding to desired_list:

selectors = [
    'span.badge.badge-secondary.ng-star-inserted',                                             #badge 
    'span.font-weight-bold span.ng-star-inserted:has(span[translate])',                        #issue 
    'span.font-weight-bold span.ng-star-inserted:last-child',                                  #date 
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")',                      #publisher 
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")   span.ng-star-inserted:-soup-contains(", ")', #city 
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(":")',                       #page 
]

and then

desired_list = [None if s[0] is None else (
    s[0].get_text(' ', strip=True)[2:] if '-soup-contains' in s[1] 
    else s[0].get_text(' ', strip=True)
) for s in [(soup.select_one(sel), sel) for sel in selectors]]

should give you what you're looking for.

Please note that you might need to install and use the html5lib parser (which can be a bit slower than lxml):

soup = BeautifulSoup(html, "html5lib") # html_1, html_2, html_3

otherwise, pseudo-classes like :has and :-soup-contains might raise errors or just return nothing. [Although, after installing html5lib, I noticed the selectors started working no matter which parser I used, including lxml... but I did need to install html5lib first.]
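For illustration, here is the whole approach run on a reconstructed version of the third snippet. The class attributes are assumptions (the posted HTML omits them), and html.parser stands in here for html5lib/lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction of the third snippet with assumed classes.
html_3 = """
<span class="badge badge-secondary ng-star-inserted">22</span>
<span class="font-weight-bold">
  <span class="ng-star-inserted"><span translate="">nro</span> 12 ZZ</span>
  <span class="ng-star-inserted">10.2016</span>
</span>
<span class="ng-star-inserted">, Arbetarförlaget Ab</span>
<span class="ng-star-inserted">, Stockholm</span>
<span class="ng-star-inserted">: <span translate="">sida</span> 20</span>
"""

selectors = [
    'span.badge.badge-secondary.ng-star-inserted',                        # badge
    'span.font-weight-bold span.ng-star-inserted:has(span[translate])',   # issue
    'span.font-weight-bold span.ng-star-inserted:last-child',             # date
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")', # publisher
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ") ~ span.ng-star-inserted:-soup-contains(", ")',  # city
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(":")',  # page
]

soup = BeautifulSoup(html_3, "html.parser")

# Same logic as the comprehension above, unrolled into a loop:
# selectors matched via :-soup-contains carry a leading ", " or ": "
# separator, which is dropped with [2:].
desired_list = []
for sel in selectors:
    el = soup.select_one(sel)
    if el is None:
        desired_list.append(None)
    elif "-soup-contains" in sel:
        desired_list.append(el.get_text(" ", strip=True)[2:])
    else:
        desired_list.append(el.get_text(" ", strip=True))

print(desired_list)
```

Because select_one returns None for an absent field, the missing slots come out as None automatically, which is exactly the behavior the question asks for.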
