Apologies in advance for such a long and basic question!
Given the following three html snippets, which are each part of a bigger page:
<html>
<body>
<span _ngcontent-ont-c199="" >
<span _ngcontent-ont-c199="" >
<span _ngcontent-ont-c199="" translate="">
nro
</span>
4 A.
</span>
<!-- -->
<span _ngcontent-ont-c199="" >
6.12.1939
</span>
<!-- -->
</span>
<span _ngcontent-ont-c199="" >
, JR 10
</span>
<!-- -->
<!-- -->
<span _ngcontent-ont-c199="" >
:
<span _ngcontent-ont-c199="" translate="">
sivu
</span>
1
</span>
</body>
</html>
and
<html>
<body>
<span _ngcontent-evu-c199="" >
<!-- -->
<span _ngcontent-evu-c199="" >
1905
</span>
<!-- -->
</span>
<span _ngcontent-evu-c199="" >
, Aksel Paul
</span>
<!-- -->
<span _ngcontent-evu-c199="" >
, Helsinki
</span>
<!-- -->
<span _ngcontent-evu-c199="" >
:
<span _ngcontent-evu-c199="" translate="">
page
</span>
63
</span>
</body>
</html>
and
<html>
<body>
<span _ngcontent-ejj-c199="" >
22
</span>
<span _ngcontent-dna-c199="" >
<span _ngcontent-dna-c199="" >
<span _ngcontent-dna-c199="" translate="">
nro
</span>
12 ZZ
</span>
<span _ngcontent-dna-c199="" >
10.2016
</span>
</span>
<span _ngcontent-ejj-c199="" >
, Arbetarförlaget Ab
</span>
<!-- -->
<span _ngcontent-ejj-c199="" >
, Stockholm
</span>
<!-- -->
<span _ngcontent-ejj-c199="" >
:
<span _ngcontent-ejj-c199="" translate="">
sida
</span>
20
</span>
</body>
</html>
I would like to extract 6 different pieces of information (or None where a piece is not available), matching a desired list that looks as follows:
desired_list = ["badge", "issue", "date", "publisher", "city", "page"]
So I have the following code (very inefficient, using a for loop):
from bs4 import BeautifulSoup

desired_list = [None]*6  # initialize with [None, None, None, None, None, None]
soup = BeautifulSoup(html, "lxml")  # html_1, html_2, html_3
fwb = soup.find("span", class_="font-weight-bold")
issue_date = fwb.select("span.ng-star-inserted")  # a list like ['nro XX extension', 'DD.MM.YYYY'] when both are present
for el in issue_date:
    element = el.text.split()
    if "nro" in element:
        desired_list[1] = " ".join(element)  # handling issue: nro XX extension
    else:
        desired_list[2] = " ".join(element)  # handling date: DD.MM.YYYY
badge = soup.find("span", class_="badge badge-secondary ng-star-inserted")
if badge:
    desired_list[0] = " ".join(badge.text.split())  # handling badge
Currently, I can only extract the first three components of my desired_list, namely badge, issue, and date:
[None, 'nro 4 A.', '6.12.1939', None, None, None] # html_1
[None, None, '1905', None, None, None] # html_2
['22', 'nro 12 ZZ', '10.2016', None, None, None] # html_3
Whereas my desired list for each html snippet should look like this:
[None, 'nro 4 A.', '6.12.1939', 'JR 10', None, 'sivu 1'] # html_1
[None, None, '1905', 'Aksel Paul', 'Helsinki', 'page 63'] # html_2
['22', 'nro 12 ZZ', '10.2016', 'Arbetarförlaget Ab', 'Stockholm', 'sida 20'] # html_3
And I don't know how to adapt my code to retrieve all 6 fields from the aforementioned html snippets, since the occurrence of each piece is quite unpredictable: some information may simply be missing. I would really appreciate it if someone could recommend a smarter and more efficient way of handling this.
I am aware of soup.find_all("span", class_="ng-star-inserted"). However, the problem is that find_all does not always return a list of length 6 to enumerate!
Cheers,
CodePudding user response:
I am not providing a full solution, but here is a clue. You have a fairly fixed structure but no direct way to distinguish between the elements, so I suggest using soup.find_all() to extract all items with class "ng-star-inserted" and then adding them to the list in the desired order.
You can print them all like this:
result = soup.find_all("span", "ng-star-inserted")
for position, element in enumerate(result):
    print(position, element)
That will iterate over tuples of (position, element) in the list. Then you need to figure out the location of the desired element, and put it into the appropriate position, for example:
if position == you_need_to_figure_this_out:
    desired_list[desired_position] = element
If you can't map positions to elements, you need some heuristics to guess them.
"page" is easy if el.text.isdigit()
then it is a "page"
"publisher" and "city" is harder, but it looks like there is always a "publisher", so the first string (without elements that are already processed in your question's code) will be a "publisher", and the second non-processed (if it exists) is a "city"
CodePudding user response:
You might just define a list of selectors corresponding to desired_list:
selectors = [
    'span.badge.badge-secondary.ng-star-inserted', #badge
    'span.font-weight-bold span.ng-star-inserted:has(span[translate])', #issue
    'span.font-weight-bold span.ng-star-inserted:last-child', #date
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ")', #publisher
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(", ") ~ span.ng-star-inserted:-soup-contains(", ")', #city
    'span.font-weight-bold ~ span.ng-star-inserted:-soup-contains(":")', #page
]
and then
desired_list = [None if s[0] is None else (
    s[0].get_text(' ', strip=True)[2:]   # [2:] drops the leading ", " or ": "
    if '-soup-contains' in s[1]
    else s[0].get_text(' ', strip=True)
) for s in [(soup.select_one(sel), sel) for sel in selectors]]
should give you what you're looking for.
Please note that you might need to install and use the html5lib parser (which can be a bit slower than lxml):
soup = BeautifulSoup(html, "html5lib")  # html_1, html_2, html_3
otherwise, pseudo-classes like :has and :-soup-contains might raise errors or just return nothing. [Although, after installing html5lib, I noticed the selectors started working no matter which parser I used, including lxml... but I did need to install html5lib first.]
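For completeness, a small driver over the three snippets could look like this (assuming html_1, html_2 and html_3 hold the markup from the question and selectors is the list defined above; swap "html5lib" for "lxml" if your setup turns out not to need it):

from bs4 import BeautifulSoup

for html in (html_1, html_2, html_3):
    soup = BeautifulSoup(html, "html5lib")
    desired_list = [None if s[0] is None else (
        s[0].get_text(' ', strip=True)[2:]   # strip the leading ", " or ": "
        if '-soup-contains' in s[1]
        else s[0].get_text(' ', strip=True)
    ) for s in [(soup.select_one(sel), sel) for sel in selectors]]
    print(desired_list)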