Situation
I'm trying to scrape the nested unordered list of 3 "Market drivers" items from this HTML:
<li>Drivers, Challenges, and Trends
<ul>
<li>Market drivers
<ul>
<li>Improvement in girth gear manufacturing technologies</li>
<li>Expansion and installation of new cement plants</li>
<li>Augmented demand from APAC</li>
</ul>
</li>
<li>Market challenges
<ul>
<li>Increased demand for refurbished girth gear segments</li>
Issue #1:
The "Market drivers" list I'm looking for doesn't have any attributes, like a class name or id, so I need to go by the text / string within it. All the tutorials show how to find elements using classes, ids, etc.
Issue #2:
The children, i.e. the list items, happen to be 3 on this page, but on other similar pages there may be 0, 4, 7 or some other number of them. So I'm looking to get all the children irrespective of how many there are (or none). I've found something on getting children using recursive=False, and also another instruction saying not to use findChildren after BS2.
Issue #3:
I tried using find_all_next, but the tutorials don't tell me how to find the next elements only up to a defined point; it's always about getting ALL of the next ones. I could potentially use find_all_next if it had some kind of "stop at" or "until you find" option.
The following code shows my attempt (it doesn't work):
import requests
from bs4 import BeautifulSoup
url = 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-DNA-Microarray-30162580/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
toc = soup.find("div", id="toc")
drivers = toc.find(string="Market drivers").findAll("li", recursive=False).text
print(drivers)
Answer:
Since there is no example of the expected output, I would recommend the following approach, which requires Beautiful Soup 4.7.0 or later.
How to select?
To select an element by its own text and extract the text of all the <li> elements nested under it, you can use CSS selectors and a list comprehension:
[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
or in a for loop:
data = []
for x in toc.select('li:-soup-contains-own("Market drivers") li'):
data.append(x.get_text(strip=True))
print(data)
Output:
['Improvement in girth gear manufacturing technologies', 'Expansion and installation of new cement plants', 'Augmented demand from APAC']
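To try this without hitting the network, here is a minimal sketch that runs against the HTML fragment from the question. The closing tags at the end of the fragment are my assumption, since the snippet in the question is cut off, and the helper names child_items and child_items_no_css are made up for illustration. The first helper narrows the selector with the child combinator (> ul > li), so only the direct list items are returned rather than every nested <li>. The second helper is a possible alternative without CSS selectors: it looks up the text node, moves to its parent <li> and reads the direct children with recursive=False. As far as I know, the :-soup-contains-own pseudo-class needs Soup Sieve 2.1 or newer, so treat that version as an assumption.

```python
from bs4 import BeautifulSoup

# HTML fragment from the question; the closing tags after the
# "Market challenges" item are assumed, the original snippet is truncated.
html = """
<ul>
<li>Drivers, Challenges, and Trends
<ul>
<li>Market drivers
<ul>
<li>Improvement in girth gear manufacturing technologies</li>
<li>Expansion and installation of new cement plants</li>
<li>Augmented demand from APAC</li>
</ul>
</li>
<li>Market challenges
<ul>
<li>Increased demand for refurbished girth gear segments</li>
</ul>
</li>
</ul>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def child_items(root, heading):
    """Hypothetical helper: direct <li> children of the list under `heading`."""
    # "> ul > li" keeps only direct children of the matched <li>'s own <ul>
    selector = f'li:-soup-contains-own("{heading}") > ul > li'
    return [li.get_text(strip=True) for li in root.select(selector)]


def child_items_no_css(root, heading):
    """Hypothetical alternative without CSS selectors."""
    # The text node carries a trailing newline, so compare the stripped text
    text_node = root.find(string=lambda s: s is not None and s.strip() == heading)
    if text_node is None:
        return []
    # The parent of the text node is the <li>; its nested <ul> is a direct child
    nested = text_node.parent.find("ul", recursive=False)
    if nested is None:
        return []
    return [li.get_text(strip=True) for li in nested.find_all("li", recursive=False)]


drivers = child_items(soup, "Market drivers")
challenges = child_items_no_css(soup, "Market challenges")
missing = child_items(soup, "Market trends")  # heading not present in the fragment

print(drivers)
print(challenges)
print(missing)
```

Both helpers return an empty list when the heading is missing or has no nested items, which covers the "0 children" case from Issue #2.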