How to get children of a list with no attributes with BeautifulSoup?

Time:11-26

Situation

I'm trying to scrape the nested unordered list of 3 items under "Market drivers" from this HTML:

   <li>Drivers, Challenges, and Trends
    <ul>
     <li>Market drivers
      <ul>
       <li>Improvement in girth gear manufacturing technologies</li>
       <li>Expansion and installation of new cement plants</li>
       <li>Augmented demand from APAC</li>
      </ul>
     </li>
   <li>Market challenges
    <ul>
     <li>Increased demand for refurbished girth gear segments</li>

Issue #1:

The "Market drivers" list I'm looking for doesn't have any attributes, such as a class name or id, so I can only go by the text / string inside it. All the tutorials show how to find elements by class, id, etc.

Issue #2:

The children, i.e. the list items, happen to number 3 on this page, but on other similar pages there may be 0, 4, 7, or some other number. So I want to get all the children, however many there are (including none). I've found some material on getting children using recursive=False, and also some other advice saying not to use findChildren after BS2.

Issue #3:

I tried using find_all_next, but the tutorials don't explain how to stop at a defined point; they are always about getting ALL following elements. I could potentially use find_all_next if it had some kind of "stop at" or "until you find" option.

The following code shows my attempt (it doesn't work):

import requests
from bs4 import BeautifulSoup

url = 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-DNA-Microarray-30162580/'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

toc = soup.find("div", id="toc")
drivers = toc.find(string="Market drivers").findAll("li", recursive=False).text

print(drivers)

CodePudding user response:

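While there is no example of the expected output, I would recommend the following approach. It requires Beautiful Soup 4.7.0 or later, which delegates CSS selectors to the soupsieve package; as far as I know, the :-soup-contains-own pseudo-class also needs soupsieve 2.1 or later.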

How to select?

To select an element by its own text and extract the text of all the <li> elements nested under it, you can use CSS selectors and a list comprehension:

[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]

or in a for loop:

data = []

for x in toc.select('li:-soup-contains-own("Market drivers") li'):
    data.append(x.get_text(strip=True))  

print(data)  

Output:

['Improvement in girth gear manufacturing technologies', 'Expansion and installation of new cement plants', 'Augmented demand from APAC']
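
If you would rather avoid the CSS pseudo-class, the same result can probably be reached with plain navigation. The sketch below is only an illustration under a few assumptions: the HTML is a trimmed copy of the snippet from the question with the missing closing tags and a div with id="toc" filled in by me, and child_items is a hypothetical helper name, not part of Beautiful Soup. It matches the text node with a regex that tolerates surrounding whitespace (an exact string="Market drivers" match likely returns None here because the text node carries a trailing newline), climbs to the enclosing <li>, and then takes only the direct <li> children of its <ul> with recursive=False.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question; the closing tags and the
# <div id="toc"> wrapper are assumptions added so the snippet parses cleanly.
html = """
<div id="toc">
 <ul>
  <li>Drivers, Challenges, and Trends
   <ul>
    <li>Market drivers
     <ul>
      <li>Improvement in girth gear manufacturing technologies</li>
      <li>Expansion and installation of new cement plants</li>
      <li>Augmented demand from APAC</li>
     </ul>
    </li>
    <li>Market challenges
     <ul>
      <li>Increased demand for refurbished girth gear segments</li>
     </ul>
    </li>
   </ul>
  </li>
 </ul>
</div>
"""


def child_items(container, heading):
    """Hypothetical helper: text of the <li> items directly under the
    <li> whose own text is `heading`; an empty list if nothing is found."""
    # find(string=...) returns the matching text node, not a tag, so the
    # regex allows for the whitespace that surrounds the label in the markup.
    pattern = re.compile(r"^\s*" + re.escape(heading) + r"\s*$")
    label = container.find(string=pattern)
    if label is None:
        return []
    parent_li = label.find_parent("li")
    # Only look at the <ul> that is a direct child of the matched <li>.
    sublist = parent_li.find("ul", recursive=False) if parent_li else None
    if sublist is None:
        return []
    # recursive=False keeps deeper nested items out of the result.
    return [li.get_text(strip=True) for li in sublist.find_all("li", recursive=False)]


soup = BeautifulSoup(html, "html.parser")
toc = soup.find("div", id="toc")
drivers = child_items(toc, "Market drivers")
print(drivers)
```

Because the helper returns an empty list when the heading or its nested <ul> is missing, it should also cope with pages that have no items under "Market drivers", and it never runs on into the "Market challenges" list.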