Extract specific text from scrape of <li> tag

Time:12-09

I'm scraping a web page and need to pull one item from a bulleted list.

I can't use a fixed index, as in the code below, because the length of the list changes from page to page. The link I'm using to test is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/

import requests
from bs4 import BeautifulSoup

url = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'

response = requests.get(url)
# Parse the response before searching it
soup = BeautifulSoup(response.text, 'html.parser')
repo = soup.find('div', class_="tabs").find_all('li')[2]
print(repo.text.strip())

The code below pulls the entire list, but I only need to extract the "MFG#" item from the output:

import requests
from bs4 import BeautifulSoup

url = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'

response = requests.get(url)
# Parse the response before searching it
soup = BeautifulSoup(response.text, 'html.parser')
repo = soup.find('div', class_="tabs").find_all('li')
print(repo)

This is the output I'm trying to pull the "MFG#" from:

[<li >
<a aria-controls="Features" aria-selected="true"  data-toggle="tab" href="#mobile-features" id="mobile-tab1" role="tab">
                        Features
                    </a>
</li>, <li >
<a aria-controls="Specifications" aria-selected="false"  data-toggle="tab" href="#mobile-specifications" id="mobile-tab2" role="tab">
                        Specifications
                    </a>
</li>, <li>2.9 cu. Ft. oven capacity</li>, <li>NEW Sensi-Temp Technology</li>, <li>Standard clean oven</li>, <li>(3) 6" 1250W &amp; (1) 8" 2400W coil heating element</li>, <li>2 oven racks</li>, <li>Includes broiler pan with grid</li>, <li>Lift-Up cooktop</li>, <li>Chrome drip bowls</li>, <li>ADA Compliant</li>, <li>41-7/8"H x 23-3/4"W x 26-5/8"D</li>, <li>White</li>, <li>MFG# RAS240DMWW</li>, <li>Power cord not included</li>, <li>
                                Mfg:
                                RAS240DMWW
                            </li>, <li>
                                Color:
                                White
                            </li>, <li>
                                Height:
                                41-7/8"
                            </li>, <li>
                                Width:
                                23-3/4"
                            </li>, <li>
                                Depth:
                                26-5/8"
                            </li>, <li>
                                Size:
                                3.0 cu ft.
                            </li>, <li>
                                Type:
                                Electric
                            </li>, <li>
                                ADA Compliant:
                                True
                            </li>, <li>
                                Page:
                                32
                            </li>]

CodePudding user response:

Filter the list items and keep only those whose text contains "MFG".

For example:

import requests
from bs4 import BeautifulSoup

url = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'

response = requests.get(url)
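# Keep the text of each <li> in the first matching <ul> that mentions "MFG"
mfg_items = [
    li.get_text() for li in
    BeautifulSoup(response.text, "lxml")
    .select_one(".Chadwell-Pages-CatalogEntry .tabs .tab-content ul")
    .find_all("li")
    if "MFG" in li.get_text()
]
print(mfg_items)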

Output:

['MFG# RAS240DMWW']
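
The result above still carries the "MFG#" label. If only the part number is wanted, a regular expression can strip the label once the list-item texts are collected. This is a minimal sketch, not part of the original answer: the `li_texts` sample is copied by hand from the output in the question, and `MFG_PATTERN` and `extract_part_numbers` are hypothetical names. It assumes the part number is the first run of non-space characters after the label.

```python
import re

# Sample list-item texts, copied from the output shown in the question
# (in practice: [li.get_text() for li in soup.find_all("li")]).
li_texts = [
    "White",
    "MFG# RAS240DMWW",
    "Power cord not included",
    "\n        Mfg:\n        RAS240DMWW\n    ",
]

# "MFG#" or "Mfg:" in any letter case, then the part number.
# Assumption: the part number contains no whitespace.
MFG_PATTERN = re.compile(r"mfg\s*[#:]\s*(\S+)", re.IGNORECASE)


def extract_part_numbers(texts):
    """Return the distinct part numbers found in the given strings, in order."""
    numbers = []
    for text in texts:
        match = MFG_PATTERN.search(text)
        if match and match.group(1) not in numbers:
            numbers.append(match.group(1))
    return numbers


print(extract_part_numbers(li_texts))  # ['RAS240DMWW']
```

Because the pattern searches each string rather than relying on position, it keeps working when the list length differs between pages.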

CodePudding user response:

You can use a CSS selector with the :-soup-contains() pseudo-class to find a tag that contains specific text:

import requests
from bs4 import BeautifulSoup

url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")


mfg = soup.select_one("li:-soup-contains(Mfg)").text
print(mfg.split(":")[-1].strip())

Prints:

RAS240DMWW
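
Each answer above searches for one spelling of the label ("MFG" or "Mfg") and assumes the item exists. As a hedged alternative, not the method used in either answer, `find()` with a compiled regular expression passed as `string=` can match both spellings and return `None` when a page has no such item. The HTML fragment is a shortened copy of the output in the question, and `MFG_LABEL` and `find_mfg` are hypothetical names.

```python
import re
from bs4 import BeautifulSoup

# Shortened fragment based on the output in the question; not fetched live.
SAMPLE_HTML = """
<div class="tabs">
  <ul>
    <li>White</li>
    <li>MFG# RAS240DMWW</li>
    <li>Power cord not included</li>
  </ul>
  <ul>
    <li>
        Mfg:
        RAS240DMWW
    </li>
  </ul>
</div>
"""

# Matches a leading "MFG#" or "Mfg:" label in any letter case.
MFG_LABEL = re.compile(r"^\s*mfg\s*[#:]", re.IGNORECASE)


def find_mfg(html):
    """Return the text after the first MFG label, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # string= tests the regex against the <li>'s single text child
    li = soup.find("li", string=MFG_LABEL)
    if li is None:
        return None
    # Drop the label and surrounding whitespace, keep the part number
    return MFG_LABEL.sub("", li.get_text(), count=1).strip()


print(find_mfg(SAMPLE_HTML))
print(find_mfg("<ul><li>White</li></ul>"))
```

Note that `string=` only matches tags whose content is a single text node, which holds for the list items shown in the question; an `<li>` with nested tags would need a different check.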