How to select and scrape specific texts out of a bunch <ul> and <li>?

Time:05-15

I need to scrape "2015" and "09/09/2015" from the below link:

lacentrale.fr/auto-occasion-annonce-87102353714.html

But since there are many li and ul elements, I can't scrape the exact text. I used the code below. Your help is highly appreciated.

from bs4 import BeautifulSoup 
soup = BeautifulSoup(HTML)
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()

CodePudding user response:

Try:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}

url = "https://www.lacentrale.fr/auto-occasion-annonce-87102353714.html"

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

# "+ span" selects the span that immediately follows the matching label
v1 = soup.select_one('.optionLabel:-soup-contains("Année") + span')
v2 = soup.select_one(
    '.optionLabel:-soup-contains("Mise en circulation") + span'
)

print(v1.text)
print(v2.text)

Prints:

2015
09/09/2015
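
The same selector can be tried offline. The sketch below is only an illustration: SAMPLE_HTML is hypothetical markup that assumes each label sits in a .optionLabel element directly followed by a sibling span with the value, and option_value is a made-up helper name. It also returns None for a missing label instead of raising on .text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page): each label lives in
# a .optionLabel element and its value in the span that directly follows it.
SAMPLE_HTML = """
<ul>
  <li><span class="optionLabel"><button>Année</button></span><span>2015</span></li>
  <li><span class="optionLabel"><button>Mise en circulation</button></span><span>09/09/2015</span></li>
  <li><span class="optionLabel"><button>Énergie</button></span><span>Electrique</span></li>
</ul>
"""


def option_value(soup, label):
    """Return the text of the span right after the given label, or None."""
    node = soup.select_one(f'.optionLabel:-soup-contains("{label}") + span')
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
year = option_value(soup, "Année")
first_registration = option_value(soup, "Mise en circulation")
missing = option_value(soup, "Garantie")  # this label is not in the sample

print(year, first_registration, missing, sep="\n")
```

Checking for None matters on a live page, where a label can be absent from a given listing.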

CodePudding user response:

I'm a fan of CSS selectors and :-soup-contains(), as mentioned in @Andrej's answer. Here is an alternative approach, in case you end up needing more of the options.

Generate a dict of all options, keyed by option label, and pick the relevant values from it:

data = dict((e.button.text,e.find_next('span').text) for e in soup.select('.optionLabel'))

data looks like:

{'Année': '2015', 'Mise en circulation': '09/09/2015', 'Contrôle technique': 'requis', 'Kilométrage compteur': '68 736 Km', 'Énergie': 'Electrique', 'Rechargeable': 'oui', 'Autonomie batterie': '190 Km', 'Capacité batterie': '22 kWh', 'Boîte de vitesse': 'automatique', 'Couleur extérieure': 'gris foncé metal', 'Couleur intérieure': 'cuir noir', 'Nombre de portes': '5', 'Nombre de places': '4', 'Garantie': '6 mois', 'Première main (déclaratif)': 'non', 'Nombre de propriétaires': '2', 'Puissance fiscale': '3 CV', 'Puissance din': '102 ch', 'Puissance moteur': '125 kW', "Crit'Air": '0', 'Émissions de CO2': '0 g/kmA', 'Norme Euro': 'EURO6', 'Prime à la conversion': ''}
Example
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}
url = 'https://www.lacentrale.fr/auto-occasion-annonce-87102353714.html'

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

data = dict((e.button.text,e.find_next('span').text) for e in soup.select('.optionLabel'))

print(data['Année'], data['Mise en circulation'], sep='\n')
Output
2015
09/09/2015
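
A slightly more defensive variant of the dict approach, as a sketch only: SAMPLE_HTML is hypothetical markup assumed to mirror the page, and options_to_dict is a made-up helper name. It uses find_next_sibling instead of find_next, so a label without a value span is skipped rather than paired with an unrelated span further down the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page).
SAMPLE_HTML = """
<ul>
  <li><span class="optionLabel"><button>Année</button></span> <span>2015</span></li>
  <li><span class="optionLabel"><button>Mise en circulation</button></span> <span>09/09/2015</span></li>
  <li><span class="optionLabel"><button>Prime à la conversion</button></span> <span></span></li>
  <li><span class="optionLabel"><button>Garantie</button></span></li>
</ul>
"""


def options_to_dict(soup):
    """Map each option label to the text of its sibling value span."""
    data = {}
    for label in soup.select(".optionLabel"):
        value = label.find_next_sibling("span")
        if value is None:
            continue  # label has no value span next to it, so skip it
        data[label.get_text(strip=True)] = value.get_text(strip=True)
    return data


data = options_to_dict(BeautifulSoup(SAMPLE_HTML, "html.parser"))

print(data["Année"], data["Mise en circulation"], sep="\n")
print(data.get("Garantie", "n/a"))  # .get() avoids a KeyError for absent keys
```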