I am trying to scrape a website whose data is spread across multiple tables. My plan is to use three variables (oem, model, leadtime) to generate the desired output, but I cannot figure out how to scrape this page into those three variables. I am new to Python and BeautifulSoup, so I would highly appreciate your feedback.
Desired output, using three variables and the command print(oem, model, leadtime):
Renault, Mégane E-Tech, 12 Monate
Nissan, Ariya, 6 Monate
...
Volvo, XC90, 10-12 Monate
Output as of now:
Renault Mégane E-Tech12 Monate
Nissan Ariya6 Monate
Peugeot e-2086-7 Monate
KIA Sportage5-6 Monate6-7 Monate (Hybrid)
Jeep Compass3-5 Monate3-5 Monate (Hybrid)
VW Taigo3-6 Monate
...
XC9010-12 Monate
Code as of now:
from bs4 import BeautifulSoup
import requests
# Inputs/URLs to scrape:
URL = 'https://www.carwow.de/neuwagen-lieferzeiten#gref'
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for card in overview.find_all('tbody'):
    for model2 in card.find_all('tr'):
        model = model2.text.replace('Angebote vergleichen', '')
        # oem? --> this needs to be defined
        # leadtime? --> this needs to be defined
        print(model)
CodePudding user response:
The brand name is inside the h3 tag. You can get its parent container with .find_all("div", {"class": "expandable-content-container"}) and then read the heading and the table rows from each container:
from bs4 import BeautifulSoup
import requests
# Inputs/URLs to scrape:
URL = 'https://www.carwow.de/neuwagen-lieferzeiten#gref'
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
for el in overview.find_all("div", {"class": "expandable-content-container"}):
    header = el.find("h3").text.strip()
    # Skip the "Top 10" lists and the FAQ-style sections whose heading is a question.
    if not header.startswith("Top 10") and not header.endswith("?"):
        # The first row is the table header; the last cell holds the "Angebote vergleichen" link.
        for row in el.find_all("tr")[1:]:
            model_monate = ", ".join(td.text for td in row.find_all("td")[:-1])
            print(f"{header}, {model_monate}")
        print("----")
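Since the question asks for three separate variables, the same traversal can unpack the cells instead of joining them into one string. The following is only a minimal sketch: SAMPLE_HTML is made-up markup shaped like what the selectors above assume (a container div with an h3 and a table whose first row is a header), and iter_leadtimes is a hypothetical helper name. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped like what the selectors in this answer assume.
SAMPLE_HTML = """
<div class="expandable-content-container">
  <h3>Renault</h3>
  <table><tbody>
    <tr><td>Modell</td><td>Lieferzeit</td><td></td></tr>
    <tr><td>Mégane E-Tech</td><td>12 Monate</td><td><a href="#">Angebote vergleichen</a></td></tr>
  </tbody></table>
</div>
<div class="expandable-content-container">
  <h3>Volvo</h3>
  <table><tbody>
    <tr><td>Modell</td><td>Lieferzeit</td><td></td></tr>
    <tr><td>XC90</td><td>10-12 Monate</td><td><a href="#">Angebote vergleichen</a></td></tr>
  </tbody></table>
</div>
"""


def iter_leadtimes(html):
    """Yield (oem, model, leadtime) tuples, one per table row."""
    soup = BeautifulSoup(html, "html.parser")
    for section in soup.find_all("div", class_="expandable-content-container"):
        heading = section.find("h3")
        if heading is None:
            continue  # no brand heading, nothing to label the rows with
        oem = heading.get_text(strip=True)
        for row in section.find_all("tr")[1:]:  # skip the header row
            cells = row.find_all("td")
            if len(cells) < 2:
                continue  # not a model/lead-time row
            model = cells[0].get_text(strip=True)
            leadtime = cells[1].get_text(" ", strip=True)
            yield oem, model, leadtime


rows = list(iter_leadtimes(SAMPLE_HTML))
for oem, model, leadtime in rows:
    print(f"{oem}, {model}, {leadtime}")
```

To run this against the live page, pass response.text to iter_leadtimes instead of SAMPLE_HTML and keep the "Top 10" and question-heading filter from the code above.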
CodePudding user response:
The parts of the car model info that you're trying to scrape are actually stored in separate td tags, which means you can access each cell by index to get the corresponding value. Try the code below.
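import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.carwow.de/neuwagen-lieferzeiten#gref")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
for tbody in soup.select('tbody'):
    # Select the rows explicitly: iterating over tbody directly also yields whitespace text nodes.
    for tr in tbody.select('tr'):
        links = tr.select('td > a')
        if not links:
            continue  # skip rows without a model link
        # Assumes an absolute href of the form https://www.carwow.de/<brand>/<model>,
        # so index 3 is the brand slug and index 4 the model slug.
        parts = links[0].get('href').split('/')
        brand = parts[3].capitalize()
        model = parts[4].capitalize()
        monate = tr.select('td')[1].get_text(strip=True)
        print(f'{brand}, {model}, {monate}')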
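Some cells in the question's output hold more than one lead time run together, for example "5-6 Monate6-7 Monate (Hybrid)". One possible way to separate them is a regular expression applied to the cell text. This is a sketch only: the pattern assumes every lead time looks like "N Monate" or "N-M Monate" with an optional note in parentheses, and split_leadtimes is a hypothetical helper name.

```python
import re

# One lead time: "12 Monate" or "5-6 Monate", optionally followed by a
# parenthesised note such as "(Hybrid)". Assumed format, not verified
# against every row of the page.
LEADTIME_RE = re.compile(r"\d+(?:-\d+)?\s*Monate(?:\s*\([^)]*\))?")


def split_leadtimes(cell_text):
    """Return every lead time found in a table cell's text, in order."""
    return LEADTIME_RE.findall(cell_text)


print(split_leadtimes("5-6 Monate6-7 Monate (Hybrid)"))
print(" / ".join(split_leadtimes("3-5 Monate3-5 Monate (Hybrid)")))
```

In the loop above, the result could replace monate, for instance joined with " / " so that each row still prints as a single line.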