How to scrape multiple items in a <tr> tag and split them into 3 variables with Python?

Time:08-05

I am trying to scrape a website whose table rows (<tr>) each contain several items. My plan is to have 3 variables (oem, model, leadtime) to generate the desired output. However, I cannot figure out how to split the scraped rows into these 3 variables. I am new to Python and BeautifulSoup, so I would highly appreciate your feedback.

Desired output with 3 variables and the command print(oem, model, leadtime):

Renault, Mégane E-Tech, 12 Monate
Nissan, Ariya, 6 Monate
...
Volvo, XC90, 10-12 Monate

Output as of now:

Renault Mégane E-Tech12 Monate
Nissan Ariya6 Monate
Peugeot e-2086-7 Monate
KIA Sportage5-6 Monate6-7 Monate (Hybrid)
Jeep Compass3-5 Monate3-5 Monate (Hybrid)
VW Taigo3-6 Monate
...
XC9010-12 Monate

Code as of now:

from bs4 import BeautifulSoup
import requests


#Inputs/URLs to scrape: 
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()

for card in overview.find_all('tbody'):
    for model2 in card.find_all('tr'):
        model = model2.text.replace('Angebote vergleichen', '')
        #oem?-->this needs to be defined
        #leadtime?--> this needs to defined
        print(model)

CodePudding user response:

The brand name is inside an h3 tag. You can get its parent container with .find_all("div", {"class": "expandable-content-container"}) and then read the table rows inside each container:

from bs4 import BeautifulSoup
import requests


#Inputs/URLs to scrape: 
URL = ('https://www.carwow.de/neuwagen-lieferzeiten#gref')
(response := requests.get(URL)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()

for el in overview.find_all("div", {"class": "expandable-content-container"}):
    header = el.find("h3").text.strip()
    if not header.startswith("Top 10") and not header.endswith("?"):
        for row in el.find_all("tr")[1:]:
            # Join the text of every cell except the last one
            model_monate = ", ".join(td.text for td in row.find_all("td")[:-1])
            print(f"{header}, {model_monate}")
        print("----")
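
The loop above prints one joined string per row, while the question asks for three separate variables. A minimal sketch of that last step follows. It runs on a small hypothetical HTML sample rather than the live page, so the markup (one container per brand, an h3 with the brand name, model and lead time in the first two td cells) is an assumption based on the answer above, and extract_rows is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the assumed page structure: one container
# per brand, an <h3> with the brand name, and a table whose rows hold the
# model, the lead time, and a trailing link cell.
SAMPLE_HTML = """
<div class="expandable-content-container">
  <h3>Renault</h3>
  <table><tbody>
    <tr><th>Modell</th><th>Lieferzeit</th><th></th></tr>
    <tr><td>Mégane E-Tech</td><td>12 Monate</td><td>Angebote vergleichen</td></tr>
  </tbody></table>
</div>
<div class="expandable-content-container">
  <h3>KIA</h3>
  <table><tbody>
    <tr><th>Modell</th><th>Lieferzeit</th><th></th></tr>
    <tr><td>Sportage</td><td>5-6 Monate<br>6-7 Monate (Hybrid)</td><td>Angebote vergleichen</td></tr>
  </tbody></table>
</div>
"""


def extract_rows(html):
    """Return a list of (oem, model, leadtime) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for container in soup.find_all("div", class_="expandable-content-container"):
        heading = container.find("h3")
        if heading is None:
            continue  # no brand name to attach to the rows
        oem = heading.get_text(strip=True)
        for tr in container.find_all("tr"):
            cells = tr.find_all("td")
            if len(cells) < 2:
                continue  # header rows use <th>, so they have no <td> cells
            model = cells[0].get_text(strip=True)
            # A separator keeps several lead times in one cell apart.
            leadtime = cells[1].get_text(" / ", strip=True)
            rows.append((oem, model, leadtime))
    return rows


for oem, model, leadtime in extract_rows(SAMPLE_HTML):
    print(f"{oem}, {model}, {leadtime}")
```

Passing a separator to get_text is one way to avoid run-together values such as "5-6 Monate6-7 Monate (Hybrid)" from the question. On the real page, the header filter from the loop above ("Top 10" and headings ending in "?") would probably still be needed.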

CodePudding user response:

The parts of the car model info that you're trying to scrape are actually stored in separate td tags, which means you can access them by index to get the corresponding info. Try the code below.

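import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.carwow.de/neuwagen-lieferzeiten#gref")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

for tbody in soup.select('tbody'):
    # find_all('tr') skips the whitespace text nodes that iterating over tbody directly can yield
    for tr in tbody.find_all('tr'):
        link = tr.select_one('td > a')
        cells = tr.select('td')
        if link is None or len(cells) < 2:
            continue  # skip header rows and rows without a model link
        # The href is assumed to look like https://www.carwow.de/<brand>/<model>
        parts = link.get('href', '').split('/')
        if len(parts) < 5:
            continue
        brand = parts[3].capitalize()
        model = parts[4].capitalize()
        monate = cells[1].getText(strip=True)
        print(f'{brand}, {model}, {monate}')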
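
Splitting the href by fixed positions only works if every link is an absolute URL of the same shape. As a rough sketch, the same idea can be written with urllib.parse so that absolute and relative links behave alike. The link shapes used here and the helper name brand_and_model are assumptions for illustration, not something confirmed about the page.

```python
from urllib.parse import urlparse


def brand_and_model(href):
    """Split an href of the assumed form /<brand>/<model> into two names."""
    # urlparse separates scheme, host, query, and fragment from the path,
    # so absolute and relative links are handled the same way.
    segments = [s for s in urlparse(href).path.split("/") if s]
    if len(segments) < 2:
        return None  # not a brand/model link
    brand, model = segments[0], segments[1]
    # Slugs are lowercase and hyphenated; this only approximates display names.
    return brand.replace("-", " ").title(), model.replace("-", " ").title()


print(brand_and_model("https://www.carwow.de/renault/megane-e-tech#gref"))
print(brand_and_model("/volvo/xc90"))
```

Names rebuilt from a URL slug lose accents and original capitalization (for example "Mégane E-Tech" or "XC90"), so reading the visible cell text is likely the better choice when the output has to match the page.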