I'm trying to scrape the website https://lt.brcauto.eu/
and need to collect at least 50 cars from it. I go from the main page to the car search page and start scraping from the first result. However, each page lists only 21 cars, so when the cars on a page run out and the parser should move on to the next page, I get a "list index out of range" error.
This is how I'm trying to scrape:
import json
import requests
from bs4 import BeautifulSoup

mainURL = 'https://lt.brcauto.eu/'
req1 = requests.get(mainURL)
soup1 = BeautifulSoup(req1.text, 'lxml')

link = soup1.find('div', class_='home-nav flex flex-wrap')
temp = link.findAll("a")  # find the search link
URL = temp[1].get('href') + '/'

req2 = requests.get(URL)
soup2 = BeautifulSoup(req2.text, 'lxml')
page = soup2.find_all('li', class_='page-item')[-2]  # last page number, just before ">"

cars_printed_counter = 0
for number in range(1, int(page.text)):  # from 1 until max page
    req2 = requests.get(URL + '?page=' + str(number))  # page url
    soup2 = BeautifulSoup(req2.text, 'lxml')
    if cars_printed_counter == 50:
        break  # for faster execution

out = []  # holds all cars
for single_car in soup2.find_all('div', class_='cars-wrapper'):
    if cars_printed_counter == 50:
        break  # stop after 50 cars
    Car_Title = single_car.find('h2', class_='cars__title')
    Car_Specs = single_car.find('p', class_='cars__subtitle')
    #print('\nCar number:', cars_printed_counter + 1)
    #print(Car_Title.text)
    #print(Car_Specs.text)
    car = {}
    spl = Car_Specs.text.split(' | ')
    car["fuel"] = spl[1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl[0])
    car["run"] = int(spl[3].split(" ")[0])
    car["type"] = spl[5]
    car["number"] = cars_printed_counter + 1
    out.append(car)
    cars_printed_counter += 1

print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
I have noticed that if I only print the cars, like this:

for single_car in soup.find_all('div', class_='cars-wrapper'):
    if cars_printed_counter == 50:
        break
    Car_Title = single_car.find('h2', class_='cars__title')
    Car_Specs = single_car.find('p', class_='cars__subtitle')
    Car_Price = single_car.find('div', class_='w-full lg:w-auto cars-price text-right pt-1')
    print('\nCar number:', cars_printed_counter + 1)
    print(Car_Title.text)
    print(Car_Specs.text)
    print(Car_Price.text)
    cars_printed_counter += 1

everything is okay. But as soon as I try to write them into JSON format, like this:
car = {}
spl = Car_Specs.text.split(' | ')
car["fuel"] = spl[1].split(" ")[1]
car["Title"] = str(Car_Title.text)
car["Year"] = int(spl[0])
car["run"] = int(spl[3].split(" ")[0])
car["type"] = spl[5]
car["number"] = cars_printed_counter + 1
out.append(car)
cars_printed_counter += 1

print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))

I get an error that the list index is out of range.
P.S. Or should I already be using multithreading here?
CodePudding user response:
This solution worked for me:
car = {}
spl = Car_Specs.text.split(' | ')
if spl[1].split(" ")[0] == 'Elektra':  # break on electric cars
    break
car["fuel"] = spl[1].split(" ")[1]
car["Title"] = str(Car_Title.text)
car["Year"] = int(spl[0])
car["run"] = int(spl[3].split(" ")[0])
car["type"] = spl[5]
car["number"] = cars_printed_counter + 1
out.append(car)
cars_printed_counter += 1

print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
So I added:

if spl[1].split(" ")[0] == 'Elektra':
    break

because the second element of the scraped specs is the fuel type, which normally also contains the engine size in litres. When the scraper meets an electric car, the dict cannot be built, because electric cars have no litre value; after splitting on the space, [0] is just the fuel type.
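To see the failure mode concretely, here is a minimal sketch. The electric spec list is the real one shown in the answer below; the petrol line is a hypothetical example in the same "a | b | c" format:

# Hypothetical petrol car: the fuel entry carries a litre value after the space
petrol = '2013 | Benzinas 2.0 | Automatinė | 108030 km | 310 kW (422 AG) | Mėlyna'.split(' | ')
print(petrol[1].split(" ")[1])    # '2.0' -> fine

# Electric car (format taken from the site): the fuel entry is the single word 'Elektra'
electric = '2013 | Elektra | Automatinė | 108030 km | 310 kW (422 AG) | Mėlyna'.split(' | ')
print(electric[1].split(" ")[1])  # IndexError: list index out of range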
CodePudding user response:
First of all: put aside the thought of multithreading for a moment. There are other issues with your code.

As mentioned, check the indentation in your question's code; as posted it does not make sense, because you iterate over all the pages but only scrape the last one.

The issue that causes the

IndexError: list index out of range

Print your spl and you will see the problem: this car does not run on a combustion engine:

['2013', 'Elektra', 'Automatinė', '108030 km', '310 kW (422 AG)', 'Mėlyna']

Selecting the index the way you do, car["fuel"] = spl[1].split(" ")[1], causes the error. Instead, take the last element of the list:

car["fuel"] = spl[1].split(" ")[-1]
Example

Your indentation should look more like this, so that you iterate over all the pages and collect the car information in out, which lives outside all the loops:
...
cars_printed_counter = 0
out = []  # holds all cars

for number in range(1, int(page.text)):  # from 1 until max page
    req2 = requests.get(URL + '?page=' + str(number))  # page url
    soup2 = BeautifulSoup(req2.text, 'lxml')

    if cars_printed_counter == 50:
        break  # for faster execution

    for single_car in soup2.find_all('div', class_='cars-wrapper'):
        if cars_printed_counter == 50:
            break  # stop after 50 cars
        Car_Title = single_car.find('h2', class_='cars__title')
        Car_Specs = single_car.find('p', class_='cars__subtitle')

        car = {}
        spl = Car_Specs.text.split(' | ')
        print(spl)
        car["fuel"] = spl[1].split(" ")[-1]
        car["Title"] = str(Car_Title.text)
        car["Year"] = int(spl[0])
        car["run"] = int(spl[3].split(" ")[0])
        car["type"] = spl[5]
        car["number"] = cars_printed_counter + 1
        out.append(car)
        cars_printed_counter += 1

# print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
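One optional refinement, if you want the Lithuanian values (e.g. 'Mėlyna', 'Automatinė') to stay human-readable in the output file: json.dump writes straight to the file handle, and ensure_ascii=False keeps them from being \u-escaped:

# Optional: write the JSON directly and keep non-ASCII characters readable
with open("outfile.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False, indent=2)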