Scraper does not go to next pages in Python


I'm trying to scrape the website https://lt.brcauto.eu/ and need to collect at least 50 cars from it. I go from the main page to the car search page and start scraping from the first result. However, each page lists only 21 cars, so when the cars on a page run out and the parser should move on to the next page, I get a "list index out of range" error. This is how I'm trying to scrape:

import json
import requests
from bs4 import BeautifulSoup

mainURL = 'https://lt.brcauto.eu/'

req1 = requests.get(mainURL)
soup1 = BeautifulSoup(req1.text, 'lxml')

link = soup1.find('div', class_ = 'home-nav flex flex-wrap')
temp = link.findAll("a") # find search link
URL = (temp[1].get('href') + '/')

req2 = requests.get(URL)
soup2 = BeautifulSoup(req2.text, 'lxml')

page = soup2.find_all('li', class_ = 'page-item')[-2] # max page number (the item just before the ">" arrow)

cars_printed_counter = 0

for number in range(1, int(page.text)): #from 1 until max page
  req2 = requests.get(URL + '?page=' + str(number)) # page URL
  soup2 = BeautifulSoup(req2.text, 'lxml')

  if cars_printed_counter == 50:
      break # stop early, for faster execution

out = [] # holding all cars

for single_car in soup2.find_all('div', class_ = 'cars-wrapper'):

    if cars_printed_counter == 50:
        break # after 50 cars

    Car_Title = single_car.find('h2', class_ = 'cars__title')
    Car_Specs = single_car.find('p', class_ = 'cars__subtitle')


    #print('\nCar number:', cars_printed_counter + 1)
    #print(Car_Title.text)
    #print(Car_Specs.text)
    
    car = {}
    spl = Car_Specs.text.split(' | ')
    car["fuel"] = spl [1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl [0])
    car["run"] = int(spl [3].split(" ")[0])
    car["type"] = spl [5]
    car["number"] = cars_printed_counter   1
    out.append(car)
    cars_printed_counter  = 1

print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))

I have noticed that if I only print the cars, like this:

for single_car in soup.find_all('div', class_ = 'cars-wrapper'):

    if cars_printed_counter == 50:
        break

    Car_Title = single_car.find('h2', class_ = 'cars__title')
    Car_Specs = single_car.find('p', class_ = 'cars__subtitle')
    Car_Price = single_car.find('div', class_ = 'w-full lg:w-auto cars-price text-right pt-1')

    print('\nCar number:', cars_printed_counter + 1)

    print(Car_Title.text)
    print(Car_Specs.text)
    print(Car_Price.text)

    cars_printed_counter += 1

Everything is okay. But as soon as I try to write them out as JSON, like this:

    car = {}
    spl = Car_Specs.text.split(' | ')
    car["fuel"] = spl[1].split(" ")[1]
    car["Title"] = str(Car_Title.text)
    car["Year"] = int(spl[0])
    car["run"] = int(spl[3].split(" ")[0])
    car["type"] = spl[5]
    car["number"] = cars_printed_counter + 1
    out.append(car)

    cars_printed_counter += 1

print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))

I get an error that the list index is out of range.

P.S. Or should I be using multithreading here already?

CodePudding user response:

This solution worked for me:

        car = {}
        spl = Car_Specs.text.split(' | ')
        if spl[1].split(" ")[0] == 'Elektra': # break on electric cars
            break
        car["fuel"] = spl[1].split(" ")[1]
        car["Title"] = str(Car_Title.text)
        car["Year"] = int(spl[0])
        car["run"] = int(spl[3].split(" ")[0])
        car["type"] = spl[5]
        car["number"] = cars_printed_counter + 1
        out.append(car)
        cars_printed_counter += 1

    print(json.dumps(out))
    with open("outfile.json", "w") as f:
        f.write(json.dumps(out))

So I added:

if spl[1].split(" ")[0] == 'Elektra':
    break

because the second element scraped is the fuel field, which normally contains the engine displacement in litres. When the scraper meets an electric car, the dict cannot be built, because electric cars have no litre value; for them, [0] is the fuel type itself.
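For reference, here is a minimal sketch of why the indexing fails; the combustion-car field '2.0 Dyzelinas' is an assumed example of the site's litres-plus-fuel-name format, while the electric field carries only the fuel type:

# Assumed field formats: combustion cars carry litres plus the fuel name,
# electric cars carry only the word 'Elektra'
combustion = '2.0 Dyzelinas'.split(' ')  # ['2.0', 'Dyzelinas']
electric = 'Elektra'.split(' ')          # ['Elektra']

print(combustion[1])  # 'Dyzelinas' -- two tokens, so index 1 exists
print(electric[0])    # 'Elektra'   -- [0] is the fuel type
# electric[1] raises IndexError: list index out of range,
# which is exactly what the break above avoids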

CodePudding user response:

First of all: put aside the thought of multithreading for a moment. There are other issues with your code:

  • As mentioned, check the indentation in the question's code; as it stands it does not make sense, because you iterate over all the pages but only scrape the last one.

  • The issue that causes the IndexError: list index out of range

Print your spl and you will see the issue: this car does not run on a combustion engine:

['2013', 'Elektra', 'Automatinė', '108030 km', '310 kW (422 AG)', 'Mėlyna']

Selecting the index the way you do, car["fuel"] = spl[1].split(" ")[1], causes the error; instead take the last element of the list:

car["fuel"] = spl[1].split(" ")[-1]
Example

Your indentation should look more like this, so that you iterate over all pages and store the car information in out, which lives outside all the loops:

...
cars_printed_counter = 0

out = [] # holding all cars

for number in range(1, int(page.text)): #from 1 until max page
    req2 = requests.get(URL + '?page=' + str(number)) # page URL
    soup2 = BeautifulSoup(req2.text, 'lxml')

    if cars_printed_counter == 50:
        break # stop early, for faster execution

    for single_car in soup2.find_all('div', class_ = 'cars-wrapper'):

        if cars_printed_counter == 50:
            break # after 50 cars

        Car_Title = single_car.find('h2', class_ = 'cars__title')
        Car_Specs = single_car.find('p', class_ = 'cars__subtitle')

        car = {}
        spl = Car_Specs.text.split(' | ')
        print(spl)
        car["fuel"] = spl [1].split(" ")[-1]
        car["Title"] = str(Car_Title.text)
        car["Year"] = int(spl [0])
        car["run"] = int(spl [3].split(" ")[0])
        car["type"] = spl [5]
        car["number"] = cars_printed_counter   1
        out.append(car)
        cars_printed_counter  = 1

# print(json.dumps(out))
with open("outfile.json", "w") as f:
    f.write(json.dumps(out))
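Two optional refinements, unrelated to the fix itself: requests can build the ?page=N query string for you via params, and json.dump with ensure_ascii=False keeps Lithuanian characters such as 'Mėlyna' readable in the output file. A sketch under those assumptions:

import json
import time
import requests

session = requests.Session()  # reuses the TCP connection across page requests

for number in range(1, int(page.text)):
    # params builds the same URL as URL + '?page=' + str(number)
    req2 = session.get(URL, params={'page': number})
    req2.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    time.sleep(1)            # small delay to be polite to the server
    ...

with open("outfile.json", "w", encoding="utf-8") as f:
    json.dump(out, f, ensure_ascii=False)  # keep non-ASCII characters as-is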