What is going on? (Attempt at scraping multiple pages)

Time:11-13

url = "https://www.gumtree.com/search?search_category=all&q=ferrari"

while url:

    response = requests.get(url)

    soup = BeautifulSoup(response.text, "html.parser")

    name = soup.find_all("div", class_="h3-responsive")

    price = soup.find_all("strong", "h3-responsive")

    next_page = soup.select_one("li.pagination-page>a")

    for price,name in zip(name,price):
        print(name.text,price.text)

    if next_page:
        next_url = next_page.get("href")
        url = urljoin(url,next_url)
    else:
        url = None

Nothing is printing for some reason? I've had it running for 5 minutes and still nothing, and no error code either, so I'm genuinely confused here. If anyone would like to fix this script, please do. In case it isn't clear, the script is supposed to scrape the name and price from the first page as well as from every other page of results.
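Silent output with no exception is exactly what happens when `find_all` comes back with empty lists (for example, because the server rejects the request): `zip` over two empty lists yields nothing, so the loop body never runs. A minimal illustration of that behaviour:

```python
# What the print loop does when find_all() returns empty lists,
# e.g. because the site blocked the request:
names, prices = [], []

pairs = list(zip(names, prices))
for n, p in pairs:
    print(n, p)  # never executes

print(len(pairs))  # -> 0, and no error is raised
```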

If someone could please redo the script and make it scrape all the other pages as well, because I have been having a hard time figuring out how to do this.

I've taken some suggestions on board and done some editing, but the script still won't work.

CodePudding user response:

Here is a way to get those listings. (I didn't go for an open-ended loop; you're welcome to increase the page count, or check for the existence of an element on the page, like a next-page link, if you want.) There are 10 pages:

import requests
from bs4 import BeautifulSoup as bs
from tqdm import tqdm
import pandas as pd
import time as t

headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

big_list = []
for x in tqdm(range(1, 11)):
    r = s.get(f'https://www.gumtree.com/search?search_category=cars&search_location=uk&q=ferrari&page={x}')
    soup = bs(r.text, 'html.parser')
    # NOTE: the attribute selectors inside the square brackets were lost when
    # this answer was transcribed; the class fragments below are guesses --
    # inspect the live page and substitute the real ones.
    cards = soup.select('div[class*="listing"]')
    for c in cards:
        title = c.select_one('h2[class*="title"]').text.strip()
        location = c.select_one('div[class*="location"]').text.strip()
        price = c.select_one('span[class*="price"]').text.strip()
        big_list.append((title, location, price))
    t.sleep(3)
df = pd.DataFrame(big_list, columns = ['title', 'location', 'price'])
print(df)
df.to_csv('absurd_cars_uk.csv')

This will print out the dataframe and also save it to disk as CSV:

    title   location    price
0   2011 11 FERRARI CALIFORNIA 4.3 2 PLUS 2 Auto 4...   Liskeard, Cornwall  £77,500
1   2011 61 FERRARI FF 6.3 V12 4WD 660bhp with ver...   Liskeard, Cornwall  £88,500
2   2020 Ferrari 812 Superfast 6.5 V12 F1 DCT Euro...   Huthwaite, Nottinghamshire  £294,980
3   FERRARI F430 SPIDER Black AUTO Petrol, 2007 Spalding, Lincolnshire  £110,000
4   1982 Ferrari 308 GTSi Coupe Petrol Manual   Newark, Nottinghamshire £73,000
... ... ... ...
130 2001 (X) FERRARI 360 3.6 MODENA F1 2DR  Dinnington, South Yorkshire £64,880
131 2012 Ferrari 458 4.5 Italia Auto Seq 2dr COUPE...   St Albans, Hertfordshire    £139,995
132 1997 Ferrari 456 GTA - 6700 miles   Ripon, North Yorkshire  £59,995
133 2021 Ferrari Roma 3.8T V8 F1 DCT Euro 6 (s/s) ...   York, North Yorkshire   £194,995
134 2015 Ferrari California 3.8 V8 T F1 DCT Euro 6...   Loughborough, Leicestershire    £99,950
135 rows × 3 columns
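If you'd rather keep the open-ended `while url:` loop from the question than hard-code 10 pages, the missing piece is detecting the next-page link and resolving it against the current URL with `urljoin`. A minimal sketch of that step, run here against an inline HTML fragment; the `li.pagination-next > a` selector is an assumption, so check the site's real pagination markup:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def next_page_url(current_url, html):
    """Return the absolute URL of the next results page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumed selector, for illustration only -- inspect the live page
    # to find the real pagination element.
    link = soup.select_one("li.pagination-next > a")
    if link and link.get("href"):
        return urljoin(current_url, link["href"])
    return None

# Stand-in for a fetched results page:
sample = '<ul><li class="pagination-next"><a href="/search?q=ferrari&amp;page=2">Next</a></li></ul>'
print(next_page_url("https://www.gumtree.com/search?q=ferrari", sample))
# -> https://www.gumtree.com/search?q=ferrari&page=2
print(next_page_url("https://www.gumtree.com/search?q=ferrari", "<p>last page</p>"))
# -> None
```

The question's loop then ends each iteration with `url = next_page_url(url, response.text)`, combined with the browser-like headers shown above so the requests aren't blocked.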