I've got a Python script that scrapes the first page of an auction site. The site is trademe.co.nz, similar to eBay/Amazon etc. Its purpose is to scrape every listing on the first page, but only if the listing isn't already in my database. It works as expected with one caveat: it only scrapes the first 8 listings (regardless of the Trade Me URL) and then exits with code 0 in Visual Studio Code. If I run it again, it exits immediately because it thinks there are no new auction IDs. If a new listing gets added and I run the script again, it adds the new one.
from bs4 import BeautifulSoup
from time import sleep
import requests
import datetime
import sqlite3

# Standard for all scrapings
dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")

def mechanicalKeyboards():
    url = "https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc"
    category = "Mechanical Keyboards"
    dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    trademeLogo = "https://www.trademe.co.nz/images/frend/trademe-logo-no-tagline.png"
    # getCode = requests.get(url).status_code
    # print(getCode)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
    conn = sqlite3.connect('trademe.db')
    c = conn.cursor()
    c.execute('''SELECT ID FROM trademe ORDER BY DateAdded DESC ''')
    allResult = str(c.fetchall())
    for listing in listingContainer:
        title = listing.select("#-title")
        location = listing.select("#-region")
        auctionID = listing['data-aria-id'].split("-").pop()
        fullListingURL = "https://www.trademe.co.nz/a/" + auctionID
        image = listing.select("picture img")
        try:
            buyNow = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price ng-star-inserted").text.strip()
        except (IndexError, AttributeError):
            buyNow = "None"
        try:
            price = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price").text.strip()
        except (IndexError, AttributeError):
            price = "None"
        for t, l, i in zip(title, location, image):
            if auctionID not in allResult:
                print("Adding new data - " + t.text)
                c.execute(''' INSERT INTO trademe VALUES(?,?,?,?)''', (auctionID, t.text, dateAdded, fullListingURL))
                conn.commit()
                sleep(5)
I thought perhaps I was getting rate-limited, but I get a 200 status code, and changing the URL works for the first 8 listings again. I had a look at the elements and can't see any changes after the 8th listing. I'm hoping someone can assist, thanks so much.
CodePudding user response:
When you use requests.get(url) to scrape a website with lazy-loaded content, it only returns HTML that includes images for the first 8 listings. For every listing in listingContainer after the 8th, the image variable is an empty list, so zip(title, location, image) yields nothing for those listings and they are never inserted.
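The zip behaviour can be reproduced without touching the network. The sketch below is only an illustration: the dictionaries are hypothetical stand-ins for the lists that listing.select(...) returns, and rows_without_image is one possible workaround (treat the image as optional), not necessarily what the original script should do.

```python
# Hypothetical stand-ins for the lists returned by listing.select(...).
# The second listing has no <img> in the raw HTML, so its image list is empty.
listings = [
    {"id": "1001", "title": ["Keyboard A"], "location": ["Auckland"], "image": ["a.jpg"]},
    {"id": "1002", "title": ["Keyboard B"], "location": ["Wellington"], "image": []},
]


def rows_with_zip(listing):
    # Same shape as the loop in the question: zip stops at the shortest input,
    # so an empty image list means zero iterations.
    return [
        (listing["id"], t, l)
        for t, l, i in zip(listing["title"], listing["location"], listing["image"])
    ]


def rows_without_image(listing):
    # Possible workaround: the image is optional and no longer controls the loop.
    image = listing["image"][0] if listing["image"] else None
    return [
        (listing["id"], t, l, image)
        for t, l in zip(listing["title"], listing["location"])
    ]


kept_by_zip = [row for item in listings for row in rows_with_zip(item)]
kept_by_fix = [row for item in listings for row in rows_without_image(item)]

print(kept_by_zip)  # only listing 1001 survives
print(kept_by_fix)  # both listings survive; 1002 has image None
```

Since the script in the question never stores the image, dropping it from the zip may be enough on its own if the title and region are present in the raw HTML for all listings.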
To scrape this type of website properly, I would recommend a browser-automation tool such as Playwright or Selenium.
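As a rough sketch of the Playwright route, the helper below renders the page in a headless browser and returns the resulting HTML, which can then be passed to BeautifulSoup exactly as before. The function name fetch_rendered_html, the scroll distance, and the pause length are assumptions for illustration; whether scrolling is enough to trigger the lazy loading on this particular site has not been verified.

```python
def fetch_rendered_html(url, card_selector, scroll_steps=10, pause_ms=500):
    """Return the page HTML after a headless browser has rendered it.

    Hypothetical helper: the defaults are guesses, tune them for the site.
    """
    # Imported here so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait until at least one listing card is present in the DOM.
            page.wait_for_selector(card_selector)
            # Scroll down in steps so lazy-loaded images get a chance to load.
            for _ in range(scroll_steps):
                page.mouse.wheel(0, 1500)
                page.wait_for_timeout(pause_ms)
            return page.content()
        finally:
            browser.close()
```

In the question's script this would replace r = requests.get(url): call fetch_rendered_html(url, ".tm-marketplace-search-card__wrapper") and build the soup from the returned string instead of r.text. Playwright needs its browser binaries installed first (pip install playwright, then playwright install).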