Beautifulsoup - Python For loop only runs 8 times then exits with code 0 in visual studio code

Time:02-01

I've got a Python script that scrapes the first page of an auction site, trademe.co.nz (similar to eBay/Amazon etc.). Its purpose is to scrape every listing on the first page, but only if the listing isn't already in my database. It's working as expected with one caveat: it only scrapes the first 8 listings (regardless of which Trade Me URL I use) and then exits with code 0 in Visual Studio Code. If I run it again it exits immediately, as it thinks there are no new auction IDs. If a new listing gets added and I run the script again, it will add the new one.

from bs4 import BeautifulSoup
from time import sleep
import requests
import datetime
import sqlite3

# Standard for all scrapings
dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")

def mechanicalKeyboards():

  url = "https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc"
  category = "Mechanical Keyboards"  
  dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")
  trademeLogo = "https://www.trademe.co.nz/images/frend/trademe-logo-no-tagline.png"
  
  # getCode = requests.get(url).status_code
  # print(getCode)
  
  r = requests.get(url)
  soup = BeautifulSoup(r.text, "html.parser")
  
  listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
  conn = sqlite3.connect('trademe.db')
  c = conn.cursor() 
  c.execute('''SELECT ID FROM trademe ORDER BY DateAdded DESC ''')
  allResult = str(c.fetchall())
    
  for listing in listingContainer:
    title = listing.select("#-title")
    location = listing.select("#-region")
    auctionID = listing['data-aria-id'].split("-").pop()
    fullListingURL = "https://www.trademe.co.nz/a/" + auctionID
    image = listing.select("picture img")
    
    try:
      buyNow = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price ng-star-inserted").text.strip()
    except:
      buyNow = "None"
    
    try:
      price = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price").text.strip()
    except:
      price = "None"

    for t, l, i in zip(title, location, image):
      if auctionID not in allResult:
        print("Adding new data - " + t.text)
        c.execute(''' INSERT INTO trademe VALUES(?,?,?,?)''', (auctionID, t.text, dateAdded, fullListingURL))
        conn.commit()
        sleep(5)

I thought perhaps I was getting rate-limited, but I get a 200 status code, and changing the URL works for the first 8 listings again. I had a look at the elements and can't see any change after the 8th listing. I'm hoping someone can assist, thanks so much.

CodePudding user response:

requests.get(url) does not run JavaScript, and this site lazy-loads its content, so the HTML it returns only contains images for the first 8 listings. For every listing after the 8th, listing.select("picture img") returns an empty list. zip() stops at its shortest input, so zip(title, location, image) yields nothing for those listings and the insert never runs.
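
The zip() behaviour can be reproduced without touching the site. A minimal sketch, in which the names title, location and image mirror the question, while the values are made-up stand-ins and only the list lengths matter:

```python
# Stand-ins for the per-listing results of listing.select(...).
# The values are hypothetical; only the list lengths matter here.
title = ["Keychron K2"]
location = ["Auckland"]

image_loaded = ["<img src='k2.jpg'>"]  # listing whose image is present in the HTML
image_missing = []                     # lazy-loaded listing: select() found nothing

# zip() stops at its shortest input, so an empty list means zero iterations.
rows_loaded = list(zip(title, location, image_loaded))
rows_missing = list(zip(title, location, image_missing))

print(len(rows_loaded))   # 1
print(len(rows_missing))  # 0
```

Since the INSERT in the question never uses the image, dropping image from the zip() call might be enough on its own, assuming the title and region are present in the returned HTML for every card.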

To scrape this type of website properly, I would recommend using a browser automation tool such as Playwright or Selenium.
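
As a rough illustration of the Playwright route, the sketch below renders the page in a headless browser and scrolls it before handing back the HTML. The helper name fetch_rendered_html, the number of scroll steps and the pause length are all assumptions made for this example, and whether scrolling is what triggers the lazy loader on this particular site is untested.

```python
def fetch_rendered_html(url, scroll_steps=10, pause_ms=500):
    """Return the page HTML after a headless browser has rendered and scrolled it."""
    # Imported here so the module still loads where Playwright is not installed
    # (pip install playwright, then: playwright install chromium).
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Assumption: scrolling makes the lazy-loaded images appear.
        # The step count and pause are guesses and may need tuning.
        for _ in range(scroll_steps):
            page.mouse.wheel(0, 2000)
            page.wait_for_timeout(pause_ms)
        html = page.content()
        browser.close()
    return html
```

In the question's script, this would replace the requests.get(url) call: soup = BeautifulSoup(fetch_rendered_html(url), "html.parser"). The rest of the parsing code could stay as it is.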
