Comparing results with Beautiful Soup in Python

I've got the following code that runs a filtered search on an auction site. I can display the title of each result and also the len of all returned results:

from bs4 import BeautifulSoup
import requests


url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))

for listing in listings:
    print(listing.text)

This prints out the following:

#print(len(listings))
3 

#for listing in listings:
#    print(listing.text)

 PRS. Ten Top Custom 24, faded Denim, Piezo. 
 PRS SE CUSTOM 22 
 PRS Tremonti SE *With Seymour Duncan Pickups* 

I know what I want to do next, but don't know how to code it. Basically I want to display only new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing it with the len from another GET request (a second variable) that maybe runs first thing in the morning. Alternatively, I could compare the text values instead of the len; if they don't match, show the new listings. Is there a better or different way to do this? Any help appreciated, thank you

CodePudding user response:

With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
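For reference, a bare-bones sketch of that text-comparison idea (persisting the titles between runs in a placeholder file, prev_titles.txt, and reusing your listings variable) could look like the below, but it still has the problems mentioned above:

import os

# sketch only: remember the titles from the last run in a plain text file
# ("prev_titles.txt" is just a placeholder name) and report anything not seen before
titles = [listing.text.strip() for listing in listings]

prev_titles = []
if os.path.exists('prev_titles.txt'):
    with open('prev_titles.txt', encoding='utf-8') as f:
        prev_titles = f.read().splitlines()

for title in titles:
    if title not in prev_titles:
        print('New listing:', title)

with open('prev_titles.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))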

I can suggest 3 other methods. (The 3rd uses my preferred approach.)


  1. Closing time

A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing

from datetime import date
import dateutil.parser 

def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid':'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days

and then filter by the returned value

min_dtc = 3 # or as preferred

# your current code up to listings = soup.findAll....

new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')

for listing in new_listings: print(listing.text)

However, I don't know if sellers are allowed to set their own closing times or whether closings are set at a fixed offset; also, I don't see the closing-time text when inspecting with the browser dev tools [even though I could extract it with the code above], which makes me a bit unsure whether it's always available.


  2. JSON list of Listing IDs

Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save that in a list as a JSON file and keep checking against it every new scrape

from bs4 import BeautifulSoup
import requests
import json

lFilename = 'listing_ids.json' # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')

soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if 
    l.get('href').split('/listing/')[1].split('?')[0] 
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')

for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)

with open(lFilename, 'w') as f: 
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0] 
        for l in new_listings
    ], f)

As long as they don't tend to recycle the listing IDs, this should be fairly reliable. (Even then, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big...)


  3. CSV Logging [including Listing IDs]

Instead of just saving the IDs, you can save pretty much all the details from each result

from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas

lFilename = 'listings.csv' # or as preferred
max_days = 60 # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")

try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')

def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}

    for k, sel in selectorsRef.items(): 
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text

    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()

    return lDets

soup = BeautifulSoup(url.text, "html.parser")

listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')

for listing in new_listings: print(listing['title'])

prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)

You'll end up with a spreadsheet of scraping history/log that you can check anytime, and depending on what you set max_days to, the oldest data will be automatically cleared.
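And if you want to check the log programmatically rather than opening it in a spreadsheet app, something like this (assuming the same listings.csv filename) would show the new listings from the most recent scrape:

import pandas

# load the scraping log and show only the rows from the latest scrape flagged as new
log = pandas.read_csv('listings.csv')
latest = log['last_scraped'].max()
print(log[(log['last_scraped'] == latest) & log['new_listing']][['listing_id', 'title']])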

CodePudding user response:

Fixed it with the following:

allGuitars = ["",]

latestGuitar = soup.select("#-title")[0].text.strip()

 if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: "   allGuitars[0])
  elif not latestGuitar in allGuitars[0]:
    print("New listing detected! - "   latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)
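Since allGuitars only lives for as long as the script runs, here's just a sketch of how it could run continuously, re-fetching the page each time (the 15-minute interval is arbitrary):

import time
# assumes the requests / BeautifulSoup imports from the original script

allGuitars = ["",]

while True:
    # re-fetch the search results on every pass
    url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
    soup = BeautifulSoup(url.text, "html.parser")
    latestGuitar = soup.select("#-title")[0].text.strip()

    if latestGuitar in allGuitars[0]:
        print("No change. The latest listing is still: " + allGuitars[0])
    else:
        print("New listing detected! - " + latestGuitar)
        allGuitars.clear()
        allGuitars.insert(0, latestGuitar)

    time.sleep(15 * 60)  # arbitrary 15-minute wait between checks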