Scraping Book rank from Amazon Book page — optimizing code


I developed a function to get the book rank from an Amazon book page, but I'm not entirely satisfied with it. It would be great to know if it can be optimised to collect the rank more efficiently (by efficient I mean something like an "if string contains" check, though I haven't been able to get that approach working). Please find the code below:

from urllib.request import urlopen, Request

from bs4 import BeautifulSoup


def scrap_rank_amz(link):
    # Step 1 — Get URL and content
    request = Request(link, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request)

    # Step 2 — Create BeautifulSoup Instance to get elements from HTML
    soup = BeautifulSoup(html, "html.parser")

    # Step 3 — Collect the info from the block where we will find the rank
    soup_rank = soup.find_all(
        class_="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list",
        limit=2,
    )

    # Step 4 — Obtain specifically the items where we'll find the rank
    soup_rank_detail = [i.find(class_="a-list-item").get_text(" ") for i in soup_rank]

    # Step 5 — Obtain the rank
    soup_rank_detail_lv2 = soup_rank_detail[1][24:32]

    # Step 6 — Return rank value
    return soup_rank_detail_lv2

You can find an example of a link to be used as follows: https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3
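For reference, the "if string contains" idea I had in mind would look something like this sketch, assuming the detail-bullet texts have already been collected as a list of strings (the bullet values here are made up for illustration):

```python
def find_rank_line(lines):
    # Return the first bullet mentioning the sellers rank, if any
    for line in lines:
        if "Best Sellers Rank" in line:
            return line.strip()
    return None

bullets = [
    "Publisher : Example Press",
    "Best Sellers Rank: #43,847 in Kindle Store",
]
print(find_rank_line(bullets))  # Best Sellers Rank: #43,847 in Kindle Store
```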

Thanks a lot for your time!

Sara

CodePudding user response:

Since your question is a bit unclear, the following output extracts the rank according to the webpage you linked.

Example:

from bs4 import BeautifulSoup
import requests 

cookies = {'cookie': 'session-id=131-6404291-8225156'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}
url = 'https://www.amazon.com/dp/B078SZLXB3'
req = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(req.text, "lxml")

# Locate the span containing "Best Sellers Rank:" and read its parent's text
r = soup.select_one('span:-soup-contains("Best Sellers Rank:")').parent
t = r.text
# Each rank entry starts with '#', so splitting on it yields one item per category
rank = t.split('#')[1:]
print(rank)

Output:

['43,847 in Kindle Store (See Top 100 in Kindle Store)   ', '296 in Vampire Mysteries  ', '323 in Werewolf & Shifter Mysteries  ', '432 in Cozy Culinary Mystery  ']
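If you need the ranks as structured data rather than raw strings, the split output above can be post-processed. A minimal sketch, assuming the entry format shown in the output (`parse_ranks` is a helper name I'm introducing here):

```python
import re

def parse_ranks(raw_ranks):
    """Turn strings like '296 in Vampire Mysteries  ' into (rank, category) pairs."""
    parsed = []
    for entry in raw_ranks:
        # Capture the leading number (with thousands separators) and the category
        # name, dropping any trailing '(See Top 100 ...)' note
        m = re.match(r"([\d,]+) in (.+?)(?:\s*\(See Top.*\))?\s*$", entry)
        if m:
            rank = int(m.group(1).replace(",", ""))
            parsed.append((rank, m.group(2).strip()))
    return parsed

raw = [
    '43,847 in Kindle Store (See Top 100 in Kindle Store)   ',
    '296 in Vampire Mysteries  ',
]
print(parse_ranks(raw))  # [(43847, 'Kindle Store'), (296, 'Vampire Mysteries')]
```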

CodePudding user response:

Assuming all the pages you want to scrape have a Kindle ranking, you can use a simple regex.

import re

from typing import Optional
from urllib.request import urlopen, Request

def get_kindle_rank(link: str) -> Optional[str]:
    request = Request(link, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request).read().decode()
    
    regexp_match = re.search(r"#([\d,]+) in Kindle", html)
    if regexp_match:
        return regexp_match.group(1)
    else:
        return None

get_kindle_rank("https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3")
# '43,847'

However, it won't be much faster, as most of the runtime is spent on the request itself rather than on parsing the text.
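Since the bottleneck is network I/O, scraping several pages can at least be overlapped with a thread pool. A sketch, assuming a `fetch_rank` function in the spirit of `get_kindle_rank` above (stubbed here so the example is self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_rank(link):
    # Stub standing in for get_kindle_rank(link); a real version would
    # perform the HTTP request and regex search shown above
    return f"rank for {link}"

def fetch_ranks(links, max_workers=8):
    # Threads overlap the network waits, so total time is closer to the
    # slowest single request than to the sum of all of them
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_rank, links))

print(fetch_ranks(["https://www.amazon.com/dp/B078SZLXB3"]))
```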
