I developed a function to get the book rank from an Amazon book page, but I'm not entirely satisfied with it. It would be great to know if this can be optimised to collect the rank in a more efficient way (by efficient I mean perhaps something like an "if string contains" check, though I have not been successful with that). Please find the code below:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

def scrap_rank_amz(link):
    # Step 1 — Get URL and content
    url = link
    request = Request(url, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request)
    # Step 2 — Create BeautifulSoup instance to get elements from HTML
    soup = BeautifulSoup(html, "html.parser")
    # Step 3 — Collect the info from the block where we will find the rank
    soup_rank = soup.find_all(
        class_="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list",
        limit=2,
    )
    # Step 4 — Obtain specifically the items where we'll find the rank
    soup_rank_detail = [i.find(class_="a-list-item").get_text(" ") for i in soup_rank]
    # Step 5 — Obtain the rank (hard-coded slice of the "Best Sellers Rank" text)
    soup_rank_detail_lv2 = soup_rank_detail[1][24:32]
    # Step 6 — Return rank value
    return soup_rank_detail_lv2
Here is an example of a link that can be used: https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3
Thanks a lot for your time!
Sara
CodePudding user response:
Your question is a bit unclear, so the following output extracts the rank from the webpage that you linked.
Example:
from bs4 import BeautifulSoup
import requests

cookies = {'cookie': 'session-id=131-6404291-8225156'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}

url = 'https://www.amazon.com/dp/B078SZLXB3'
req = requests.get(url, headers=headers, cookies=cookies)
print(req)  # should show <Response [200]> if the request succeeded

soup = BeautifulSoup(req.text, "lxml")
# Find the <span> that contains "Best Sellers Rank:" and take its parent element
r = soup.select_one('span:-soup-contains("Best Sellers Rank:")').parent
t = r.text
# Each rank entry starts with '#', so split on it and drop the leading label
rank = t.split('#')[1:]
print(rank)
Output:
['43,847 in Kindle Store (See Top 100 in Kindle Store) ', '296 in Vampire Mysteries ', '323 in Werewolf & Shifter Mysteries ', '432 in Cozy Culinary Mystery ']
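If you want the ranks as numbers keyed by category rather than raw strings, you could post-process that list. A minimal sketch, assuming each entry keeps the '<number> in <category>' shape shown in the output above:

# Sketch only: assumes entries look like '43,847 in Kindle Store (See Top 100 in Kindle Store)'
parsed = {}
for entry in rank:
    number, _, category = entry.partition(' in ')
    # Strip thousands separators and trailing text such as '(See Top 100 ...)'
    parsed[category.split(' (')[0].strip()] = int(number.replace(',', ''))
print(parsed)
# e.g. {'Kindle Store': 43847, 'Vampire Mysteries': 296, ...}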
CodePudding user response:
Let's say all the pages you want to scrape have a Kindle ranking; in that case you can use a simple regexp.
import re
from typing import Optional
from urllib.request import urlopen, Request

def get_kindle_rank(link: str) -> Optional[str]:
    request = Request(link, headers={"User-agent": "Mozilla/5.0"})
    html = urlopen(request).read().decode()
    regexp_match = re.search(r"#([\d,]+) in Kindle", html)
    if regexp_match:
        return regexp_match.group(1)
    else:
        return None

get_kindle_rank("https://www.amazon.com/Moonshine-Magic-Southern-Charms-Mystery-ebook/dp/B078SZLXB3")
# '43,847'
However, it won't be much faster as most of the runtime will be spent on the request itself and not the text parsing.
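If you need to scrape many pages, one way to hide that network latency is to fetch several links concurrently. A rough sketch using a thread pool, reusing get_kindle_rank from above (the link list is just a placeholder):

from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of product pages to look up
links = [
    "https://www.amazon.com/dp/B078SZLXB3",
    # ... more links ...
]

# Fetch pages in parallel; results come back in the same order as links
with ThreadPoolExecutor(max_workers=5) as pool:
    ranks = list(pool.map(get_kindle_rank, links))

print(ranks)  # one rank string (or None) per link

Keep the worker count modest, since Amazon may throttle or block bursts of requests from the same IP.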