I have been practicing web scraping with Beautiful Soup, but every time I switch to a different website, the tag structure is so different that it confuses me. This time I am trying to scrape the Amazon best sellers page.
My idea is to find the "main" tag for each item and then dig into the tag that holds the information I want. So I used .select() and started with the li class. But when I add more tags after "span.a-list-item", I get an empty result with the following code:
container = page.select('li.zg-item-immersion > span.a-list-item > div.a-section a-spacing-none aok-relative' )
Is there a limit on how many tags I can put into .select(), or am I doing something wrong?
So I stopped at "span.a-list-item" and tried the following approach, but I don't understand why my code sometimes returns an empty result and sometimes returns what I want. I guess this is related to the connection to the website?
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
page = BeautifulSoup(requests.get(url).content,'lxml')
containers = page.select('li.zg-item-immersion > span.a-list-item')
ranking = (containers[1].find("span",class_="zg-badge-text").text)[1:]
With the last line I was able to get the ranking number successfully, but when I try to append the rankings to a list in a loop,
for item in range(50):
    ranking.append((containers[item].find("span",class_="zg-badge-text").text)[1:])
I keep getting a "list index out of range" error, and I don't understand why the index is out of range when there are 50 items on a single page.
Last but not least, could I get some advice on learning to scrape different websites? I have read the Beautiful Soup documentation and followed its instructions on using the different functions to reach a specific tag, but I am still not getting what I want.
CodePudding user response:
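Loop over the selected containers directly instead of indexing them with range(50); that way the loop never reaches past the end of the list when fewer than 50 items come back. You also need to send a User-Agent header with the request; without one, Amazon may return a page that lacks the expected items, which would explain the intermittent empty results.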
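To illustrate why indexing with range(50) can fail while iterating over the list cannot, here is a minimal offline sketch. The HTML snippet and the helper name extract_rankings are made up for illustration (they only mimic the class names from the question), and html.parser is used instead of lxml so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page: only 3 items, one without a badge.
SAMPLE_HTML = """
<ol>
  <li class="zg-item-immersion"><span class="a-list-item"><span class="zg-badge-text">#1</span></span></li>
  <li class="zg-item-immersion"><span class="a-list-item"><span class="zg-badge-text">#2</span></span></li>
  <li class="zg-item-immersion"><span class="a-list-item">no badge here</span></li>
</ol>
"""


def extract_rankings(html):
    """Return ranking numbers (without the leading '#') for items that have a badge."""
    page = BeautifulSoup(html, "html.parser")
    containers = page.select("li.zg-item-immersion > span.a-list-item")
    rankings = []
    # Iterating over the list itself can never index past its end,
    # unlike containers[item] inside "for item in range(50)".
    for container in containers:
        badge = container.find("span", class_="zg-badge-text")
        if badge is None:
            # find() returns None when the tag is missing; skip instead of crashing.
            continue
        rankings.append(badge.text.strip()[1:])
    return rankings


rankings = extract_rankings(SAMPLE_HTML)
print(rankings)
```

If the response contains no matching items at all, the function simply returns an empty list, which is easier to detect and retry than an IndexError raised in the middle of a loop.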
Code:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
# Send a browser-like User-Agent with the request
r = requests.get(url, headers=headers)
page = BeautifulSoup(r.content, 'lxml')
containers = page.select('li.zg-item-immersion > span.a-list-item')
# Iterate over whatever was found instead of assuming 50 items
for container in containers:
    ranking = container.find("span", class_="zg-badge-text").text
    print(ranking)
Output:
#1
#2
#3
#4
#5
#6
#7
#8
#9
#10
#11
#12
#13
#14
#15
#16
#17
#18
#19
#20
#21
#22
#23
#24
#25
#26
#27
#28
#29
#30
#31
#32
#33
#34
#35
#36
#37
#38
#39
#40
#41
#42
#43
#44
#45
#46
#47
#48
#49
#50
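On the .select() part of the question: there is no limit on how many tags can be chained. The likely problem is the spaces in 'div.a-section a-spacing-none aok-relative'. In a CSS selector a space means "descendant of", so the selector looks for tags named a-spacing-none and aok-relative instead of a div with three classes. A minimal sketch on a made-up HTML snippet (assumed to resemble the real markup, parsed with html.parser so it runs offline):

```python
from bs4 import BeautifulSoup

# Made-up HTML that only mimics the class names from the question.
SAMPLE_HTML = """
<li class="zg-item-immersion">
  <span class="a-list-item">
    <div class="a-section a-spacing-none aok-relative">item body</div>
  </span>
</li>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Spaces are descendant combinators: this looks for an <aok-relative> tag inside
# an <a-spacing-none> tag inside div.a-section, and no such tags exist.
broken = page.select(
    "li.zg-item-immersion > span.a-list-item > div.a-section a-spacing-none aok-relative"
)

# Chaining the classes with dots matches one <div> that carries all three classes.
fixed = page.select(
    "li.zg-item-immersion > span.a-list-item > div.a-section.a-spacing-none.aok-relative"
)

print(len(broken), len(fixed))
```

Selecting on a single distinctive class, for example div.aok-relative, would usually work as well and is less likely to break when the page adds or removes a styling class.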