How to extract content from tags with Beautiful Soup

I have been practicing web scraping with Beautiful Soup, but every time I switch to a different website the tag structure is so different that it confuses me. This time I am trying to scrape the Amazon best sellers site.

My idea is to find the "main" tag for each item and then dig into the tag that holds the information I want. So I used .select() and started with the li class. But when I add more tags after "span.a-list-item", I get an empty result with the following code:

container = page.select('li.zg-item-immersion > span.a-list-item > div.a-section a-spacing-none aok-relative' )

Is there a limit on the number of tags I can put into .select(), or am I doing something wrong?
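
As far as I know, .select() has no limit on the number of tags. A likely cause of the empty result is the last part of the selector: in CSS a space means "descendant", so `div.a-section a-spacing-none aok-relative` looks for elements named `a-spacing-none` and `aok-relative` nested inside the div. When one element carries several classes, the classes are chained with dots. A minimal sketch on a made-up HTML fragment (the markup is an assumption modelled on the class names in the question, not Amazon's actual page, and `html.parser` is used only to avoid an extra dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that reuses the class names from the question.
SAMPLE_HTML = """
<ol>
  <li class="zg-item-immersion">
    <span class="a-list-item">
      <div class="a-section a-spacing-none aok-relative">
        <span class="zg-badge-text">#1</span>
      </div>
    </span>
  </li>
</ol>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Spaces mean "descendant element", so this searches for <a-spacing-none> tags.
broken = page.select(
    "li.zg-item-immersion > span.a-list-item > "
    "div.a-section a-spacing-none aok-relative"
)

# Chaining the classes with dots matches one div that has all three classes.
fixed = page.select(
    "li.zg-item-immersion > span.a-list-item > "
    "div.a-section.a-spacing-none.aok-relative"
)

print(len(broken), len(fixed))  # 0 1
```

On the real page the class list may differ from run to run, so matching on a single stable class (for example `div.a-section`) is often less brittle than listing all of them.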

So I stopped at "span.a-list-item" and tried the following approach, but I don't understand why my code sometimes gives me an empty result and sometimes returns what I want. I guess this is related to the connection to the website?

from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
page = BeautifulSoup(requests.get(url).content,'lxml')    
containers = page.select('li.zg-item-immersion > span.a-list-item')
ranking = (containers[1].find("span",class_="zg-badge-text").text)[1:]

The last line gets the ranking number successfully, but when I try to append the rankings to a list in a loop,

for item in range(50):
   ranking.append((containers[item].find("span",class_="zg-badge-text").text)[1:])

I keep getting a "list index out of range" error. I don't understand why the index is out of range, since there are 50 items on a single page.
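
One way to check this is to compare `len(containers)` with the 50 that the loop assumes: if the response contained fewer matching `li` elements (or none), `containers[item]` runs past the end of the list. Note also that `ranking` in the earlier snippet is a string, so the results need a separate list. A rough sketch on a hypothetical three-item fragment, where `rankings` is an illustrative name and `html.parser` is used only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with only three items, standing in for a short page.
SAMPLE_HTML = "".join(
    '<li class="zg-item-immersion"><span class="a-list-item">'
    f'<span class="zg-badge-text">#{n}</span></span></li>'
    for n in range(1, 4)
)

page = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = page.select("li.zg-item-immersion > span.a-list-item")
print(len(containers))  # how many items were actually matched

rankings = []  # a separate list; `ranking` in the question was a string
for container in containers:  # iterate over what exists, not range(50)
    badge = container.find("span", class_="zg-badge-text")
    if badge is not None:  # skip items that have no ranking badge
        rankings.append(badge.text[1:])  # drop the leading "#"

print(rankings)
```

Looping over `containers` directly means the code never indexes an element that was not found, whatever the page returned.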

Last but not least, could I please get some advice on learning to scrape different websites? I have also read the Beautiful Soup documentation and followed its instructions on using the different functions to reach a specific tag, but I am still not getting what I want.

Answer:

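The loop most likely fails because it indexes a fixed range(50) regardless of how many containers were actually matched. Iterate over the containers directly and read the text of each one instead. You also need to send a User-Agent header with the request. Without it, Amazon may return a page that lacks the expected markup, which would explain why the result is sometimes empty.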

Code:

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent; the default one used by requests may be rejected.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
r = requests.get(url, headers=headers)
page = BeautifulSoup(r.content, 'lxml')

# Loop over the matched containers directly instead of a fixed range(50).
containers = page.select('li.zg-item-immersion > span.a-list-item')
for container in containers:
    ranking = container.find("span", class_="zg-badge-text").text
    print(ranking)

Output:

#1
#2 
#3 
#4 
#5 
#6 
#7 
#8 
#9 
#10
#11
#12
#13
#14
#15
#16
#17
#18
#19
#20
#21
#22
#23
#24
#25
#26
#27
#28
#29
#30
#31
#32
#33
#34
#35
#36
#37
#38
#39
#40
#41
#42
#43
#44
#45
#46
#47
#48
#49
#50
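
If the result is still empty on some runs, it can help to separate fetching from parsing and to fail loudly when the expected markup is missing, since a block or CAPTCHA page may come back with different markup. A minimal sketch, assuming the same selectors as in the code above; `fetch_html` and `parse_rankings` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and `html.parser` is used only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup
import requests

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
    )
}


def fetch_html(url):
    """Download a page and raise on HTTP error status codes."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def parse_rankings(html):
    """Return the ranking numbers as strings, without the leading '#'."""
    page = BeautifulSoup(html, "html.parser")
    containers = page.select("li.zg-item-immersion > span.a-list-item")
    if not containers:
        # Nothing matched, so this is probably not the page we expected.
        raise ValueError("no list items found; the response may be a block page")
    rankings = []
    for container in containers:
        badge = container.find("span", class_="zg-badge-text")
        if badge is not None:  # skip items that have no ranking badge
            rankings.append(badge.text.strip().lstrip("#"))
    return rankings
```

With these helpers, `parse_rankings(fetch_html(url))` either returns the list of rankings or raises an error that says which step went wrong, instead of silently producing an empty list.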