How to extract content from tags with Beautiful Soup

Time:10-17

I have been practicing web scraping with Beautiful Soup, but every time I switch to a different website, the tag structure is so different that it really confuses me. This time I am trying to scrape the Amazon Best Sellers page.

My idea is to find the "main" tag for each item and then dig into the tag that holds the information I want. So I used .select() and started with the "li" class. But when I add more tags after "span.a-list-item", I get an empty result with the following code:

container = page.select('li.zg-item-immersion > span.a-list-item > div.a-section a-spacing-none aok-relative' )

Is there a limit on the number of tags I can put into .select(), or am I doing something wrong?

So I stopped at "span.a-list-item" and tried the following approach, but I don't understand why my code sometimes gives me an empty result and sometimes returns what I want. I guess this is related to the connection to the website?

from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
page = BeautifulSoup(requests.get(url).content,'lxml')    
containers = page.select('li.zg-item-immersion > span.a-list-item')
ranking = (containers[1].find("span",class_="zg-badge-text").text)[1:]

The last line gets the ranking number successfully, but when I try to append the rankings to a list with a loop,

for item in range(50):
   ranking.append((containers[item].find("span",class_="zg-badge-text").text)[1:])

I keep getting a "list index out of range" error, and I don't understand why the index is out of range, as there are 50 items on a single page.

Last but not least, can I please get some advice on learning to scrape different websites? I have also read the Beautiful Soup documentation and followed its instructions on using the different functions to reach a specific tag, but I am still not getting what I want.

CodePudding user response:

Actually, loop over the selected containers themselves rather than over range(50), and read the text of each one inside the loop. You also need to send a User-Agent in the request headers.
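
On the .select() question: as far as I know, there is no limit on how many tags a selector can chain. The likely problem is the last step. In a CSS selector a space means "descendant", so 'div.a-section a-spacing-none aok-relative' looks for <a-spacing-none> and <aok-relative> tags inside the div, and no such tags exist. Several classes on one element are chained with dots instead. A minimal sketch on a made-up HTML fragment (the markup is an assumption that only mimics the class names from the question, not the real Amazon page, and html.parser is used only so the sketch runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that only mimics the class names from the question.
SAMPLE_HTML = """
<ol>
  <li class="zg-item-immersion">
    <span class="a-list-item">
      <div class="a-section a-spacing-none aok-relative">
        <span class="zg-badge-text">#1</span>
      </div>
    </span>
  </li>
</ol>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Spaces are descendant combinators, so this searches for <a-spacing-none>
# and <aok-relative> tags, which do not exist: nothing is matched.
broken = page.select(
    "li.zg-item-immersion > span.a-list-item > div.a-section a-spacing-none aok-relative"
)

# Chaining the classes with dots matches the one <div> that carries all three.
fixed = page.select(
    "li.zg-item-immersion > span.a-list-item > div.a-section.a-spacing-none.aok-relative"
)

print(len(broken), len(fixed))
```

Here the first selector matches nothing and the second matches the div, which is consistent with the empty result described in the question.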

Code:

from bs4 import BeautifulSoup
import requests

# Browser-like User-Agent header sent with the request
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
url = "https://www.amazon.com/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_unv_mas_1_9408444011_1"
r = requests.get(url, headers=headers)
page = BeautifulSoup(r.content, 'lxml')

# One container per best-seller item
containers = page.select('li.zg-item-immersion > span.a-list-item')
for container in containers:
    # The badge text holds the rank, e.g. "#1"
    ranking = container.find("span", class_="zg-badge-text").text
    print(ranking)

Output:

#1
#2 
#3 
#4 
#5 
#6 
#7 
#8 
#9 
#10
#11
#12
#13
#14
#15
#16
#17
#18
#19
#20
#21
#22
#23
#24
#25
#26
#27
#28
#29
#30
#31
#32
#33
#34
#35
#36
#37
#38
#39
#40
#41
#42
#43
#44
#45
#46
#47
#48
#49
#50
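
To collect the rankings into a list, as the loop in the question tried to do, one option is to start from an empty list and iterate over whatever containers were actually found, so the loop cannot run past the end when a response contains fewer than 50 items. This is a sketch under assumptions: extract_rankings is a hypothetical helper name, the HTML is a made-up fragment rather than the real page, and html.parser stands in for lxml so it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for a downloaded page; the second item
# deliberately lacks a badge to exercise the None check below.
SAMPLE_HTML = """
<ol>
  <li class="zg-item-immersion"><span class="a-list-item">
    <span class="zg-badge-text">#1</span></span></li>
  <li class="zg-item-immersion"><span class="a-list-item">
    <span class="no-badge">ad</span></span></li>
  <li class="zg-item-immersion"><span class="a-list-item">
    <span class="zg-badge-text">#2 </span></span></li>
</ol>
"""


def extract_rankings(html):
    """Return the rank numbers found in html as a list of strings."""
    page = BeautifulSoup(html, "html.parser")
    containers = page.select("li.zg-item-immersion > span.a-list-item")
    rankings = []  # must be a list; a str has no .append()
    for container in containers:  # iterating never indexes past the end
        badge = container.find("span", class_="zg-badge-text")
        if badge is None:  # find() returns None when nothing matches
            continue
        # get_text(strip=True) trims whitespace; lstrip drops the leading "#"
        rankings.append(badge.get_text(strip=True).lstrip("#"))
    return rankings


print(extract_rankings(SAMPLE_HTML))
print(extract_rankings("<html><body>blocked</body></html>"))
```

An empty list from a live request suggests the response was not the expected page, so checking the number of containers before indexing into them is a cheap sanity check.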