Can't find all div classes using BeautifulSoup in Python

Time:10-26

I'm trying to extract data from a website. In the browser's developer tools, I can see that the data I'm interested in is held in multiple elements that all share the same class name (flyers_flyer-col__ZN-6Z).

I want to loop through each of these items and extract information, specifically the aria-label and the target href. When I try, I only seem to get the first item, and I'm not sure how to loop through all of them.

Here is the code I've tried:

for flyers in soup.find_all("div",class_='flyers_flyer-col__ZN_6Z'):
    links = flyers.find_all("a",href=True)
    for flyer in flyers:
        print(flyer['href'])

However, this only gives me the results from the first element with the flyers_flyer-col__ZN-6Z class. How can I get the rest?

Answer:

The page is built dynamically by JavaScript from data embedded in a script tag. Here is one way to get that data, using Requests:

import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd

# Show every column and the full cell contents when printing the DataFrame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

url = 'https://www.reebee.com/flyers?categoryID=2'
r = requests.get(url)
soup = bs(r.text, 'html.parser')
# The flyer data is embedded as JSON in the <script id="__NEXT_DATA__"> tag
json_obj = json.loads(soup.select_one('script[id="__NEXT_DATA__"]').text)
# Flatten the list of flyer records into one row per flyer
df = pd.json_normalize(json_obj['props']['pageProps']['flyerList'])
print(df)

Results in terminal:

flyerID numberOfPages   dateValid   dateExpired priority    resetVersion    statusID    flyerTypeID flyerVersion    cycleID cycleDescriptionEn  cycleDescriptionFr  languageID  category    asset   store.storeName store.storeID   store.asset
0   1485954 17  2022-10-20  2022-10-26  422 0   2   1   10  113242  Weekly Flyer    Circulaire hebdomadaire 0   [{'categoryID': 1}, {'categoryID': 4}, {'categoryID': 5}]   [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 2, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/2diio0ujinwgsscoggk8oscgk/8d498e2ac387dda58d7577041341ead1_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 108}, {'width': 121, 'height': 161}, {'width': 145, 'height': 193}, {'width': 189, 'height': 252}, {'width': 209, 'height': 279}, {'width': 284, 'height': 379}, {'width': 291, 'height': 388}, {'width': 314, 'height': 419}, {'width': 388, 'height': 517}]}]}]    The Home Depot  10028   [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 3, 'url': 'https://reebee-assets.azureedge.net/reebee-store-assets/asset/b2c1c3505fe0f9dd8dc2ea5431158bfa', 'contentType': [{'extension': '.webp', 'type': 'image/webp'}]}]
1   1487388 4   2022-10-25  2022-11-21  1805    0   2   1   7   113698  Transform Any Recipe    NaN 0   [{'categoryID': 1}, {'categoryID': 2}]  [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/cytg9ubx108c80owo84sogs8c/055f92a74153ee5d259b4ca8c5764f03_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 108}, {'width': 121, 'height': 161}, {'width': 145, 'height': 193}, {'width': 189, 'height': 252}, {'width': 209, 'height': 279}, {'width': 284, 'height': 379}, {'width': 291, 'height': 388}, {'width': 314, 'height': 419}, {'width': 388, 'height': 517}]}]}]    VH  13578   [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-store-assets/338cb8d957621a8e05f7a28307c646a4_sl<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 102, 'height': 102}, {'width': 120, 'height': 120}, {'width': 150, 'height': 150}, {'width': 200, 'height': 200}]}]}]
2   1486725 15  2022-10-21  2022-10-27  1807    0   2   1   6   113458  Weekly Flyer    Circulaire hebdomadaire 0   [{'categoryID': 1}, {'categoryID': 3}]  [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/5cx8mbrucvc480oo00kccgkow/27d7c7fcb22f2500cf95ab4585b8be29_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 83}, {'width': 121, 'height': 124}, {'width': 145, 'height': 148}, {'width': 189, 'height': 193}, {'width': 209, 'height': 214}, {'width': 284, 'height': 291}, {'width': 291, 'height': 298}, {'width': 314, 'height': 321}, {'width': 388, 'height': 397}]}]}] 2001 Audio Video    10219   [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-store-assets/4a39eed200ea70f5640b8fadd16955d3_sl<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 102, 'height': 102}]}]}]
3   1483100 4   2022-10-03  2022-10-30  1808    0   2   1   9   112307  October Savings Économies d'octobre 0   [{'categoryID': 1}, {'categoryID': 10}] [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/3o3awyoez3uowkgkkccsg0cw8/d232f306202435180ee80ccc50cc7cc4_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 106}, {'width': 121, 'height': 159}, {'width': 145, 'height': 190}, {'width': 189, 'height': 248}, {'width': 209, 'height': 274}, {'width': 284, 'height': 372}, {'width': 291, 'height': 381}, {'width': 314, 'height': 411}, {'width': 388, 'height': 508}]}]}]    PetSmart    13189   [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 1, 'url': 'https://reebee-assets.azureedge.net/reebee-store-assets/asset/1e1465c11faff57018aefe7a610d12a2', 'contentType': [{'extension': '.webp', 'type': 'image/webp'}]}]
4   1486315 83  2022-10-25  2022-11-06  1817    0   2   1   9   113361  Two-Week Flyer  Circulaire de deux semaines 0   [{'categoryID': 1}, {'categoryID': 3}, {'categoryID': 4}, {'categoryID': 5}]    [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/eeqpblbrj3cogo8cooc048w48/71091c68f1258d46c676b9305ea48ee9_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 108}, {'width': 121, 'height': 161}, {'width': 145, 'height': 193}, {'width': 189, 'height': 251}, {'width': 209, 'height': 278}, {'width': 284, 'height': 377}, {'width': 291, 'height': 386}, {'width': 314, 'height': 417}, {'width': 388, 'height': 515}]}]}]    Princess Auto   10056   [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 2, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-store-assets/f8078d1e1744abbfd1bd61b4da4fb2c0_sl<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 102, 'height': 102}]}]}]

You can drill down further into that JSON object; see the pandas documentation: https://pandas.pydata.org/docs/
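
If only a few fields are needed, one possible way to drill down is to walk the parsed JSON directly with plain dict access. The following is a minimal sketch, not a definitive implementation: the sample list is a trimmed stand-in for json_obj['props']['pageProps']['flyerList'] with values copied from the output above, the key names are inferred from the column names shown there, and summarize_flyers is a hypothetical helper name.

```python
# Trimmed sample shaped like json_obj['props']['pageProps']['flyerList'].
# Values are copied from the output above; most fields are omitted.
flyer_list = [
    {
        "flyerID": 1485954,
        "dateValid": "2022-10-20",
        "dateExpired": "2022-10-26",
        "asset": [
            {
                "type": "flyerAsset",
                "url": "https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/"
                       "2diio0ujinwgsscoggk8oscgk/"
                       "8d498e2ac387dda58d7577041341ead1_t<width>x<height>",
            }
        ],
        "store": {"storeName": "The Home Depot", "storeID": 10028},
    },
    {
        "flyerID": 1487388,
        "dateValid": "2022-10-25",
        "dateExpired": "2022-11-21",
        "asset": [
            {
                "type": "flyerAsset",
                "url": "https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/"
                       "cytg9ubx108c80owo84sogs8c/"
                       "055f92a74153ee5d259b4ca8c5764f03_t<width>x<height>",
            }
        ],
        "store": {"storeName": "VH", "storeID": 13578},
    },
]


def summarize_flyers(flyers):
    """Return one flat dict per flyer with a few selected fields.

    Hypothetical helper: missing keys yield None instead of raising.
    """
    rows = []
    for flyer in flyers:
        store = flyer.get("store", {})
        assets = flyer.get("asset") or []
        rows.append({
            "flyerID": flyer.get("flyerID"),
            "storeName": store.get("storeName"),
            "dateValid": flyer.get("dateValid"),
            "dateExpired": flyer.get("dateExpired"),
            # Take the first asset's URL template, if the flyer has any asset
            "assetURL": assets[0].get("url") if assets else None,
        })
    return rows


for row in summarize_flyers(flyer_list):
    print(row["flyerID"], row["storeName"], row["dateValid"], row["dateExpired"])
```

With the live page, the same helper could be called on json_obj['props']['pageProps']['flyerList'], assuming the structure still matches the output above. The resulting list of dicts can also be passed to pd.DataFrame if a table is preferred.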
