I'm trying to extract data from a website. Using the browser's developer tools, I can see that the data I'm interested in is held in multiple elements that all share the same class name (flyers_flyer-col__ZN-6Z).
I want to loop through each of these items and extract information from them, specifically the aria-label and the target href. When I try, I only seem to get the first item, and I'm not sure how to loop through all of them.
Here is the code I've tried:
for flyers in soup.find_all("div", class_='flyers_flyer-col__ZN_6Z'):
    links = flyers.find_all("a", href=True)
    for flyer in flyers:
        print(flyer['href'])
However, this only gives me the results from the very first match of the flyers_flyer-col__ZN-6Z class. How can I get the rest?
CodePudding user response:
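The page is built dynamically by JavaScript from data embedded in a script tag (the one with id __NEXT_DATA__), so the elements you see in developer tools are not necessarily all present in the HTML that Requests downloads. Here is one way to get that data, using Requests: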
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd

# Show every column and the full cell contents when printing the DataFrame
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

url = 'https://www.reebee.com/flyers?categoryID=2'
r = requests.get(url)
soup = bs(r.text, 'html.parser')

# The page data is stored as JSON inside the <script id="__NEXT_DATA__"> tag
json_obj = json.loads(soup.select_one('script[id="__NEXT_DATA__"]').text)

# Flatten the list of flyers into one row per flyer
df = pd.json_normalize(json_obj['props']['pageProps']['flyerList'])
print(df)
Results in terminal:
flyerID numberOfPages dateValid dateExpired priority resetVersion statusID flyerTypeID flyerVersion cycleID cycleDescriptionEn cycleDescriptionFr languageID category asset store.storeName store.storeID store.asset
0 1485954 17 2022-10-20 2022-10-26 422 0 2 1 10 113242 Weekly Flyer Circulaire hebdomadaire 0 [{'categoryID': 1}, {'categoryID': 4}, {'categoryID': 5}] [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 2, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/2diio0ujinwgsscoggk8oscgk/8d498e2ac387dda58d7577041341ead1_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 108}, {'width': 121, 'height': 161}, {'width': 145, 'height': 193}, {'width': 189, 'height': 252}, {'width': 209, 'height': 279}, {'width': 284, 'height': 379}, {'width': 291, 'height': 388}, {'width': 314, 'height': 419}, {'width': 388, 'height': 517}]}]}] The Home Depot 10028 [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 3, 'url': 'https://reebee-assets.azureedge.net/reebee-store-assets/asset/b2c1c3505fe0f9dd8dc2ea5431158bfa', 'contentType': [{'extension': '.webp', 'type': 'image/webp'}]}]
1 1487388 4 2022-10-25 2022-11-21 1805 0 2 1 7 113698 Transform Any Recipe NaN 0 [{'categoryID': 1}, {'categoryID': 2}] [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/cytg9ubx108c80owo84sogs8c/055f92a74153ee5d259b4ca8c5764f03_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 108}, {'width': 121, 'height': 161}, {'width': 145, 'height': 193}, {'width': 189, 'height': 252}, {'width': 209, 'height': 279}, {'width': 284, 'height': 379}, {'width': 291, 'height': 388}, {'width': 314, 'height': 419}, {'width': 388, 'height': 517}]}]}] VH 13578 [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-store-assets/338cb8d957621a8e05f7a28307c646a4_sl<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 102, 'height': 102}, {'width': 120, 'height': 120}, {'width': 150, 'height': 150}, {'width': 200, 'height': 200}]}]}]
2 1486725 15 2022-10-21 2022-10-27 1807 0 2 1 6 113458 Weekly Flyer Circulaire hebdomadaire 0 [{'categoryID': 1}, {'categoryID': 3}] [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/5cx8mbrucvc480oo00kccgkow/27d7c7fcb22f2500cf95ab4585b8be29_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 83}, {'width': 121, 'height': 124}, {'width': 145, 'height': 148}, {'width': 189, 'height': 193}, {'width': 209, 'height': 214}, {'width': 284, 'height': 291}, {'width': 291, 'height': 298}, {'width': 314, 'height': 321}, {'width': 388, 'height': 397}]}]}] 2001 Audio Video 10219 [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-store-assets/4a39eed200ea70f5640b8fadd16955d3_sl<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 102, 'height': 102}]}]}]
3 1483100 4 2022-10-03 2022-10-30 1808 0 2 1 9 112307 October Savings Économies d'octobre 0 [{'categoryID': 1}, {'categoryID': 10}] [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/3o3awyoez3uowkgkkccsg0cw8/d232f306202435180ee80ccc50cc7cc4_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 106}, {'width': 121, 'height': 159}, {'width': 145, 'height': 190}, {'width': 189, 'height': 248}, {'width': 209, 'height': 274}, {'width': 284, 'height': 372}, {'width': 291, 'height': 381}, {'width': 314, 'height': 411}, {'width': 388, 'height': 508}]}]}] PetSmart 13189 [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 1, 'url': 'https://reebee-assets.azureedge.net/reebee-store-assets/asset/1e1465c11faff57018aefe7a610d12a2', 'contentType': [{'extension': '.webp', 'type': 'image/webp'}]}]
4 1486315 83 2022-10-25 2022-11-06 1817 0 2 1 9 113361 Two-Week Flyer Circulaire de deux semaines 0 [{'categoryID': 1}, {'categoryID': 3}, {'categoryID': 4}, {'categoryID': 5}] [{'type': 'flyerAsset', 'assetTypeID': 4, 'version': 1, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-flyer-assets/eeqpblbrj3cogo8cooc048w48/71091c68f1258d46c676b9305ea48ee9_t<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 81, 'height': 108}, {'width': 121, 'height': 161}, {'width': 145, 'height': 193}, {'width': 189, 'height': 251}, {'width': 209, 'height': 278}, {'width': 284, 'height': 377}, {'width': 291, 'height': 386}, {'width': 314, 'height': 417}, {'width': 388, 'height': 515}]}]}] Princess Auto 10056 [{'type': 'storeLogoAsset', 'assetTypeID': 7, 'version': 2, 'url': 'https://d3179alu5b1vk5.cloudfront.net/reebee-store-assets/f8078d1e1744abbfd1bd61b4da4fb2c0_sl<width>x<height>', 'contentType': [{'extension': '.webp', 'type': 'image/webp', 'metadata': [{'width': 102, 'height': 102}]}]}]
You can drill down further into that JSON object; see the pandas documentation here: https://pandas.pydata.org/docs/
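As a rough sketch of what drilling down could look like: the example below uses a small hand-written list shaped like two entries of the flyerList output above (trimmed to a few fields, so it is an assumption about the structure and not the live data). It keeps a handful of columns, then uses the record_path and meta arguments of pd.json_normalize to get one row per flyer/category pair. The variable names flyer_list, summary and categories are only illustrative.

```python
import pandas as pd

# Hypothetical sample shaped like two entries of the flyerList shown above,
# trimmed to a few fields so the example runs without a network request.
flyer_list = [
    {
        "flyerID": 1485954,
        "numberOfPages": 17,
        "dateValid": "2022-10-20",
        "dateExpired": "2022-10-26",
        "category": [{"categoryID": 1}, {"categoryID": 4}, {"categoryID": 5}],
        "store": {"storeName": "The Home Depot", "storeID": 10028},
    },
    {
        "flyerID": 1487388,
        "numberOfPages": 4,
        "dateValid": "2022-10-25",
        "dateExpired": "2022-11-21",
        "category": [{"categoryID": 1}, {"categoryID": 2}],
        "store": {"storeName": "VH", "storeID": 13578},
    },
]

# Flatten the nested "store" dict into dotted columns, then keep a few of them.
flyers = pd.json_normalize(flyer_list)
summary = flyers[["flyerID", "store.storeName", "dateValid", "dateExpired"]]
print(summary)

# One row per (flyer, category) pair: record_path walks into the nested list,
# and meta carries the selected parent fields along with each record.
categories = pd.json_normalize(
    flyer_list,
    record_path="category",
    meta=["flyerID", ["store", "storeName"]],
)
print(categories)
```

With the real data you would pass json_obj['props']['pageProps']['flyerList'] in place of flyer_list, assuming the page still exposes the same keys.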