Web scraping problem: trying to find the structure, but it keeps returning 'None'

I'd like to understand why this Scrape doesn't work:

I'm using BeautifulSoup for it:

from bs4 import BeautifulSoup
import requests
import pandas as pd

URL = 'https://www.carrefour.com.br/informatica/notebook#crfimt=hm-tlink|carrefour|menu|campanha|notebooks|4|270722?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}

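# Assumed missing step: fetch the page (the snippet uses `page` without defining it)
page = requests.get(URL, headers=headers)

soup1 = BeautifulSoup(page.content, 'html.parser')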
soup2 = BeautifulSoup(soup1.prettify(), 'html.parser')
soup2.find('div', {'class': 'carrefourbr-carrefour-components-0-x-productNameContainer'})

But it doesn't find anything.

HTML Structure:

Output:

Can anybody help here?

P.S: This problem happened when Scraping at first:

So what I did was open my command prompt and enter:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

I don't know if it has anything to do with the problem.

CodePudding user response：

Your scraping effort doesn't work because that page loads its data dynamically, from several GraphQL APIs, so the product markup is not in the HTML that requests receives. You have two options: either use a browser automation tool like Selenium (https://www.selenium.dev/documentation/), or stay with Python's requests, inspect the Network tab in your browser's dev tools, and get the GraphQL API URL (quite a long and complex one), which in your case looks like this:

https://www.carrefour.com.br/_v/segment/graphql/v1?workspace=ab24824&maxAge=short&appsEtag=remove&domain=store&locale=pt-BR&__bindingId=3bab9213-2811-4d32-856a-a4baa1b689b5&operationName=productSearchV3&variables={}&extensions={"persistedQuery":{"version":1,"sha256Hash":"67d0a6ef4d455f259737e4edb1ed58f6db9ff823570356ebc88ae7c5532c0866","sender":"[email protected]","provider":"[email protected]"},"variables":"eyJoaWRlVW5hdmFpbGFibGVJdGVtcyI6ZmFsc2UsInNrdXNGaWx0ZXIiOiJBTExfQVZBSUxBQkxFIiwic2ltdWxhdGlvbkJlaGF2aW9yIjoiZGVmYXVsdCIsImluc3RhbGxtZW50Q3JpdGVyaWEiOiJNQVhfV0lUSE9VVF9JTlRFUkVTVCIsInByb2R1Y3RPcmlnaW5WdGV4IjpmYWxzZSwibWFwIjoiYyxjIiwicXVlcnkiOiJpbmZvcm1hdGljYS9ub3RlYm9vayIsIm9yZGVyQnkiOiJPcmRlckJ5U2NvcmVERVNDIiwiZnJvbSI6MTUsInRvIjoyOSwic2VsZWN0ZWRGYWNldHMiOlt7ImtleSI6ImMiLCJ2YWx1ZSI6ImluZm9ybWF0aWNhIn0seyJrZXkiOiJjIiwidmFsdWUiOiJub3RlYm9vayJ9XSwib3BlcmF0b3IiOiJhbmQiLCJmdXp6eSI6IjAiLCJzZWFyY2hTdGF0ZSI6bnVsbCwiZmFjZXRzQmVoYXZpb3IiOiJkeW5hbWljIiwiY2F0ZWdvcnlUcmVlQmVoYXZpb3IiOiJkZWZhdWx0Iiwid2l0aEZhY2V0cyI6ZmFsc2V9"}

This should answer your question as asked (I'd like to understand why this Scrape doesn't work:). If you need help with scraping the data, try my suggestions above and if you cannot do it, then post a question showing your actual efforts, and you will receive an answer to it.
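As a starting point for the second option: the long "variables" value inside the "extensions" parameter of the URL above is base64-encoded JSON. Decoding it shows fields such as "query" (informatica/notebook), "from" (15) and "to" (29), which look like the paging window. Below is a minimal sketch of decoding, editing, and re-encoding such a value. The helper names (decode_variables, encode_variables, shift_window) and the trimmed sample dict are made up for illustration, and whether the server accepts a modified window is an assumption you would need to check in the Network tab.

```python
import base64
import json


def decode_variables(encoded: str) -> dict:
    """Decode a base64 string into a dict, tolerating missing '=' padding."""
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.b64decode(padded).decode("utf-8"))


def encode_variables(variables: dict) -> str:
    """Serialize a dict to compact JSON and base64-encode it."""
    raw = json.dumps(variables, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def shift_window(variables: dict, page_size: int = 15) -> dict:
    """Return a copy with 'from' and 'to' moved forward by page_size items."""
    shifted = dict(variables)
    shifted["from"] = variables["from"] + page_size
    shifted["to"] = variables["to"] + page_size
    return shifted


# Trimmed, illustrative subset of the decoded payload (not the full set of fields)
sample = {"query": "informatica/notebook", "from": 15, "to": 29}

encoded = encode_variables(sample)
decoded = decode_variables(encoded)
next_window = shift_window(decoded)
print(next_window)
```

To apply this to the real request, you would pass the "variables" string copied from the URL to decode_variables, adjust the fields you need, and put the re-encoded string back into the "extensions" parameter before calling requests.get.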

CodePudding user response：

@barry-the-platipus is right; implementing a traditional web scraper for your project would require quite a lot of development.

If you're planning on investing in this project, I would suggest you opt for Selenium with Python. If you're more interested in a ready-to-use solution, you could give WebScrapingAPI a try. The service has a feature that extracts data based on a CSS selector and returns a JSON object.

Here is a quick Python implementation using WebScrapingAPI for extracting data from the CSS selector you are targeting:

import requests

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'

TARGET_URL = 'https://www.carrefour.com.br/informatica/notebook#crfimt=hm-tlink|carrefour|menu|campanha|notebooks|4|270722?page=1'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
    "extract_rules": '{"data":{"selector":".carrefourbr-carrefour-components-0-x-productNameContainer","output":"text"}}',
    "wait_for": 5000,
    "proxy_type": "residential"
}

response = requests.get(SCRAPER_URL, params=PARAMS)

print(response.text)

Response:

{
   "data":[
      "Notebook Gamer Acer Intel Core i5 8GB 512GB SSD GeForce GTX 4GB 15.6\"IPS Windows 11 Nitro 5 AN515-55-59T4 10ºGer.i5-10300H Preto",
      "Notebook Acer Aspire 5 A514-54-52ty Intel Core I5 11ª Gen Windows 11 Home 8gb 256gb Sdd 14' Full Hd",
      "Notebook Acer Intel Core i3 4GB 256GB SSD 15,6\"TN Windows 11 Aspire 3 A315-56-3478 10ºGer.Core i3–1005G1 Cinza",
      "Notebook Acer Aspire 5 A515-54-57cs Intel Core I5 10ª Gen Windows 11 Home 8gb 256gb Sdd 15.6' FHD",
      "Notebook Acer Intel Core i5 8GB 256GB SSD 15,6\"IPS Windows 11 Aspire 5 A515-54-57CS 10ºGer.Core i5–10210U Prata",
      "Notebook Gamer Acer Intel Core i7 8GB 512GB GTX  Windows11 Nitro 5 AN515-55-79X0 10ºGer.Core i7-10750H Preto"
   ]
}