I'd like to understand why this scrape doesn't work.
I'm using BeautifulSoup for it:
from bs4 import BeautifulSoup
import requests
import pandas as pd
URL = 'https://www.carrefour.com.br/informatica/notebook#crfimt=hm-tlink|carrefour|menu|campanha|notebooks|4|270722?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}
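# Fetch the page first; the snippet as posted never defined `page`
page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, 'html.parser')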
soup2 = BeautifulSoup(soup1.prettify(), 'html.parser')
soup2.find('div', {'class': 'carrefourbr-carrefour-components-0-x-productNameContainer'})
But it doesn't find anything.
HTML Structure:
Can anybody help here?
P.S.: This problem happened when I first tried scraping:
So I opened my command prompt and ran:
jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
I don't know if it has something to do with the problem.
CodePudding user response:
Your scraping effort doesn't work because that page loads its data dynamically, from several GraphQL APIs, so the product markup is not in the HTML that requests receives. You have two options: either use a browser-automation tool like Selenium (https://www.selenium.dev/documentation/), or stay with Python's requests, inspect the Network tab in your browser's dev tools, and get the GraphQL API URL (quite a long and complex one), which in your case looks like this:
https://www.carrefour.com.br/_v/segment/graphql/v1?workspace=ab24824&maxAge=short&appsEtag=remove&domain=store&locale=pt-BR&__bindingId=3bab9213-2811-4d32-856a-a4baa1b689b5&operationName=productSearchV3&variables={}&extensions={"persistedQuery":{"version":1,"sha256Hash":"67d0a6ef4d455f259737e4edb1ed58f6db9ff823570356ebc88ae7c5532c0866","sender":"[email protected]","provider":"[email protected]"},"variables":"eyJoaWRlVW5hdmFpbGFibGVJdGVtcyI6ZmFsc2UsInNrdXNGaWx0ZXIiOiJBTExfQVZBSUxBQkxFIiwic2ltdWxhdGlvbkJlaGF2aW9yIjoiZGVmYXVsdCIsImluc3RhbGxtZW50Q3JpdGVyaWEiOiJNQVhfV0lUSE9VVF9JTlRFUkVTVCIsInByb2R1Y3RPcmlnaW5WdGV4IjpmYWxzZSwibWFwIjoiYyxjIiwicXVlcnkiOiJpbmZvcm1hdGljYS9ub3RlYm9vayIsIm9yZGVyQnkiOiJPcmRlckJ5U2NvcmVERVNDIiwiZnJvbSI6MTUsInRvIjoyOSwic2VsZWN0ZWRGYWNldHMiOlt7ImtleSI6ImMiLCJ2YWx1ZSI6ImluZm9ybWF0aWNhIn0seyJrZXkiOiJjIiwidmFsdWUiOiJub3RlYm9vayJ9XSwib3BlcmF0b3IiOiJhbmQiLCJmdXp6eSI6IjAiLCJzZWFyY2hTdGF0ZSI6bnVsbCwiZmFjZXRzQmVoYXZpb3IiOiJkeW5hbWljIiwiY2F0ZWdvcnlUcmVlQmVoYXZpb3IiOiJkZWZhdWx0Iiwid2l0aEZhY2V0cyI6ZmFsc2V9"}
This should answer your question as asked ("I'd like to understand why this scrape doesn't work"). If you need help with scraping the data, try my suggestions above, and if you get stuck, post a new question showing your actual efforts and you will receive an answer to it.
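For what it's worth, the `variables` value inside `extensions` in that URL is just Base64-encoded JSON. Decoding it shows fields such as `query` ("informatica/notebook"), `from` (15) and `to` (29), which look like a paging window. Below is a minimal sketch of decoding and re-encoding such a payload with the standard library. The helper names and the shortened sample payload are hypothetical, and whether the server accepts an edited payload alongside the same `sha256Hash` is an assumption I have not verified:

```python
import base64
import json


def decode_variables(encoded: str) -> dict:
    """Decode a Base64 'variables' string from the GraphQL URL into a dict."""
    # Restore any '=' padding that may have been stripped from the URL.
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.b64decode(padded).decode("utf-8"))


def encode_variables(variables: dict) -> str:
    """Encode a dict as compact JSON, then as Base64."""
    raw = json.dumps(variables, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Shortened, hypothetical payload using field names seen in the decoded string.
sample = {"query": "informatica/notebook", "from": 15, "to": 29}
encoded = encode_variables(sample)

decoded = decode_variables(encoded)
# Shift the window to the next page (assumes from/to are inclusive indexes).
page_size = decoded["to"] - decoded["from"] + 1
decoded["from"] += page_size
decoded["to"] += page_size
next_encoded = encode_variables(decoded)
print(decode_variables(next_encoded))
```

To inspect the real payload, pass the long `variables` string copied from the URL to `decode_variables` instead of the sample.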
CodePudding user response:
@barry-the-platipus is right; implementing a traditional web scraper for your project would require quite a lot of development.
If you're planning on investing in this project, I would suggest you opt for Selenium with Python. If you're more interested in a ready-to-use solution, you could give WebScrapingAPI a try. The service has a feature that extracts data based on a CSS selector and returns a JSON object.
Here is a quick Python implementation using WebScrapingAPI to extract data from the CSS selector you are targeting:
import requests

# Replace with your own WebScrapingAPI key
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://www.carrefour.com.br/informatica/notebook#crfimt=hm-tlink|carrefour|menu|campanha|notebooks|4|270722?page=1'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    # Ask the service to render JavaScript before extracting
    "render_js": 1,
    # Return the text of every element matching the CSS selector
    "extract_rules": '{"data":{"selector":".carrefourbr-carrefour-components-0-x-productNameContainer","output":"text"}}',
    "wait_for": 5000,
    "proxy_type": "residential"
}

response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
Response:
{
"data":[
"Notebook Gamer Acer Intel Core i5 8GB 512GB SSD GeForce GTX 4GB 15.6\"IPS Windows 11 Nitro 5 AN515-55-59T4 10ºGer.i5-10300H Preto",
"Notebook Acer Aspire 5 A514-54-52ty Intel Core I5 11ª Gen Windows 11 Home 8gb 256gb Sdd 14' Full Hd",
"Notebook Acer Intel Core i3 4GB 256GB SSD 15,6\"TN Windows 11 Aspire 3 A315-56-3478 10ºGer.Core i3–1005G1 Cinza",
"Notebook Acer Aspire 5 A515-54-57cs Intel Core I5 10ª Gen Windows 11 Home 8gb 256gb Sdd 15.6' FHD",
"Notebook Acer Intel Core i5 8GB 256GB SSD 15,6\"IPS Windows 11 Aspire 5 A515-54-57CS 10ºGer.Core i5–10210U Prata",
"Notebook Gamer Acer Intel Core i7 8GB 512GB GTX Windows11 Nitro 5 AN515-55-79X0 10ºGer.Core i7-10750H Preto"
]
}
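If you want the names as a Python list rather than raw text, the response body can be parsed with the standard `json` module. This is a minimal sketch, assuming the body always has the shape shown above (a top-level `data` list of strings). `extract_names` is a hypothetical helper, and `sample_body` stands in for `response.text`, using two of the titles from the response:

```python
import json


def extract_names(body: str) -> list:
    """Return the product names from a body shaped like {"data": [...]}."""
    payload = json.loads(body)
    # Use an empty list if the "data" key is missing.
    names = payload.get("data", [])
    return [name.strip() for name in names]


# Stand-in for response.text, using two titles copied from the response above.
sample_body = """
{
  "data": [
    "Notebook Acer Aspire 5 A514-54-52ty Intel Core I5 11ª Gen Windows 11 Home 8gb 256gb Sdd 14' Full Hd",
    "Notebook Acer Aspire 5 A515-54-57cs Intel Core I5 10ª Gen Windows 11 Home 8gb 256gb Sdd 15.6' FHD"
  ]
}
"""

names = extract_names(sample_body)
for name in names:
    print(name)
```

From there, the list can be handed to whatever you use downstream, for example a pandas DataFrame as in the question.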