Web Scraping problem: trying to find the structure it keeps returning me 'None'

Time:09-19

I'd like to understand why this Scrape doesn't work:

I'm using BeautifulSoup for it:

from bs4 import BeautifulSoup
import requests
import pandas as pd

URL = 'https://www.carrefour.com.br/informatica/notebook#crfimt=hm-tlink|carrefour|menu|campanha|notebooks|4|270722?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}

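# Fetch the page; `page` was never defined in the snippet as posted
page = requests.get(URL, headers=headers)

soup1 = BeautifulSoup(page.content, 'html.parser')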
soup2 = BeautifulSoup(soup1.prettify(), 'html.parser')
soup2.find('div', {'class': 'carrefourbr-carrefour-components-0-x-productNameContainer'})

But it doesn't find anything; the find call returns None.

HTML Structure: [screenshot]

Output: [screenshot]

Can anybody help here?

P.S.: This problem happened when I first tried scraping: [screenshot]

So I opened my command prompt and ran:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10

I don't know whether this has anything to do with the problem.

CodePudding user response:

Your scraping effort doesn't work because that page loads its data dynamically from several GraphQL APIs, so the product markup is not in the HTML that requests receives. You have two options: either use a browser-automation tool such as Selenium (https://www.selenium.dev/documentation/), or stay with Python's requests, inspect the Network tab in your browser's dev tools, and get the GraphQL API URL (quite a long and complex one), which in your case looks like this:

https://www.carrefour.com.br/_v/segment/graphql/v1?workspace=ab24824&maxAge=short&appsEtag=remove&domain=store&locale=pt-BR&__bindingId=3bab9213-2811-4d32-856a-a4baa1b689b5&operationName=productSearchV3&variables={}&extensions={"persistedQuery":{"version":1,"sha256Hash":"67d0a6ef4d455f259737e4edb1ed58f6db9ff823570356ebc88ae7c5532c0866","sender":"[email protected]","provider":"[email protected]"},"variables":"eyJoaWRlVW5hdmFpbGFibGVJdGVtcyI6ZmFsc2UsInNrdXNGaWx0ZXIiOiJBTExfQVZBSUxBQkxFIiwic2ltdWxhdGlvbkJlaGF2aW9yIjoiZGVmYXVsdCIsImluc3RhbGxtZW50Q3JpdGVyaWEiOiJNQVhfV0lUSE9VVF9JTlRFUkVTVCIsInByb2R1Y3RPcmlnaW5WdGV4IjpmYWxzZSwibWFwIjoiYyxjIiwicXVlcnkiOiJpbmZvcm1hdGljYS9ub3RlYm9vayIsIm9yZGVyQnkiOiJPcmRlckJ5U2NvcmVERVNDIiwiZnJvbSI6MTUsInRvIjoyOSwic2VsZWN0ZWRGYWNldHMiOlt7ImtleSI6ImMiLCJ2YWx1ZSI6ImluZm9ybWF0aWNhIn0seyJrZXkiOiJjIiwidmFsdWUiOiJub3RlYm9vayJ9XSwib3BlcmF0b3IiOiJhbmQiLCJmdXp6eSI6IjAiLCJzZWFyY2hTdGF0ZSI6bnVsbCwiZmFjZXRzQmVoYXZpb3IiOiJkeW5hbWljIiwiY2F0ZWdvcnlUcmVlQmVoYXZpb3IiOiJkZWZhdWx0Iiwid2l0aEZhY2V0cyI6ZmFsc2V9"}
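The long "variables" value inside the extensions part of the URL above is Base64-encoded JSON, and among other fields it carries "from": 15 and "to": 29, which look like the paging window. A minimal sketch of decoding and re-encoding that value follows. The helper names (decode_variables, encode_variables, with_page) are made up for illustration, the page size of 15 is inferred from that single from/to pair, and whether the server accepts modified variables under the same sha256Hash is an assumption you would need to verify yourself.

```python
import base64
import json


def decode_variables(encoded: str) -> dict:
    """Decode the Base64 'variables' string taken from the GraphQL URL."""
    # Restore '=' padding in case it was stripped from the URL.
    padded = encoded + "=" * (-len(encoded) % 4)
    return json.loads(base64.b64decode(padded).decode("utf-8"))


def encode_variables(variables: dict) -> str:
    """Serialize a variables dict as compact JSON, then Base64-encode it."""
    raw = json.dumps(variables, separators=(",", ":"), ensure_ascii=False)
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def with_page(variables: dict, page: int, page_size: int = 15) -> dict:
    """Return a copy with 'from'/'to' set for a zero-based page index.

    page_size=15 is an assumption based on the from=15, to=29 window
    seen in the captured request.
    """
    updated = dict(variables)
    updated["from"] = page * page_size
    updated["to"] = updated["from"] + page_size - 1
    return updated


# Shortened stand-in for the real value; paste the full string from the
# Network tab in its place when trying this against the site.
sample = encode_variables({"query": "informatica/notebook", "from": 15, "to": 29})
first_page = with_page(decode_variables(sample), page=0)
print(first_page["from"], first_page["to"])  # 0 14
```

The re-encoded string would then replace the original "variables" value inside the extensions JSON before the URL is requested with requests.get and the same headers as in the question.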

This should answer your question as asked (I'd like to understand why this Scrape doesn't work:). If you need help with scraping the data, try my suggestions above and if you cannot do it, then post a question showing your actual efforts, and you will receive an answer to it.

CodePudding user response:

@barry-the-platipus is right; implementing a traditional web scraper for your project would require quite a lot of development.

If you're planning on investing in this project, I would suggest you opt for Selenium with Python. If you're more interested in a ready-to-use solution, you could give WebScrapingAPI a try. The service has a feature that extracts data based on a CSS selector and returns a JSON object.

Here is a quick Python implementation using WebScrapingAPI that extracts data from the CSS selector you are targeting:

import requests

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'

TARGET_URL = 'https://www.carrefour.com.br/informatica/notebook#crfimt=hm-tlink|carrefour|menu|campanha|notebooks|4|270722?page=1'

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
    "extract_rules": '{"data":{"selector":".carrefourbr-carrefour-components-0-x-productNameContainer","output":"text"}}',
    "wait_for": 5000,
    "proxy_type": "residential"
}

response = requests.get(SCRAPER_URL, params=PARAMS)

print(response.text)

Response:

{
   "data":[
      "Notebook Gamer Acer Intel Core i5 8GB 512GB SSD GeForce GTX 4GB 15.6\"IPS Windows 11 Nitro 5 AN515-55-59T4 10ºGer.i5-10300H Preto",
      "Notebook Acer Aspire 5 A514-54-52ty Intel Core I5 11ª Gen Windows 11 Home 8gb 256gb Sdd 14' Full Hd",
      "Notebook Acer Intel Core i3 4GB 256GB SSD 15,6\"TN Windows 11 Aspire 3 A315-56-3478 10ºGer.Core i3–1005G1 Cinza",
      "Notebook Acer Aspire 5 A515-54-57cs Intel Core I5 10ª Gen Windows 11 Home 8gb 256gb Sdd 15.6' FHD",
      "Notebook Acer Intel Core i5 8GB 256GB SSD 15,6\"IPS Windows 11 Aspire 5 A515-54-57CS 10ºGer.Core i5–10210U Prata",
      "Notebook Gamer Acer Intel Core i7 8GB 512GB GTX  Windows11 Nitro 5 AN515-55-79X0 10ºGer.Core i7-10750H Preto"
   ]
}
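Once a response of that shape comes back, the product names can be pulled out with the standard json module. This is only a sketch: product_names is a hypothetical helper, and it assumes the body always has the {"data": [...]} layout shown above, which may not hold for error responses.

```python
import json


def product_names(response_text: str) -> list[str]:
    """Return the product names from a {"data": [...]} JSON body."""
    payload = json.loads(response_text)
    # Fall back to an empty list if the "data" key is missing.
    return [name.strip() for name in payload.get("data", [])]


# Tiny stand-in for response.text from the request above.
sample_text = '{"data": ["Notebook A ", "Notebook B"]}'
print(product_names(sample_text))  # ['Notebook A', 'Notebook B']
```

The resulting list can be passed straight to pandas.DataFrame if you want a table, since pandas is already imported in the question's code.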