Problem with scraping a website: the first 2 pages open, the rest do not

Time:09-21

I have been trying to write a data scraper for an online shop that sells cables and other products. I wrote some simple code that should work. The shop's products are divided into categories, and I started with the first category, cables.

for i in range(0, 27):
    url = "https://onninen.pl/produkty/Kable-i-przewody?query=/strona:{0}"
    url = url.format(i)

It works fine for the first two pages, with i equal to 0 and 1 (I get response code 200), but no matter when I try, the other pages, from 2 onward, return error 500. I have no idea why, especially since they open normally when I follow the same links manually. I even tried randomizing the time between requests :( Any idea what the problem might be? Should I try a different web scraping library? The full code is below:

import requests
from fake_useragent import UserAgent
import pandas as pd
from bs4 import BeautifulSoup
import time
import random

products = []  # List to store name of the product
MIN = []  # Manufacturer item number
prices = []  # List to store price of the product
df = pd.DataFrame()
user_agent = UserAgent()
i = 0
for i in range(0, 27):
    url = "https://onninen.pl/produkty/Kable-i-przewody?query=/strona:{0}"
    url = url.format(i)
    #print(url)
    # getting the response from the page using get method of requests module
    page = requests.get(url, headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"})
    #print(page.status_code)
    # storing the content of the page in a variable
    html = page.content
    # creating BeautifulSoup object
    page_soup = BeautifulSoup(html, "html.parser")
    #print(page_soup.prettify())
    for containers in page_soup.findAll('div', {'class': 'styles__ProductsListItem-vrexg1-2 gkrzX'}):
        name = containers.find('label', attrs={'class': 'styles__Label-sc-1x6v2mz-2 gmFpMA label'})
        price = containers.find('span', attrs={'class': 'styles__PriceValue-sc-33rfvt-10 fVFAzY'})
        man_it_num = containers.find('div', attrs={'title': 'Indeks producenta'})
        formatted_name = name.text.replace('Dodaj do koszyka: ', '')
        products.append(formatted_name)
        prices.append(price.text)
        MIN.append(man_it_num.text)

    df = pd.DataFrame({'Product Name': products, 'Price': prices, 'MIN': MIN})
    time.sleep(random.randint(2, 11))
#df.to_excel('output.xlsx', sheet_name='Kable i przewody')

Answer:

This happens because the pages are loaded dynamically via an API. To get all the data, you have to query the API directly.

Example:

import pandas as pd
import requests

api_url = 'https://onninen.pl/api/search?query=/Kable-i-przewody/strona:{p}'
headers = {
    'user-agent': 'Mozilla/5.0',
    'referer': 'https://onninen.pl/produkty/Kable-i-przewody?query=/strona:2',
    'cookie': '_gid=GA1.2.1022119173.1663690794; _fuid=60a315c76d054fd5add850c7533f529e; _gcl_au=1.1.1522602410.1663690804; pollsvisible=[]; smuuid=1835bb31183-22686567c511-4116ddce-c55aa071-2639dbd6-ec19e64a550c; _smvs=DIRECT; poll_random_44=1; poll_visited_pages=2; _ga=GA1.2.1956280663.1663690794; smvr=eyJ2aXNpdHMiOjEsInZpZXdzIjo3LCJ0cyI6MTY2MzY5MjU2NTI0NiwibnVtYmVyT2ZSZWplY3Rpb25CdXR0b25DbGljayI6MCwiaXNOZXdTZXNzaW9uIjpmYWxzZX0=; _ga_JXR5QZ2XSJ=GS1.1.1663690794.1.1.1663692567.0.0.0'
    }

dfs = []
for p in range(1, 28):
    # Each response nests the product list under items[0]['items']
    data = requests.get(api_url.format(p=p), headers=headers).json()['items'][0]['items']
    dfs.append(pd.DataFrame(data))
df = pd.concat(dfs)
print(df)

Output:

        id                                               slug   index    catalogindex  ... onntopcb  isnew    qc   ads
0   147774  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x...  HES890  112271067D0500  ...        0  False  None  None
1    45315  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x...  HES893  112271068D0500  ...        0  False  None  None
2   169497  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-3x...  HES896  112271069D0500  ...        0  False  None  None
3   141820  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-4x...  HES900  112271056D0500  ...        0  False  None  None
4    47909  KABLE-ROZNE-MARKI-Kabel-energetyczny-YKY-ZO-4x...  HES903  112271064D0500  ...        0  False  None  None
..     ...                                                ...     ...             ...  ...      ...    ...   ...   ...
37  111419  NVENT-RAYCHEM-Kabel-grzejny-EM2-XR-samoreguluj...  HDZ938      449561-000  ...        0   True  None  None
38  176526  NVENT-RAYCHEM-Przewod-stalooporowy-GM-2CW-35m-...  HEA099      SZ18300102  ...        0  False  None  None
39   38484  DEVI-Mata-grzewcza-DEVIheat-150S-150W-m2-375W-...  HAJ162        140F0332  ...        1  False  None  None
40   60982  DEVI-Mata-grzewcza-DEVImat-150T-150W-m2-375W-0...  HAJ157        140F0448  ...        1  False  None  None
41  145612  DEVI-Czujnik-Devireg-850-rynnowy-czujnik-140F1...  HAJ212        140F1086  ...        0  False  None  None

[1292 rows x 27 columns]
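
A slightly more defensive variant of the loop above might look like the sketch below. It assumes the response layout used in the answer (`['items'][0]['items']`). The helper names `extract_items`, `collect_products` and `fetch_with_requests` are hypothetical, not part of any library. Stopping at the first empty page and sending only a user-agent header are also assumptions: the answer sent `referer` and `cookie` as well, and the API may require them.

```python
from typing import Callable, Iterable

API_URL = "https://onninen.pl/api/search?query=/Kable-i-przewody/strona:{page}"


def extract_items(payload: dict) -> list:
    """Return the product dicts from one API response.

    Assumes the layout seen in the answer: payload['items'][0]['items'].
    """
    groups = payload.get("items") or []
    if not groups:
        return []
    return list(groups[0].get("items") or [])


def collect_products(fetch_json: Callable[[str], dict], pages: Iterable[int]) -> list:
    """Fetch each page in turn and stop at the first page with no products."""
    rows = []
    for page in pages:
        items = extract_items(fetch_json(API_URL.format(page=page)))
        if not items:
            break  # assumption: an empty page means we ran past the last one
        rows.extend(items)
    return rows


def fetch_with_requests(url: str) -> dict:
    """Network fetcher; requests is imported lazily so the helpers above work offline."""
    import requests

    response = requests.get(url, headers={"user-agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()
```

Calling `collect_products(fetch_with_requests, range(1, 28))` and passing the result to `pd.DataFrame` should give a table like the one above, provided the API still responds as shown. Taking the fetcher as a parameter keeps the paging logic testable without hitting the site.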