Sorry if these are dumb questions; this is my first ever scraping code. I have been trying to scrape the data from a page of computing products and save it, but I am having trouble getting it to work.
The code I wrote is meant to collect the links for all the pages of one category (40 items per page). That part works pretty well.
The rest of the code is for getting the data. For the first 40 items on the first link it works very well, but when I tried to iterate over the pages it got really messed up, and the second part, which gets the data, stopped working.
#https://www.youtube.com/watch?v=wLRNdCTXmnE
import requests
from bs4 import BeautifulSoup as bs
import itertools
import numpy as np
pages=[]
prices=[]
ids=[]
list_codigo=[]
prices=[]
url_collected=[]
#Loop to go over all pages
pages= np.arange(40,120,40)
print(pages)
# loop over the pages to build a list of links
for page in pages:
    a='https://www.paris.cl/tecnologia/consolas-videojuegos/?start='
    b='&sz=40'
    c=str(page)
    page = a + c + b
    print(page)
    url_collected.append(page)
print(url_collected)
#https://www.paris.cl/tecnologia/consolas-videojuegos/?start=40&sza=40
response=requests.get(page).text
soup=bs(response,"html.parser")
# web scraping the data from the links (not working so well)
for object in soup.find_all("div",class_='price-content'):
    final =object.find_all(class_="price__text")
    price =final[0].get('aria-label')
    print(price)
    prices.append(price)
for object in soup.find_all("div",class_='onecolumn'):
    final2 =object.find_all(class_="product-tile")
    id1 =final2[0].get('data-itemid')
    list_codigo.append(id1)
    print(id1)
# get data in array like csv format
for n, v in zip(prices, list_codigo):
    print("{} , {}".format(n, v))
# price = final[0].get('content')
#prices.append(price)
Does anyone know what I am doing wrong?
CodePudding user response:
Don't scrape id, price, name, etc. separately, because some products may have 2 or 3 prices and other products may be missing some value, which then gets skipped, and later zip() will create wrong pairs.
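A small illustration of that pairing problem, using made-up values (the ids and prices below are hypothetical, not taken from the site). The first product contributes two prices, so every later pair shifts by one:

```python
# Hypothetical data: product A shows two prices, B and C show one each.
ids = ["A", "B", "C"]
prices = ["100", "90", "200", "300"]  # "100" and "90" both belong to A

# zip() pairs items by position only, so B and C get the wrong price
pairs = list(zip(ids, prices))
print(pairs)  # [('A', '100'), ('B', '90'), ('C', '200')]
```

zip() also stops silently at the shorter list, so the last price is dropped without any error.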
Better to first find all products (all product-tile elements) and then run a for-loop to work with every product separately, searching for id, price and name within a single product-tile. If a product has many prices, you can simply take only one, and if a value is missing, you can assign None or a default value.
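A minimal sketch of the "assign None when missing" idea. The HTML below is invented and only imitates the class and attribute names mentioned in this thread, so treat it as an assumption about the page structure rather than a copy of it:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the first product has two prices, the second has none.
html = """
<div class="product-tile" data-itemid="A1">
  <span itemprop="name">Console A</span>
  <div class="price__text" aria-label="100 pesos"></div>
  <div class="price__text" aria-label="120 pesos"></div>
</div>
<div class="product-tile" data-itemid="B2">
  <span itemprop="name">Console B</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for product in soup.find_all("div", {"class": "product-tile"}):
    itemid = product.get("data-itemid")  # .get() returns None if the attribute is absent

    # find() returns None when nothing matches, so check before reading the text
    name_tag = product.find("span", {"itemprop": "name"})
    name = name_tag.get_text(strip=True) if name_tag else None

    # keep only the first price; use None when the product has no price at all
    all_prices = product.find_all("div", {"class": "price__text"})
    price = all_prices[0].get("aria-label") if all_prices else None

    rows.append((itemid, name, price))

print(rows)
```

Each product always yields exactly one row, so ids, names and prices cannot drift apart the way they do with separate lists and zip().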
Minimal working code.
I keep only the important elements.
Because the words product and products are very similar and it is easy to make a mistake, I use the prefix all_ for lists (all_products, all_prices).
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.paris.cl/tecnologia/consolas-videojuegos/'
params = {
    'start': 0,
    'sz': 40,
}
results = []
for offset in range(0, 121, 40): # set end at `121` so it will use `120`; if you set end at `120` then it will finish on `80`
    params['start'] = offset
    response = requests.get(url, params=params)
    print('url:', response.url)
    #print('status:', response.status_code)
    soup = bs(response.text, "html.parser")
    all_products = soup.find_all('div', {'class': 'product-tile'})
    for product in all_products:
        itemid = product.get('data-itemid')
        print('itemid:', itemid)
        data = product.get('data-product')
        print('data:', data)
        name = product.find('span', {'itemprop': 'name'}).text
        print('name:', name)
        all_prices = product.find_all('div', {'class': 'price__text'})
        print('len(all_prices):', len(all_prices))
        price = all_prices[0].get('aria-label')
        print('price:', price)
        results.append( (itemid, name, price, data) )
        print('---')
# ---
# ... here you can save all `results` in file ...
Result:
url: https://www.paris.cl/tecnologia/consolas-videojuegos/?start=0&sz=40
itemid: CBELC349
data: {"id":"CBELC349","name":"Consola Nintendo Switch Neon Switch Mario Kart 8 Deluxe","variant":"CBELC349","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"419990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"","dimension33":"010","dimension12":"","dimension18":"","dimension19":"469990","dimension20":"399990","dimension30":"Nintendo","dimension41":"4.8571","dimension42":14}
name: Consola Nintendo Switch Neon Switch Mario Kart 8 Deluxe
len(all_prices): 2
price: 399.990 pesos
---
itemid: 259382999
data: {"id":"259382999","name":"Consola Nintendo Switch Neon ","variant":"259382999","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"369990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"010","dimension12":"CONSOLA PORTABLES","dimension18":"","dimension19":"399990","dimension20":"359990","dimension21":"True","dimension30":"Nintendo","dimension41":"4.644","dimension42":191}
name: Consola Nintendo Switch Neon
len(all_prices): 2
price: 359.990 pesos
---
itemid: 292147999
data: {"id":"292147999","name":"Nintendo Switch OLED White Joy-Con","variant":"292147999","category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo","brand":"Nintendo","price":"459990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"010","dimension12":"CONSOLA PORTABLES","dimension18":"","dimension19":"469990","dimension20":0,"dimension21":"True","dimension30":"Nintendo","dimension41":"4.9574","dimension42":47}
name: Nintendo Switch OLED White Joy-Con
len(all_prices): 1
price: 459.990 pesos
---
itemid: 590573999
data: {"id":"590573999","name":"Consola Sony PS4 Slim 1TB Black","variant":"590573999","category":"Tecno/Consolas y VideoJuegos/Consolas PlayStation","brand":"Sony","price":"539990","dimension2":"743","dimension3":"VIDEOJUEGOS","dimension32":"005","dimension11":"CONSOLAS","dimension33":"005","dimension12":"CONSOLA HOME","dimension18":"","dimension19":"549990","dimension20":"529990","dimension21":"True","dimension30":"Sony"}
name: Consola Sony PS4 Slim 1TB Black
len(all_prices): 2
price: 529.990 pesos
---
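The last comment in the code leaves saving results to a file open. One possible way, sketched with the standard csv module: the helper name save_results and the file name results.csv are my own choices, and the two sample rows are shortened copies of values from the output above.

```python
import csv

# Two sample rows in the same (itemid, name, price, data) shape as `results`;
# the data field is shortened here.
sample_results = [
    ("CBELC349", "Consola Nintendo Switch Neon Switch Mario Kart 8 Deluxe",
     "399.990 pesos", '{"id":"CBELC349"}'),
    ("292147999", "Nintendo Switch OLED White Joy-Con",
     "459.990 pesos", '{"id":"292147999"}'),
]

def save_results(rows, filename):
    """Write rows to a CSV file with a header line."""
    # newline="" lets the csv module control line endings itself
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["itemid", "name", "price", "data"])
        writer.writerows(rows)

save_results(sample_results, "results.csv")  # hypothetical file name
```

In the real script you would call save_results(results, "results.csv") once, after the outer loop has finished. The csv module quotes the data field automatically, so the commas inside the JSON text do not break the columns.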
Frankly, you can get most values from data-product: it has id, name, price (a plain number string such as "419990", without the thousands separator shown on the page), brand and category.
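Since data-product holds JSON text, it can be parsed with the standard json module. A minimal sketch, using a shortened copy of the first data value from the output above (the variable names are my own):

```python
import json

# Shortened copy of the first data-product value shown in the output.
data = (
    '{"id":"CBELC349",'
    '"name":"Consola Nintendo Switch Neon Switch Mario Kart 8 Deluxe",'
    '"category":"Tecno/Consolas y VideoJuegos/Consolas Nintendo",'
    '"brand":"Nintendo","price":"419990"}'
)

info = json.loads(data)      # JSON text -> Python dict
price = int(info["price"])   # "419990" -> 419990

# Format with "." as the thousands separator, as the page does
formatted = f"{price:,}".replace(",", ".")

print(info["id"], info["brand"], price, formatted)
# CBELC349 Nintendo 419990 419.990
```

Note that in the output above the price in data-product (419990) is not the same as the first price__text value for that product (399.990 pesos), so it is worth checking which price you actually need.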