I am new to web crawling and I am trying to crawl the https://www.stradivarius.com/tr/en/woman/clothing/shop-by-product/sweatshirts-c1390587.html page. Sometimes I get the hrefs, but usually the code gives me an empty list. Do you have any suggestions?
These are the packages:
import requests
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import *
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import json
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')
from unidecode import unidecode
import re
import time
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install())
urlist = []
browser.get('https://www.stradivarius.com/tr/kadın/giyim/ürüne-göre-alışveriş/sweatshi̇rt-c1390587.html')
html = browser.page_source
soup = BeautifulSoup(html)
browser.implicitly_wait(90)
product_links=soup.find_all('a', {'id':'hrefRedirectProduct'})
for a in product_links:
    urlist.append(a["href"])
CodePudding user response:
It's possible the data hasn't rendered yet. You call .implicitly_wait(90), but only after you've already pulled the HTML, so you need to move it up in your code.
urlist = []
browser.get('https://www.stradivarius.com/tr/kadın/giyim/ürüne-göre-alışveriş/sweatshi̇rt-c1390587.html')
browser.implicitly_wait(90)  # <--- set the wait BEFORE...
html = browser.page_source   # ...grabbing the HTML source
soup = BeautifulSoup(html, 'html.parser')
product_links = soup.find_all('a', {'id': 'hrefRedirectProduct'})
for a in product_links:
    urlist.append(a['href'])  # index the loop variable, not the whole result list
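One caveat: as far as Selenium's documentation describes it, an implicit wait only applies to element lookups such as find_element, so it may not delay page_source at all. If the list is still empty, an explicit wait on the product links themselves is worth trying. The following is only a sketch under assumptions: the helper names hrefs_from_elements and wait_for_product_links, the 30-second timeout and the CSS selector a#hrefRedirectProduct (derived from the id used above) are illustrative choices, not something the site guarantees.

```python
def hrefs_from_elements(elements):
    """Collect the non-empty href attribute of each located element."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:
            hrefs.append(href)
    return hrefs


def wait_for_product_links(browser, timeout=30):
    """Block until product links exist in the DOM, then return their hrefs."""
    # Imported here so hrefs_from_elements stays usable without Selenium.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Assumed selector: every product anchor carries id="hrefRedirectProduct".
    locator = (By.CSS_SELECTOR, "a#hrefRedirectProduct")
    try:
        elements = WebDriverWait(browser, timeout).until(
            EC.presence_of_all_elements_located(locator)
        )
    except TimeoutException:
        return []  # nothing matching the selector appeared within the timeout
    return hrefs_from_elements(elements)
```

After browser.get(...), calling urlist = wait_for_product_links(browser) would replace both the page_source and the BeautifulSoup steps, because the hrefs are read from the live DOM.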
A better solution may be to skip the rendered page and request the data directly from the source.
Does this include your desired href?
import requests
import pandas as pd
url = 'https://www.stradivarius.com/itxrest/2/catalog/store/54009571/50331068/category/1390587/product?languageId=-43&appId=1'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData['products'])
print(df['productUrl'])
Output:
0                            kolej-sweatshirt-l06710711
1     oversize-hard-rock-cafe-baskl-sweatshirt-l0670...
2     oversize-hard-rock-cafe-baskl-sweatshirt-l0670...
3     oversize-hard-rock-cafe-kapusonlu-sweatshirt-l...
4                         fermuarl-sweatshirt-l06521718
                            ...
60     fermuarl-oversize-kapusonlu-sweatshirt-l06765643
61                   dikisli-basic-sweatshirt-l06519703
62    jogging-fit-pantolon-ve-sweatshirt-seti-l01174780
63                          naylon-sweatshirt-l08221191
64                   dikisli-basic-sweatshirt-l06519703
Name: productUrl, Length: 65, dtype: object
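Note that the same slug can appear more than once in this listing (rows 61 and 64 are identical). If you only need each product once, something like the following might help. It is a minimal sketch: unique_slugs is a hypothetical helper, and the sample payload is made up from slugs in the output above, assuming each entry under 'products' is a dict that may carry a 'productUrl' key.

```python
def unique_slugs(payload):
    """Return productUrl values in first-seen order, without repeats."""
    slugs = (p.get("productUrl") for p in payload.get("products", []))
    # dict.fromkeys keeps insertion order, so the original ordering survives.
    return list(dict.fromkeys(s for s in slugs if s))


# Hypothetical sample shaped like the JSON response used above.
sample = {
    "products": [
        {"productUrl": "dikisli-basic-sweatshirt-l06519703"},
        {"productUrl": "naylon-sweatshirt-l08221191"},
        {"productUrl": "dikisli-basic-sweatshirt-l06519703"},
        {"name": "entry without a productUrl"},
    ]
}

print(unique_slugs(sample))
# ['dikisli-basic-sweatshirt-l06519703', 'naylon-sweatshirt-l08221191']
```

With the real response you would call unique_slugs(jsonData). If you prefer to stay in pandas, df['productUrl'].drop_duplicates() does a similar job.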