Why Selenium sometimes can't find href without error

Time:10-23

I am new to crawling and I am trying to crawl the webpage https://www.stradivarius.com/tr/en/woman/clothing/shop-by-product/sweatshirts-c1390587.html. Sometimes I can get the hrefs, but usually the code gives me an empty list with no error. Do you have any suggestions?

These are the packages:

import requests
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import *
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import json
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore')
from unidecode import unidecode
import re
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install())

urlist = []
browser.get('https://www.stradivarius.com/tr/kadın/giyim/ürüne-göre-alışveriş/sweatshi̇rt-c1390587.html')
html = browser.page_source
soup = BeautifulSoup(html)
browser.implicitly_wait(90)
product_links=soup.find_all('a', {'id':'hrefRedirectProduct'})
for a in product_links:
    urlist.append(a['href'])

CodePudding user response:

It's possible the data hasn't rendered yet. You do call .implicitly_wait(90), but only after you've already pulled the HTML, so it has no effect on what you scraped. Move it up so the wait is set before you grab the page source. (Note that an implicit wait only kicks in when the driver looks up elements, so an explicit WebDriverWait on the product links is generally more reliable.)

urlist = []
browser.get('https://www.stradivarius.com/tr/kadın/giyim/ürüne-göre-alışveriş/sweatshi̇rt-c1390587.html')
browser.implicitly_wait(90)  # <-- wait for the page to render BEFORE...
html = browser.page_source   # ...grabbing the html source
soup = BeautifulSoup(html, 'html.parser')
product_links = soup.find_all('a', {'id': 'hrefRedirectProduct'})
for a in product_links:
    urlist.append(a['href'])  # index each tag, not the whole ResultSet
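When extracting the links, using `a.get('href')` on each tag also avoids a KeyError on anchors that lack the attribute. A minimal, self-contained sketch against static HTML (the markup below is invented purely for illustration):

```python
from bs4 import BeautifulSoup

def extract_hrefs(html):
    """Collect href values from all anchors with id 'hrefRedirectProduct'."""
    soup = BeautifulSoup(html, 'html.parser')
    # .get('href') returns None instead of raising when the attribute is missing
    return [a.get('href')
            for a in soup.find_all('a', {'id': 'hrefRedirectProduct'})
            if a.get('href')]

# Invented sample markup, just to exercise the function:
sample = '''
<div>
  <a id="hrefRedirectProduct" href="/tr/product-1.html">P1</a>
  <a id="hrefRedirectProduct">no href</a>
  <a href="/other.html">other</a>
</div>
'''
print(extract_hrefs(sample))  # → ['/tr/product-1.html']
```

Keeping the parsing in a small function like this also makes it easy to test against saved page snapshots instead of a live browser.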

A better solution may be to go after the data at its source: the JSON endpoint behind the page.

Does this include your desired href?

import requests
import pandas as pd

url = 'https://www.stradivarius.com/itxrest/2/catalog/store/54009571/50331068/category/1390587/product?languageId=-43&appId=1'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()

df = pd.DataFrame(jsonData['products'])

Output:

print(df['productUrl'])
0                            kolej-sweatshirt-l06710711
1     oversize-hard-rock-cafe-baskl-sweatshirt-l0670...
2     oversize-hard-rock-cafe-baskl-sweatshirt-l0670...
3     oversize-hard-rock-cafe-kapusonlu-sweatshirt-l...
4                         fermuarl-sweatshirt-l06521718
                            ...                        
60     fermuarl-oversize-kapusonlu-sweatshirt-l06765643
61                   dikisli-basic-sweatshirt-l06519703
62    jogging-fit-pantolon-ve-sweatshirt-seti-l01174780
63                          naylon-sweatshirt-l08221191
64                   dikisli-basic-sweatshirt-l06519703
Name: productUrl, Length: 65, dtype: object
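If full links are needed, the relative productUrl slugs can be joined onto a base URL. The base path below is an assumption inferred from the category URL in the question, not something the API response confirms, so adjust it to match the site's real product-page structure:

```python
# Hypothetical helper: join the relative slugs returned by the API onto a
# base URL. BASE is an assumption inferred from the question's category URL.
BASE = 'https://www.stradivarius.com/tr/'

def to_absolute(slugs):
    return [BASE + slug for slug in slugs]

print(to_absolute(['kolej-sweatshirt-l06710711']))
# → ['https://www.stradivarius.com/tr/kolej-sweatshirt-l06710711']
```

The same can be done directly on the column, e.g. `df['productUrl'].map(lambda s: BASE + s)`.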