I have three related problems.
First, on the page https://www.maison-objet.com/paris/les-exposants, I would like to read the "href" attribute of the "a" element for "BRITOP LIGHTING POLAND".
This is what I wrote:
from requests_html import HTMLSession
url = 'https://www.maison-objet.com/paris/les-exposants'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)
products = r.html.xpath('//*[@]/h3/a').__getattribute__("href")
print(products)
I get this error:
AttributeError: 'list' object has no attribute 'href'
Second, if I copy the XPath of "BRITOP LIGHTING POLAND", I get:
//*[@id="resultatsFiltres"]/div/div/div[1]/div/div/div[2]/h3/a
I don't understand why it is different from the one I wrote.
Third, this doesn't work either, and I don't understand why:
products = r.html.find('.descBloc')[1]
print(products)
But I get:
IndexError: list index out of range
CodePudding user response:
Sorry, I cannot comment because my reputation is below 50.
For issue #1, it seems the call returns a list, so the attribute has to be read from each item instead of from the list:
products = r.html.xpath('//*[@]/h3/a')
for item in products:
    print(item.attrs.get('href'))
For issue #3, can you check the type and the length of what find() returns?
type(r.html.find('.descBloc'))
len(r.html.find('.descBloc'))
If it is a list with fewer than two elements, index [1] is out of range.
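Both errors can be reproduced without the site. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree in place of requests_html, and the HTML fragment, the link path and the nth_or_none helper are made up for the example. In requests_html itself, each element in the returned list exposes its attributes through .attrs.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for the rendered page (not the real markup).
SAMPLE = """
<div>
  <div class="descBloc"><h3><a href="/paris/les-exposants/example-one">EXAMPLE ONE</a></h3></div>
</div>
"""

root = ET.fromstring(SAMPLE)

# findall() returns a list of elements, much like r.html.xpath() and r.html.find().
links = root.findall(".//div[@class='descBloc']/h3/a")

# Attributes belong to each element, not to the list, so read them per item.
hrefs = [a.get("href") for a in links]
print(hrefs)


def nth_or_none(items, index):
    """Hypothetical helper: return items[index], or None if the list is too short."""
    return items[index] if 0 <= index < len(items) else None


# Only one match exists here, so links[1] would raise IndexError.
second = nth_or_none(links, 1)
print(len(links), second)
```

If the list comes back empty or shorter than expected, the selector did not match the HTML that was actually received, which is worth checking before indexing.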
CodePudding user response:
It seems requests_html has a problem loading this page. Maybe the server detects a bot/script and sends different content, or the page uses JavaScript that requests_html can't execute.
The only code that works for me uses Selenium, which controls a real web browser.
from selenium import webdriver
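#from selenium.webdriver.chrome.service import Service   # use this one with Chrome
from selenium.webdriver.firefox.service import Service    # matches webdriver.Firefox below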
from selenium.webdriver.common.by import By
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
url = 'https://www.maison-objet.com/en/paris/les-exposants'
#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))
driver.get(url)
all_items = driver.find_elements(By.XPATH, '//*[@]/h3/a')
print('len(all_items):', len(all_items))
for item in all_items:
    print('text:', item.text)
    print('url :', item.get_attribute('href'))
    print('---')
Result:
len(all_items): 51
text: BRITOP LIGHTING POLAND
url : https://www.maison-objet.com/paris/les-exposants/britop-lighting-poland-today
---
text: FEELGOOD DESIGNS
url : https://www.maison-objet.com/paris/les-exposants/feelgood-designs-today
---
text: KASZER
url : https://www.maison-objet.com/paris/les-exposants/kaszer-fashion-accessories
---
text:
url : https://www.maison-objet.com/paris/les-exposants/balma-capoduri-c-s-p-a-smart-gift
---
text:
url : https://www.maison-objet.com/paris/les-exposants/goodwill-m-g-home-accessories
---
# ...
But the result shows another problem: it gets text only for visible elements. The page may use lazy loading and add elements when the user scrolls the page (and when the elements become visible in the window). It may need some JavaScript code to scroll the page (driver.execute_script(...)).
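A minimal sketch of that scrolling idea, assuming the page really does load more items as it is scrolled (I have not verified this for this site). The names scroll_to_bottom and element_text, the pause, and the round limit are my own choices; only execute_script(), .text and get_attribute() are Selenium calls.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing or max_rounds is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom so lazy-loaded items (if any) get requested.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to append new elements
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added
        last_height = new_height
    return last_height


def element_text(item):
    """Return the visible text, falling back to the element's raw textContent."""
    return item.text or (item.get_attribute("textContent") or "").strip()
```

It would be called as scroll_to_bottom(driver) right after driver.get(url) and before find_elements(), and element_text(item) could replace item.text in the loop so that links outside the visible window still print a name.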
EDIT:
I had to add /en/ to the URL to get the page in English instead of French.