Python - Xpath with module requests_html to get value of "a href"

Time:08-11

I have three problems, that are kind of related:

On the page https://www.maison-objet.com/paris/les-exposants, I would like to get the href attribute of the "a" element for "BRITOP LIGHTING POLAND".

This is what I wrote:

from requests_html import HTMLSession

url = 'https://www.maison-objet.com/paris/les-exposants'

s = HTMLSession()
r = s.get(url)

r.html.render(sleep=1)

products = r.html.xpath('//*[@]/h3/a').__getattribute__("href")

print(products)

I get this error

AttributeError: 'list' object has no attribute 'href'

Second thing I notice: if I try to copy the XPath of "BRITOP LIGHTING POLAND", I get

//*[@id="resultatsFiltres"]/div/div/div[1]/div/div/div[2]/h3/a

I don't understand why it is different

Third thing that doesn't work and I don't understand is:

products = r.html.find('.descBloc')[1]
print(products)

But I get

"IndexError: list index out of range"

CodePudding user response:

sorry I cannot comment due to my reputation < 50.

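For issue #1: xpath() returns a list of elements, so calling __getattribute__("href") on the list itself fails. Loop over the list and read href from each element's attrs instead:

products = r.html.xpath('//*[@]/h3/a')
for item in products:
    print(item.attrs.get('href'))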
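
To see why the error appears without contacting the site, here is a small offline sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for requests_html, and the markup is invented for illustration (the real page structure may differ). The point is only that an XPath query gives back a list, and href lives on each element in that list.

```python
import xml.etree.ElementTree as ET

# Invented markup for illustration only; the real page may be structured differently.
SAMPLE = """
<div id="results">
  <div class="descBloc"><h3><a href="/exhibitors/first">FIRST</a></h3></div>
  <div class="descBloc"><h3><a href="/exhibitors/second">SECOND</a></h3></div>
</div>
"""

root = ET.fromstring(SAMPLE)

# The query returns a list of matching elements, even if only one matches.
links = root.findall(".//div[@class='descBloc']/h3/a")

# The list has no href of its own; each element inside it does.
hrefs = [a.get("href") for a in links]
print(hrefs)
```

With requests_html the per-element access is item.attrs['href'] rather than a.get('href'), but the list-versus-element distinction is the same.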

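For the third issue, can you check the type and length of the result?

results = r.html.find('.descBloc')
print(type(results), len(results))

find() returns a list, so indexing is allowed. "list index out of range" means the list has fewer than two elements, i.e. the selector matched at most one element in the HTML that was actually loaded.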
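
A minimal guard against that IndexError could look like this. The helper name pick is hypothetical, and plain Python lists stand in for whatever find() returned:

```python
def pick(items, index, default=None):
    """Return items[index], or default when the index is out of range."""
    return items[index] if -len(items) <= index < len(items) else default


# Plain lists standing in for the result of find().
no_matches = []
two_matches = ["first bloc", "second bloc"]

print(pick(no_matches, 1))   # nothing matched, so the default comes back
print(pick(two_matches, 1))  # the second match exists
```

Getting the default back is a hint that the page content is not what you expected, which is worth checking before changing the selector.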

CodePudding user response:

It seems requests_html has trouble loading this page. Maybe the server detects a bot/script and sends different content, or the page uses JavaScript that requests_html can't execute.

The only code that worked for me uses Selenium, which controls a real web browser.

from selenium import webdriver
from selenium.webdriver.firefox.service import Service  # for Chrome: selenium.webdriver.chrome.service
from selenium.webdriver.common.by import By

#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager


url = 'https://www.maison-objet.com/en/paris/les-exposants'

#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))

driver.get(url)

all_items = driver.find_elements(By.XPATH, '//*[@]/h3/a')
print('len(all_items):', len(all_items))

for item in all_items:
    print('text:', item.text)
    print('url :', item.get_attribute('href'))
    print('---')

Result:

len(all_items): 51
text: BRITOP LIGHTING POLAND
url : https://www.maison-objet.com/paris/les-exposants/britop-lighting-poland-today
---
text: FEELGOOD DESIGNS
url : https://www.maison-objet.com/paris/les-exposants/feelgood-designs-today
---
text: KASZER
url : https://www.maison-objet.com/paris/les-exposants/kaszer-fashion-accessories
---
text: 
url : https://www.maison-objet.com/paris/les-exposants/balma-capoduri-c-s-p-a-smart-gift
---
text: 
url : https://www.maison-objet.com/paris/les-exposants/goodwill-m-g-home-accessories
---
# ...

But the result shows another problem: it gets text only for visible elements. The page may use lazy loading and add elements as the user scrolls (when they become visible in the window). It may need some JavaScript code to scroll the page (driver.execute_script(...)).
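
One possible way to do that scrolling is sketched below. It assumes the page loads more items when scrolled to the bottom, which I have not verified for this site. The function name scroll_to_bottom and the pause and max_rounds parameters are my own, and driver is any object with Selenium's execute_script method.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing or max_rounds is hit.

    Returns the number of scroll steps performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give lazily loaded content some time to arrive.
        time.sleep(pause)
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was added, so stop scrolling
        last_height = new_height
    return rounds
```

Call scroll_to_bottom(driver) after driver.get(url) and before driver.find_elements(...). Also note that item.text in Selenium returns only rendered, visible text; item.get_attribute('textContent') may return the name even for elements that are not visible yet.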


EDIT:

I had to add /en/ to the URL to get the page in English instead of French.
