I am trying to extract the review text from this page.
Here's a condensed version of the HTML shown in my Chrome browser inspector:
<div id="module_product_review" >
<div >
<div data-spm="ratings_reviews" lazada_pdp_review="expose" itemid="1615006548" data-nosnippet="true" data-aplus-ae="x1_490e4591" data-spm-anchor-id="a2o42.pdp_revamp.0.ratings_reviews.508466b1OJjCoH">
<div>...</div>
<div>...</div>
<div>
<div >
<div >
<div >...</div>
<div >...</div>
<div >
<div data-spm-anchor-id="a2o42.pdp_revamp.ratings_reviews.i3.508466b1OJjCoH">Slim and light. feel good. better if providing 16G version.</div>
<div >...</div>
<div >Color Family:MYSTIC SILVER</div>
<div >...</div>
<div ></div>
</div>
<div >...</div>
<div >...</div>
<div >...</div>
<div >...</div>
<div >...</div>
</div>
</div>
</div>
</div>
</div>
I'm trying to extract the "Slim and light. feel good. better if providing 16G version." text from the element.
But when I try to retrieve the id="module_product_review" element using Selenium in Python, this is what I get instead:
<div id="module_product_review">
<div >
<div >
<div >
</div>
</div>
</div>
</div>
This is my code:
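from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

op = webdriver.ChromeOptions()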
op.add_argument('--headless')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=op)
driver.get("https://www.lazada.sg/products/huawei-matebook-d14-laptop-14-fullview-display-intel-i5-processor-8gb512gb-intel-uhd-graphics-i1615006548-s7594078907.html?spm=a2o42.searchlist.list.3.15064828Od60kh&search=1&freeshipping=1")
module_product_review = driver.find_element(By.ID, "module_product_review")
html = module_product_review.get_attribute("outerHTML")
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
I thought it might be because I was retrieving the element before it had fully loaded, so I tried sleeping for 30 seconds before calling find_element(), but I still get the same result. As far as I can tell, it's not an issue of iframes or shadow roots either.
Is there some other issue that I'm missing?
CodePudding user response:
The element whose text you are trying to read is initially outside the visible viewport, and its content appears to be rendered only once it is scrolled into view. You first have to scroll that element into view.
Also, since you are working in headless mode, you should set the window size explicitly. The default window size in headless mode is much smaller than what we normally use.
You should also use explicit waits with expected conditions, so that you access elements only once they are ready.
This should work better:
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager

op = webdriver.ChromeOptions()
op.add_argument('--headless')
# Set the window size on the options before the driver is created
op.add_argument('--window-size=1920,1080')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=op)
wait = WebDriverWait(driver, 20)
actions = ActionChains(driver)
driver.get("https://www.lazada.sg/products/huawei-matebook-d14-laptop-14-fullview-display-intel-i5-processor-8gb512gb-intel-uhd-graphics-i1615006548-s7594078907.html?spm=a2o42.searchlist.list.3.15064828Od60kh&search=1&freeshipping=1")
# Wait until the review module exists in the DOM
element = wait.until(EC.presence_of_element_located((By.ID, "module_product_review")))
time.sleep(1)
# Scroll the review module into view
actions.move_to_element(element).perform()
module_product_review = wait.until(EC.visibility_of_element_located((By.ID, "module_product_review")))
# Now you can do what you want here
html = module_product_review.get_attribute("outerHTML")
Also, in order to find that specific element and get that specific text you could use something more precise, like this:
your_text = wait.until(EC.visibility_of_element_located((By.XPATH, "(//div[@id='module_product_review']//div[@class='item']//div[@class='content'])[1]"))).text
You can use this after scrolling the element into view, as described above.
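If a single scroll does not trigger the reviews to load, scrolling down in small steps and re-checking after each step may help. The following is only a minimal sketch under that assumption: scroll_until, is_ready, step_px, max_steps and pause are hypothetical names introduced for illustration, and the only driver method it relies on is execute_script.

```python
import time


def scroll_until(driver, is_ready, step_px=600, max_steps=30, pause=0.5):
    """Scroll down step by step until is_ready(driver) returns a truthy value.

    Hypothetical helper, not part of Selenium. Returns the truthy value,
    or None if nothing turned up after max_steps scrolls.
    """
    for _ in range(max_steps):
        result = is_ready(driver)
        if result:
            return result
        # window.scrollBy is a standard DOM call; the step size is passed as an argument
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        # Give the page a moment to load whatever the scroll triggered
        time.sleep(pause)
    # One last check after the final scroll
    return is_ready(driver) or None


# Possible usage with the driver and XPath from above (find_elements returns
# an empty list while nothing matches, so it works as the readiness check):
# reviews = scroll_until(driver, lambda d: d.find_elements(
#     By.XPATH, "//div[@id='module_product_review']//div[@class='item']//div[@class='content']"))
# if reviews:
#     print(reviews[0].text)
```

The readiness check is passed in as a function, so the same loop can be reused with a different locator if the class names on the page turn out to be different.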