Scrapy returns the text of an element in shell but not in the code

Time:02-18

I'm new to web scraping and have been learning Scrapy and Selenium for the last couple of days. I'm trying to extract some info from this source: https://www.kodda.co.kr/kr/information/member.php. To be specific, I need each company's name, CEO name, and email. So far I've been able to write code that clicks buttons to navigate between pages. Now I want to extract the info from the table on a given page.

For example, given the table on that page, I'd like to extract the text inside the first td cell. When I run this in the Scrapy shell: response.xpath('//table[@]/tbody/tr[1]/td[1]/text()').get() it returns the contents of the first td (which is what I want). But when I put the exact same code in a .py file and run it, it returns an empty string (""):

import scrapy

class CompanyInfoSpider(scrapy.Spider):
    name = 'company_info'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['www.kodda.co.kr']
    # each start URL needs exactly one scheme
    start_urls = ['https://www.kodda.co.kr/kr/information/member.php']

    def parse(self, response):
        print(response.xpath('//table[@]/tbody/tr[1]/td[1]/text()').get())

I tried the same thing using Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome(desired_capabilities=show_browser(False))
driver.get("https://www.kodda.co.kr/kr/information/member.php")
driver.implicitly_wait(10)

column_element = driver.find_element(By.XPATH, '//table[@]/tbody/tr[1]/td[1]')
column_text = column_element.text
time.sleep(10)

print(column_text)

But this also returns empty (""). I've been googling for hours but couldn't find any possible reason.

Note: I've also tried explicit wait:

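from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.support.ui import WebDriverWait

ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
wait = WebDriverWait(driver, 50, ignored_exceptions=ignored_exceptions)
wait.until(lambda wd: column_text != "")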

I also tried wait.until(expected_conditions.visibility_of_all_elements_located((By.CLASS_NAME, "sub-table"))), but both attempts still returned an empty string ("").

CodePudding user response:

Solved the issue! I tracked it down to the text property. For some reason that I didn't look into, text returns nothing when I use it to extract the text of this element. Using get_attribute("innerText") instead worked!
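A minimal sketch of that workaround, assuming the cell exists in the DOM but is hidden (Selenium's text only returns rendered, visible text, which would explain the empty string). The helper name element_text, the demo function, and the CSS selector are hypothetical and not from the original post.

```python
def element_text(element):
    """Return the element's text, falling back to innerText when .text is empty."""
    text = element.text
    if text:
        return text.strip()
    # get_attribute can return None, so normalise that to an empty string.
    return (element.get_attribute("innerText") or "").strip()


def demo():
    # Imported here so element_text can be used and tested without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.kodda.co.kr/kr/information/member.php")
        # Hypothetical selector: adjust it to the real table markup.
        cell = driver.find_element(By.CSS_SELECTOR, "table tbody tr td")
        print(element_text(cell))
    finally:
        driver.quit()
```

Calling demo() opens Chrome and prints the cell text. The fallback only kicks in when text is empty, so visible elements behave as before.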

CodePudding user response:

You are using the wrong locator.
This locator matches 10 elements on the page, but these elements are not visible, at least not the first one. Since you are using the driver.find_element method, it returns the first match for the given locator on the page.
Also, you should use explicit waits with Expected Conditions rather than implicitly_wait, since the latter waits only for the element to exist; it does not wait for the element to be completely rendered. So with implicitly_wait you get column_element at a stage when it is not yet fully rendered and not yet populated with the text content it will finally have.
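Both points can be sketched roughly as follows, assuming Selenium 4. The names first_visible and wait_for_visible_text are hypothetical helpers, and the locator passed in is whatever matches the real table; none of this is taken from the original post. Instead of taking the first match, the wait polls until some match is actually displayed.

```python
def first_visible(elements):
    """Return the first element that is displayed, or None if all are hidden."""
    for element in elements:
        if element.is_displayed():
            return element
    return None


def wait_for_visible_text(driver, locator, timeout=20):
    """Poll until any element matching locator is displayed, then return its text."""
    # Imported here so first_visible can be used and tested without Selenium.
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # until() keeps polling while the callable returns a falsy value (None here)
    # and returns the first truthy result, i.e. the visible element.
    cell = wait.until(lambda d: first_visible(d.find_elements(*locator)))
    return cell.text
```

A call would look like wait_for_visible_text(driver, (By.XPATH, "..."), timeout=20), with the XPath filled in for the real table. If no match becomes visible within the timeout, WebDriverWait raises TimeoutException.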
