I'm new to web scraping and have been learning Scrapy and Selenium for the last couple of days. I'm trying to extract some info from this source:
https://www.kodda.co.kr/kr/information/member.php
To be specific, I need the company name, CEO name, and email. So far I've been able to write code that clicks buttons to navigate between pages. Now I want to extract the info from the table on a given page.
For example, given this table:
[Screenshot of the table's HTML code]
I'd like to extract the text inside the first td tag. When I run this in the Scrapy shell:
response.xpath('//table[@]/tbody/tr[1]/td[1]/text()').get()
it returns what's inside the first td (which is what I want). But when I put this exact code in a .py file and run it, it returns empty (""):
import scrapy

class CompanyInfoSpider(scrapy.Spider):
    name = 'company_info'
    allowed_domains = ['https://www.kodda.co.kr/kr/information/member.php']
    start_urls = ['http://https://www.kodda.co.kr/kr/information/member.php/']

    def parse(self, response):
        print(response.xpath('//table[@]/tbody/tr[1]/td[1]/text()').get())
I tried the same thing using Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(desired_capabilities=show_browser(False))
driver.get("https://www.kodda.co.kr/kr/information/member.php")
driver.implicitly_wait(10)
column_element = driver.find_element(By.XPATH, '//table[@]/tbody/tr[1]/td[1]')
column_text = column_element.text
time.sleep(10)
print(column_text)
But this also returns empty (""). I've been googling for hours but couldn't find any possible reason.
Note: I've also tried an explicit wait:
ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
wait = WebDriverWait(driver, 50, ignored_exceptions=ignored_exceptions)
wait.until(lambda wd: column_text != "")
I also attempted:
wait.until(expected_conditions.visibility_of_all_elements_located((By.CLASS_NAME, "sub-table")))
but these also returned empty ("").
CodePudding user response:
Solved the issue! I tracked it down to the text property. For some reason that I didn't look into, text returns an empty string when I use it to extract the text of this element. Instead, get_attribute("innerText") worked!
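For anyone landing here, a minimal sketch of that fallback. The helper name element_text is made up for illustration, and it only assumes the object behaves like a Selenium WebElement (a text property plus a get_attribute() method):

```python
def element_text(element):
    """Return the element's text, falling back to innerText when .text is empty.

    `element` is assumed to behave like a Selenium WebElement: it exposes a
    `.text` property and a `get_attribute(name)` method.
    """
    visible_text = (element.text or "").strip()
    if visible_text:
        return visible_text
    # .text came back empty, so read the innerText DOM property instead.
    return (element.get_attribute("innerText") or "").strip()


# Hypothetical usage, with column_element located as in the question:
# print(element_text(column_element))
```

Trying text first keeps the normal behaviour for visible elements and only reaches for innerText when nothing came back.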
CodePudding user response:
You are using the wrong locator.
This locator matches 10 elements on the page, but they are not all visible; at least the first one is not. Because driver.find_element returns only the first match for the locator you pass, you get that hidden element, and Selenium's text returns an empty string for an element that is not displayed.
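One way to act on this is to look at every match instead of only the first. The sketch below is not tied to this particular page: first_displayed is a hypothetical helper name, and it relies only on find_elements() and is_displayed(), which Selenium's driver and WebElement provide:

```python
def first_displayed(driver, by, value):
    """Return the first element matching (by, value) that is displayed, or None."""
    for element in driver.find_elements(by, value):
        # Skip matches that exist in the DOM but are not rendered on screen.
        if element.is_displayed():
            return element
    return None


# Hypothetical usage with the locator from the question:
# cell = first_displayed(driver, By.XPATH, xpath_from_question)
# print(cell.text if cell is not None else "no visible match")
```

If this returns None, every match is hidden, and a more specific locator is probably needed.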
Also, you should use explicit waits with Expected Conditions rather than implicitly_wait, since the latter waits only for the element to exist; it does not wait for the element to be completely rendered. With implicitly_wait you can get column_element at a stage when it is still not fully rendered and not yet populated with the text content it will finally have.
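A rough sketch of such an explicit wait, using a custom condition rather than one of the built-in Expected Conditions. The names cell_text_present and wait_for_cell_text are hypothetical, the 10-second timeout is arbitrary, and the locator is whatever you settle on for the cell:

```python
def cell_text_present(by, value):
    """Build a wait condition that returns the first non-empty visible text
    among the elements matching (by, value), or False if there is none yet."""
    def _condition(driver):
        for element in driver.find_elements(by, value):
            text = (element.text or "").strip()
            if text:
                return text
        return False
    return _condition


def wait_for_cell_text(driver, by, value, timeout=10):
    """Poll until a matching element has text; raises TimeoutException otherwise."""
    # Imported here so the condition above can be used without Selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait
    return WebDriverWait(driver, timeout).until(cell_text_present(by, value))


# Hypothetical usage:
# column_text = wait_for_cell_text(driver, By.XPATH, xpath_from_question)
```

Unlike the lambda in the question, which compares a string captured before the wait started, this condition queries the page again on every poll.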