I am trying to get the value of an element that renders text after a dropdown is clicked. I am currently using implicitly_wait() to make sure the element appears, but when I run the script, the .text call returns empty strings. If I run the script slowly, one line at a time, the .text values populate. Based on this I assume that I have to wait for the text to render, but I can't work out how to do this.
Looking at the expected conditions documentation, all of the text_to_be_present_... conditions want me to know what text I am waiting for. Since I am web scraping, I don't know this, so I am trying to pass a regex to the text_ argument that matches a generic form of the value I am looking for. This does not give the expected result: the value is still an empty string when I run the script.
Here is the code I am trying:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
#Set the options for running selenium as headless
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
#Create the driver object
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.implicitly_wait(10)
output = []
driver.get(html)
nat_res_element = driver.find_element_by_xpath('//*[@id="accordion-theme"]/div[1]/div[1]/span')
nat_res_element.click()
element = WebDriverWait(driver, 10).until(EC.text_to_be_present_in_element_value(locator = By.xpath('//*[@id="collapse0"]/div/div/ul/li/span[2]'), text_ = '[\d].*'))
output.append(element.text)
The URL is https://projects.worldbank.org/en/projects-operations/project-detail/P159382 and I am trying to access the values under the 'Environment and Natural Resource Management' dropdown. Since each value has the form digit, digit, %, I am trying the regex [\d].*
I would welcome a way to handle this.
CodePudding user response:
climate_change = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '(//*[@]//li//span)[2]'))).text
adaptation = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '(//*[@]//li//span)[4]'))).text
mitigation = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '(//*[@]//li//span)[6]'))).text
The XPath expressions above pull the desired data from the 'Environment and Natural Resource Management' dropdown.
This works fine with a non-headless browser.
Full Script:
import time  # needed for the time.sleep() call below
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("--window-size=1920,1200")
#options.add_argument("--headless")
s = Service("./chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=s, options=options)
url = 'https://projects.worldbank.org/en/projects-operations/project-detail/P159382'
driver.get(url)
time.sleep(5)
nat_res_element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="accordion-theme"]/div[1]/div[1]/span')))
nat_res_element.click()
data=[]
climate_change = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '(//*[@]//li//span)[2]'))).text
adaptation = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '(//*[@]//li//span)[4]'))).text
mitigation = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '(//*[@]//li//span)[6]'))).text
data.append({
'Climate change':climate_change,
'Adaptation':adaptation,
'Mitigation':mitigation
})
print(data)
driver.quit()
Output:
[{'Climate change': '64%', 'Adaptation': '32%', 'Mitigation': '32%'}]
CodePudding user response:
I usually like to combine Selenium with BeautifulSoup. Thanks for sharing all the details; this would be my approach:
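from bs4 import BeautifulSoup
import pandas as pd
# driver is the Selenium WebDriver created as in the question
driver.get("https://projects.worldbank.org/en/projects-operations/project-detail/P159382")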
raw_source = driver.page_source
parsed = BeautifulSoup(raw_source,"html.parser")
variables = [x.text for x in parsed.find_all(class_='table-accordion-wrapper ta-block ng-star-inserted')[0].find_all(class_='proj-theme')]
values = [x.text for x in parsed.find_all(class_='table-accordion-wrapper ta-block ng-star-inserted')[0].find_all(class_='proj-theme-percentage')]
df = pd.DataFrame({'variables':variables,'values':values})
print(df)
Returns:
variables values
0 Climate change 64%
1 Adaptation 32%
2 Mitigation 32%
The first find_all() accesses the Theme table, which contains 4 expandable tables. Since we only want the first one, I index with [0] after the first find_all() (if you'd like the values from the other tables, you can use a nested list comprehension).
The second find_all() iterates over the rows in the subtable, accessing Climate change, Adaptation and Mitigation.
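The nested list comprehension mentioned above could look like the following sketch. The HTML string is a made-up, simplified stand-in for driver.page_source (the class name ta-block and the "Other theme" row are assumptions for illustration only), so adapt the selectors to the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for driver.page_source.
html = """
<div class="ta-block">
  <span class="proj-theme">Climate change</span><span class="proj-theme-percentage">64%</span>
  <span class="proj-theme">Adaptation</span><span class="proj-theme-percentage">32%</span>
</div>
<div class="ta-block">
  <span class="proj-theme">Other theme</span><span class="proj-theme-percentage">10%</span>
</div>
"""

parsed = BeautifulSoup(html, "html.parser")

# Outer loop: one entry per expandable table.
# Inner loop: pair each theme name with its percentage inside that table.
tables = [
    [
        (name.text, pct.text)
        for name, pct in zip(
            block.find_all(class_="proj-theme"),
            block.find_all(class_="proj-theme-percentage"),
        )
    ]
    for block in parsed.find_all(class_="ta-block")
]

print(tables)
```

Each inner list then holds the (variable, value) pairs of one table, which can be turned into a DataFrame per table if needed.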
You can of course manipulate the result further to get a format you'd like, such as:
df = df.set_index('variables').T
Returning:
variables Climate change Adaptation Mitigation
values 64% 32% 32%
CodePudding user response:
text_to_be_present_in_element_value()
text_to_be_present_in_element_value() is the expectation for checking if the given text is present in the element's value. It is defined as:
def text_to_be_present_in_element_value(locator, text_):
    """
    An expectation for checking if the given text is present in the element's value.
    locator, text
    """

    def _predicate(driver):
        try:
            element_text = driver.find_element(*locator).get_attribute("value")
            return text_ in element_text
        except StaleElementReferenceException:
            return False

    return _predicate
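As a small illustration of the text_ in element_text line in the predicate above: Python's in operator on strings is a literal substring test, so a pattern such as [\d].* is never interpreted as a regex. The snippet uses the sample value 64% from the question; re.search is shown only for contrast.

```python
import re

text = "64%"          # sample value shown on the page
pattern = r"[\d].*"   # regex from the question

# `in` looks for the literal characters "[\d].*" inside the text.
literal_hit = pattern in text
# re.search interprets the pattern as a regular expression.
regex_hit = re.search(pattern, text) is not None

print(literal_hit, regex_hit)  # False True
```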
This use case
You need to consider a couple of things here:
- The expected condition text_to_be_present_in_element_value() checks whether the given text is present in the element's value attribute, not in its text / innerText, which is where 64% appears.
- Expected conditions don't support regex, so the supplied regex [\d].* is treated as a literal string.
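If you do want to wait on a pattern rather than a known string, one option is to pass WebDriverWait.until() your own callable, since until() simply calls it with the driver until it returns something truthy. A minimal sketch, assuming the value lives in the element's .text; the class name text_matches and the pattern \d+% are hypothetical and not part of Selenium:

```python
import re


class text_matches:
    """Hypothetical custom wait condition (not a Selenium built-in).

    Returns the element's text once it matches `pattern`, otherwise False
    so that WebDriverWait keeps polling.
    """

    def __init__(self, locator, pattern):
        self.locator = locator              # e.g. (By.XPATH, "...")
        self.pattern = re.compile(pattern)

    def __call__(self, driver):
        text = driver.find_element(*self.locator).text
        return text if self.pattern.search(text) else False


# Usage with a real driver (needs selenium plus the By and WebDriverWait imports):
#   value = WebDriverWait(driver, 10).until(
#       text_matches((By.XPATH, '//*[@id="collapse0"]/div/div/ul/li/span[2]'), r"\d+%")
#   )
```

Because the condition returns the matched text itself, until() hands back the string directly and there is no need to call .text afterwards.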
Solution
To extract the text 64%, ideally you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR and text attribute:
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div#collapse0 ul.twolevel li.firstlevel span.proj-theme span"))).text)
Using XPATH and get_attribute("innerHTML"):
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[.='Climate change']//following::span[1]"))).get_attribute("innerHTML"))
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python