Here is part of the HTML of the page I'm trying to parse (it's a bookstore):
<tr><tr>
<tr><tr>
<tr><tr>
<tr><tr>
<tr><tr>
<tr>
<td width="300" class="highlight">
<b>Издатель:</b>
Додо Пресс,Фантом Пресс
</td>
</tr>
<tr><tr>
<tr><tr>
<tr><tr>
I need to get the text that follows
<b>Издатель:</b>
(translation - Publisher)
First I used next_sibling
from BeautifulSoup. It worked fine, but on other books' pages on the same site the publisher element isn't always in the same place, so my chain of next siblings doesn't land on the right part of the book description.
I tried to locate the exact text 'Издатель:' with Selenium
pubs = driver.find_element(By.XPATH, "//*[text()='Издатель:']")
and it did the job: I got the element with the text 'Издатель:'. After that I tried to locate the next element after 'Издатель:', because the text I need always comes right after it.
following-sibling
in Selenium doesn't work because the publisher's name isn't wrapped in a tag and has no class; it is bare text.
I also tried running JS
pubs = driver.find_element(By.XPATH, "//*[text()='Издатель:']")
pub = driver.execute_script("""
return arguments[0].nextElement""", pubs)
pub = driver.execute_script("return document.evaluate('// [text()='Издатель:']/following-sibling::text()[1]'), document, null, XPathResult.FIRST_ORDERED_NODE_TYPE,null).singleNodeValue.textContent;")
That also didn't work.
The publisher label doesn't have any sibling or child element, so I don't know how to get the text following it.
Site URL - https://www.bgshop.ru/Catalog/GetFullDescription?id=10652263&type=1
CodePudding user response:
The text Додо Пресс,Фантом Пресс is within a text node, which Selenium locators cannot return directly, so you have to use execute_script().
Locate the parent <td> of the label, inducing WebDriverWait, and read its last child node using the following locator strategy:
Code Block:
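from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.bgshop.ru/Catalog/GetFullDescription?id=10652263&type=1")
# Click the collapsed link once it is clickable.
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.collapsed"))).click()
# Find the <td> that contains the label; its last child is the text node with the publisher.
cell = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[text()='Издатель:']/ancestor::td[1]")))
print(driver.execute_script("return arguments[0].lastChild.textContent;", cell).strip())
driver.quit()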
Console Output:
Додо Пресс,Фантом Пресс
References
You can find relevant detailed discussions in:
- How to extract just the number from html?
- How to extract text from webdriver elements found through xpath using Selenium and Python
- How do I use selenium to scrape text from a text node within a class through Python
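Since the question started from BeautifulSoup, the same lookup can also be sketched without a browser: find the <b> tag by its text instead of by position, then read the node that follows it. This is only a minimal sketch; the HTML string is a stand-in modelled on the snippet in the question, and text_after_label is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Stand-in markup modelled on the snippet in the question (not the live page).
HTML = """
<table>
<tr></tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Издатель:</b>
Додо Пресс,Фантом Пресс
</td>
</tr>
<tr></tr>
</table>
"""


def text_after_label(html, label):
    """Return the text node that follows <b>label</b>, or None if absent.

    Hypothetical helper: it locates the label by its text, so the row's
    position in the table does not matter.
    """
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("b", string=label)
    if tag is None or tag.next_sibling is None:
        return None
    # The sibling is a bare text node with surrounding whitespace.
    return str(tag.next_sibling).strip()


publisher = text_after_label(HTML, "Издатель:")
print(publisher)
```

Because the label is matched by its text, a single next_sibling step is enough and no chain of siblings is needed.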
CodePudding user response:
You can achieve this with the JavaScript code below. Select every b
element, get its parent element, and read the parent's innerText
property (note that it includes the label text as well as the value):
document.querySelectorAll('b').forEach( element => {
console.log(element.parentElement.innerText)
})
<table>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Publisher:</b>
name 1
</td>
</tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Publisher:</b>
name 2
</td>
</tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Publisher:</b>
name 3
</td>
</tr>
</table>
If there are other b
tags on the page, you can check with an if statement whether the content of b
is the publisher label, like below:
document.querySelectorAll('b').forEach( element => {
if(element.innerText == 'Publisher:'){
console.log(element.parentElement.innerText);
}
})
<table>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Publisher:</b>
name 1
</td>
</tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Date:</b>
Date 1
</td>
</tr>
<tr></tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Publisher:</b>
name 2
</td>
</tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr>
<td width="300" class="highlight">
<b>Publisher:</b>
name 3
</td>
</tr>
</table>
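The parent-text idea above can be carried over to Python with Selenium: read the visible text of the <td> that contains the label and cut off the label itself. The following is a rough sketch under assumptions: value_after_label is a hypothetical helper, and the sample string merely stands in for what cell.text might return on the real page.

```python
def value_after_label(cell_text: str, label: str) -> str:
    """Return whatever follows `label` in a cell's visible text.

    Hypothetical helper for illustration; raises ValueError if the
    label is not present in the text.
    """
    _, found, rest = cell_text.partition(label)
    if not found:
        raise ValueError(f"label {label!r} not found in cell text")
    return rest.strip()


# With a live driver, the cell text could be obtained like this (assumed XPath):
#   cell = driver.find_element(By.XPATH, "//td[b[text()='Издатель:']]")
#   publisher = value_after_label(cell.text, "Издатель:")

# Stand-in for cell.text, modelled on the markup in the question.
sample_cell_text = "Издатель:\nДодо Пресс,Фантом Пресс"
publisher = value_after_label(sample_cell_text, "Издатель:")
print(publisher)
```

This avoids execute_script() entirely, at the cost of relying on the label text appearing before the value inside the same cell.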