Trying to get following element (text) without class tag etc

Time:12-04

Here's the HTML of the page I'm trying to parse (it's a bookstore). Part of the page code:

<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr></tr>
<tr>
    <td width="300" class="highlight">
        <b>Издатель:</b>
         Додо Пресс,Фантом Пресс 
    </td>
</tr>
<tr></tr>
<tr></tr>
<tr></tr>

I need to get the text that follows

<b>Издатель:</b> (translation: Publisher)

First I used next_sibling from BeautifulSoup. It worked fine, but on other books' pages on the same site the publisher element isn't always in the same place, so my chain of next_sibling calls doesn't reach the right part of the book description.

I tried to locate the exact text 'Издатель:' with Selenium

pubs = driver.find_element(By.XPATH, "//*[text()='Издатель:']")

and it did the job: I got the text 'Издатель:'. After that I tried to locate the next element following 'Издатель:', because the text I need is always located right after it.

following-sibling in Selenium doesn't work, because the publisher's name has no tag, class, etc. of its own.

I also tried running JS

pubs = driver.find_element(By.XPATH, "//*[text()='Издатель:']")
pub = driver.execute_script("""
    return arguments[0].nextElement""", pubs)
pub = driver.execute_script("return document.evaluate('// [text()='Издатель:']/following-sibling::text()[1]'), document, null, XPathResult.FIRST_ORDERED_NODE_TYPE,null).singleNodeValue.textContent;")

That also didn't work.

The publisher element doesn't have any sibling or child element, so I don't know how to get the text following it.

Site URL - https://www.bgshop.ru/Catalog/GetFullDescription?id=10652263&type=1

CodePudding user response:

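The text Додо Пресс,Фантом Пресс is a text node, not an element, so find_element() cannot return it directly. One way to read it is execute_script(), after waiting with WebDriverWait for the elements involved (element_to_be_clickable() for the link that expands the description, visibility_of_element_located() for the table cell). For example, with the following locator strategy: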

  • Code Block:

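    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    # `driver` is the WebDriver instance created earlier, as in the question
    driver.get("https://www.bgshop.ru/Catalog/GetFullDescription?id=10652263&type=1")
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.collapsed"))).click()
    print(driver.execute_script('return arguments[0].lastChild.textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[text()='Издатель:']/ancestor::td[1]")))).strip())
    driver.quit()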
    
  • Console Output:

    Додо Пресс,Фантом Пресс
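
A slightly more targeted variant, as a sketch under assumptions: instead of relying on the publisher name being the last child of the td, start from the b label itself and walk its nextSibling chain until a non-empty text node turns up. The names JS_NEXT_TEXT and text_after are made up for illustration and are not part of Selenium; the only Selenium call used is execute_script().

```python
# JavaScript run in the page: walk the siblings after the label element
# and return the first text node that is not just whitespace.
JS_NEXT_TEXT = """
let node = arguments[0].nextSibling;
while (node && !(node.nodeType === Node.TEXT_NODE && node.textContent.trim())) {
    node = node.nextSibling;
}
return node ? node.textContent.trim() : null;
"""


def text_after(driver, label_element):
    """Return the first non-empty text node following label_element, or None.

    `driver` is assumed to be a Selenium WebDriver and `label_element`
    a WebElement already located on the page (hypothetical helper).
    """
    return driver.execute_script(JS_NEXT_TEXT, label_element)


# Possible usage with the locator from the question (needs a live browser):
#   from selenium.webdriver.common.by import By
#   label = driver.find_element(By.XPATH, "//*[text()='Издатель:']")
#   print(text_after(driver, label))
```

Because the lookup is anchored on the label rather than on a position in the table, it should not matter which row the publisher ends up in on a given book page.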
    

CodePudding user response:

You can achieve this with the JavaScript code below. Select every b element, then get its parent element and read its innerText property (which returns the label together with the text that follows it).

document.querySelectorAll('b').forEach( element => {
  console.log(element.parentElement.innerText)
})
<table>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr>
      <td width="300" class="highlight">
          <b>Publisher:</b>
          name 1
      </td>
  </tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr>
      <td width="300" class="highlight">
          <b>Publisher:</b>
          name 2 
      </td>
  </tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr>
      <td width="300" class="highlight">
          <b>Publisher:</b>
           name 3 
      </td>
  </tr>
</table>

If there are other b tags, you can add an if statement that checks whether the content of the b element is the publisher label, like below:

document.querySelectorAll('b').forEach( element => {
  if(element.innerText == 'Publisher:'){
    console.log(element.parentElement.innerText);
  }
})
<table>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr>
      <td width="300" class="highlight">
          <b>Publisher:</b>
          name 1
      </td>
  </tr>
  <tr></tr>
  <tr>
      <td width="300" class="highlight">
          <b>Date:</b>
          Date 1
      </td>
  </tr>
  <tr></tr>
  <tr></tr>
  <tr>
      <td width="300" class="highlight">
          <b>Publisher:</b>
          name 2 
      </td>
  </tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr></tr>
  <tr>
      <td width="300" class="highlight">
          <b>Publisher:</b>
           name 3 
      </td>
  </tr>
</table>
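
Since the question started out with BeautifulSoup, the same label check can also be sketched in Python without a browser. This assumes the page HTML is already available as a string (for example from driver.page_source). The sample markup and the helper name texts_after_label are made up for illustration. Unlike innerText on the parent, next_sibling yields only the text after the label, not the label itself.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet above (an assumption, not the real page).
HTML = """
<table>
  <tr><td class="highlight"><b>Publisher:</b> name 1 </td></tr>
  <tr><td class="highlight"><b>Date:</b> Date 1 </td></tr>
  <tr><td class="highlight"><b>Publisher:</b> name 2 </td></tr>
</table>
"""


def texts_after_label(html, label):
    """Return the text that directly follows each <b> whose text equals label."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for b in soup.find_all("b", string=label):
        sibling = b.next_sibling  # the bare text after </b>, if there is any
        if isinstance(sibling, str) and sibling.strip():
            results.append(sibling.strip())
    return results


print(texts_after_label(HTML, "Publisher:"))
```

For the page in the question, the label would be 'Издатель:' instead of 'Publisher:'.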
