How to scrape content from different html tags

Time:04-09

Source website:

<div class="content">
<h1>Example 1</h1>
<p>Example 2</p>
<h3>Example 3</h3>
</div>

My Code:

content=driver.find_elements(By.XPATH,'//div[@id="content"]/h1')
full_content=""
for des in content:
    full_content += '\n\n' + des.text
    suggest=[page_link,full_content]
    print(suggest)

I don't want to scrape everything inside the 'content' class, only the text from certain tags such as h1 and h3, but I want all of that text collected in full_content. Can I do it with Selenium?

CodePudding user response:

Use class, not id: if your HTML example is correct, the XPath should be //div[@class="content"]/h1.

content=driver.find_elements(By.XPATH,'//div[@class="content"]/h1')

CodePudding user response:

The <h1> element is a direct child of its parent <div>.

To scrape the text from specific tags, you can use either of the following locator strategies:

  • innerText from <h1> tag:

    print(driver.find_element(By.XPATH, "//div[@class='content']/h1").text)
    
  • innerText from <h3> tag:

    print(driver.find_element(By.XPATH, "//div[@class='content']//h3").text)
    