How to scrape all p-tag and its corresponding h2-tag with selenium?

Time:08-29

I want to get the title and content of an article. Example page: https://facts.net/best-survival-movies/ For each h2[tcontent-title], I want to collect all of the p elements that belong to it.

The expected result is:

title=[title1, title2, title3]

content = [content1,content2,content3]

Each content entry should hold the text of all p elements under the matching title: content1 for title1, content2 for title2, and content3 for title3. Can you help me?

CodePudding user response:

The solution from your last question does not work here because some <p> elements are not siblings of the <h2> elements: they are nested in an <aside>, so the preceding-sibling lookup fails for them.

You could switch to the preceding axis instead, but that would also pick up the <p> elements inside the <aside>. To fix this, select the elements more specifically:

driver.find_elements(By.CSS_SELECTOR,'.single-title-desc-wrap p:not(aside>p)')
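
To see what excluding the <aside> paragraphs achieves without opening a browser, here is a minimal offline sketch using Python's standard-library html.parser. The markup in SAMPLE and the class name GroupByHeading are made up for illustration and are not taken from facts.net; the real page structure may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup that mimics the layout described above: <p> siblings
# of each <h2>, plus one <p> nested inside an <aside>.
SAMPLE = """
<div class="single-title-desc-wrap">
  <h2>Title 1</h2>
  <p>First paragraph.</p>
  <aside><p>Advertisement text.</p></aside>
  <p>Second paragraph.</p>
  <h2>Title 2</h2>
  <p>Third paragraph.</p>
</div>
"""


class GroupByHeading(HTMLParser):
    """Collect <p> text under the most recent <h2>, skipping <p> inside <aside>."""

    def __init__(self):
        super().__init__()
        self.data = {}        # heading text -> joined paragraph text
        self.current = None   # heading the parser is currently under
        self.tag = None       # 'h2' or 'p' while inside one of them
        self.aside_depth = 0  # greater than 0 while inside an <aside>

    def handle_starttag(self, tag, attrs):
        if tag == "aside":
            self.aside_depth += 1
        elif tag in ("h2", "p") and self.aside_depth == 0:
            self.tag = tag

    def handle_endtag(self, tag):
        if tag == "aside":
            self.aside_depth -= 1
        elif tag == self.tag:
            self.tag = None

    def handle_data(self, text):
        text = text.strip()
        if not text or self.tag is None:
            return
        if self.tag == "h2":
            self.current = text
            self.data[text] = ""
        elif self.current is not None:
            joined = self.data[self.current] + " " + text
            self.data[self.current] = joined.strip()


parser = GroupByHeading()
parser.feed(SAMPLE)
result = parser.data
print(result)
```

The paragraph inside the <aside> is dropped, and every other paragraph is attached to the heading that precedes it, which is the same grouping the CSS selector plus preceding-sibling lookup aims for in Selenium.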

Example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

driver.get('https://facts.net/best-survival-movies/')

# One empty entry per heading, keyed by the heading text
data = {e.text: '' for e in driver.find_elements(By.CSS_SELECTOR, '.single-title-desc-wrap h2')}

# Append each paragraph (except those inside an <aside>) to its nearest preceding h2
for p in driver.find_elements(By.CSS_SELECTOR, '.single-title-desc-wrap p:not(aside>p)'):
    heading = p.find_element(By.XPATH, './preceding-sibling::h2[1]').text
    data[heading] = data[heading] + ' ' + p.text

[{'title': x, 'content': y} for x, y in data.items()]
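
If you need the two parallel lists from the question rather than a list of dicts, something like the following sketch should work. The data dict here is a hypothetical stand-in for the one built by the scraper above, with placeholder values; it relies on Python 3.7+ dicts preserving insertion order, so the two lists stay aligned.

```python
# Hypothetical stand-in for the `data` dict built by the scraper above.
# The leading spaces mirror how the loop joins paragraphs with ' '.
data = {
    "title1": " content1a content1b",
    "title2": " content2a",
}

# Keys and values come out in insertion order, so index i of `title`
# matches index i of `content`.
title = list(data.keys())
content = [text.strip() for text in data.values()]

print(title)
print(content)
```

Note that using the heading text as a dict key assumes every heading on the page is unique; two identical headings would be merged into one entry.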