How to scrape all p-tag and corresponding h2-tag with selenium?

Time:08-29

I want to get title and content of article:

<h2><span>Title1</span></h2>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title2</span></h2>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title3</span></h2>
<p>text I want</p>
<p>text I want</p>

and the expected result is

title=[title1, title2, title3]
content = [content1,content2,content3]

where content1 holds all the <p> strings that follow Title1, content2 all the <p> strings that follow Title2, and content3 all the <p> strings that follow Title3.

CodePudding user response:

One approach is to build a dict keyed by the text of each <h2>, then iterate over the <p> elements and append each one's text to the entry for its nearest preceding heading, located with the XPath preceding-sibling::h2[1]:

data = dict((e.text,'') for e in driver.find_elements(By.CSS_SELECTOR,'h2'))

for p in driver.find_elements(By.CSS_SELECTOR,'p'):
    # Nearest <h2> that precedes this <p> among its siblings
    heading = p.find_element(By.XPATH, './preceding-sibling::h2[1]').text
    data[heading] += ' ' + p.text

print(data)
-> {'Title1': ' text I want text I want', 'Title2': ' text I want text I want text I want', 'Title3': ' text I want text I want'}
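
The loop above runs at least one XPath lookup per <p> and leaves a leading space in every value. Here is a minimal sketch of an alternative, assuming the headings and paragraphs can be fetched together in document order (for example with driver.find_elements(By.XPATH, '//h2 | //p'), which is untested against the real page): walk the elements once and remember the current heading. group_by_heading and FakeElement are hypothetical names used only for illustration; FakeElement stands in for a WebElement so the snippet runs without a browser.

```python
from collections import namedtuple

# Hypothetical stand-in for a Selenium WebElement: it carries only the
# two attributes the grouping logic reads (.tag_name and .text).
FakeElement = namedtuple("FakeElement", ["tag_name", "text"])


def group_by_heading(elements):
    """Group <p> texts under the most recent <h2>, in document order."""
    sections = {}
    current = None
    for el in elements:
        tag = el.tag_name.lower()
        if tag == "h2":
            current = el.text
            sections.setdefault(current, [])
        elif tag == "p" and current is not None:
            # Paragraphs that appear before the first heading are skipped.
            sections[current].append(el.text)
    # Join once at the end, so no leading space is produced.
    return {title: " ".join(parts) for title, parts in sections.items()}


# Same structure as the HTML in the question.
elements = [
    FakeElement("h2", "Title1"),
    FakeElement("p", "text I want"),
    FakeElement("p", "text I want"),
    FakeElement("h2", "Title2"),
    FakeElement("p", "text I want"),
    FakeElement("p", "text I want"),
    FakeElement("p", "text I want"),
    FakeElement("h2", "Title3"),
    FakeElement("p", "text I want"),
    FakeElement("p", "text I want"),
]

data = group_by_heading(elements)
print(data)
```

With a real driver you would pass the list returned by find_elements instead of the FakeElement list. Note that, like the dict approach above, this merges sections whose headings have identical text.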

Now you could simply assign your variables by using .keys() and .values():

title = list(data) ### or title = list(data.keys())
-> ['Title1', 'Title2', 'Title3']

content = list(data.values())
-> [' text I want text I want', ' text I want text I want text I want', ' text I want text I want']

However, I would recommend staying more structured and working with the dict itself or a list of dicts:

[{'title':x,'content':y} for x,y in data.items()]
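
As a small sketch of that idea, the snippet below starts from the dict printed earlier (copied in as a literal so it runs without a browser), builds the list of dicts, and then derives the two lists from the question. Calling .strip() on each value is my assumption that the leading space is unwanted; records is a hypothetical variable name.

```python
# Dict as produced by the loop above, copied from the printed output.
data = {
    'Title1': ' text I want text I want',
    'Title2': ' text I want text I want text I want',
    'Title3': ' text I want text I want',
}

# One record per section; strip() removes the leading space.
records = [{'title': t, 'content': c.strip()} for t, c in data.items()]

# The two parallel lists asked for in the question can still be
# derived from the records whenever they are needed.
title = [r['title'] for r in records]
content = [r['content'] for r in records]

print(records)
print(title)
print(content)
```

Keeping title and content together in one record avoids the two lists drifting out of sync if one of them is later filtered or sorted.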

Example

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

html = '''
<h2><span>Title1</span></h2>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title2</span></h2>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title3</span></h2>
<p>text I want</p>
<p>text I want</p>
'''

driver.get("data:text/html;charset=utf-8," + html)

data = dict((e.text,'') for e in driver.find_elements(By.CSS_SELECTOR,'h2'))
for p in driver.find_elements(By.CSS_SELECTOR,'p'):
    # Nearest <h2> that precedes this <p> among its siblings
    heading = p.find_element(By.XPATH, './preceding-sibling::h2[1]').text
    data[heading] += ' ' + p.text

[{'title':x,'content':y} for x,y in data.items()]

Output

[{'title': 'Title1', 'content': ' text I want text I want'},
 {'title': 'Title2', 'content': ' text I want text I want text I want'},
 {'title': 'Title3', 'content': ' text I want text I want'}]