I want to get the title and content of an article:
<h2><span>Title1</span></h2>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title2</span></h2>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title3</span></h2>
<p>text I want</p>
<p>text I want</p>
and the expected result is:
title = [title1, title2, title3]
content = [content1, content2, content3]
where content1 holds all the <p> strings under Title1, content2 all the <p> strings under Title2, and content3 all the <p> strings under Title3.
CodePudding user response:
One approach is to work with a dict whose keys are based on your <h2> elements, and to update its values while iterating over the <p> elements, finding each paragraph's heading with preceding-sibling::h2[1]:
data = dict((e.text, '') for e in driver.find_elements(By.CSS_SELECTOR, 'h2'))
for p in driver.find_elements(By.CSS_SELECTOR, 'p'):
    # nearest <h2> sibling that precedes this <p>
    heading = p.find_element(By.XPATH, './preceding-sibling::h2[1]').text
    data[heading] = data[heading] + ' ' + p.text
print(data)
-> {'Title1': ' text I want text I want', 'Title2': ' text I want text I want text I want', 'Title3': ' text I want text I want'}
Now you can simply assign your variables by using .keys() and .values():
title = list(data) ### or title = list(data.keys())
-> ['Title1', 'Title2', 'Title3']
content = list(data.values())
-> [' text I want text I want', ' text I want text I want text I want', ' text I want text I want']
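Each value above starts with a space, because the loop puts ' ' in front of every paragraph, including the first one. If that is unwanted, a minimal sketch is to strip the values when building the list. The dict literal below is simply the printed result from above, copied in so the snippet runs without a browser:

```python
# Dict as printed by the loop above (every value starts with a space).
data = {
    'Title1': ' text I want text I want',
    'Title2': ' text I want text I want text I want',
    'Title3': ' text I want text I want',
}

title = list(data)
# strip() drops the leading space left by the concatenation
content = [text.strip() for text in data.values()]

print(title)
print(content)
```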
However, I would recommend keeping the data structured and working with the dict itself or with a list of dicts:
[{'title':x,'content':y} for x,y in data.items()]
Example
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Selenium 4 expects the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
html = '''
<h2><span>Title1</span></h2>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title2</span></h2>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title3</span></h2>
<p>text I want</p>
<p>text I want</p>
'''
driver.get("data:text/html;charset=utf-8," + html)
data = dict((e.text,'') for e in driver.find_elements(By.CSS_SELECTOR,'h2'))
for p in driver.find_elements(By.CSS_SELECTOR, 'p'):
    # nearest <h2> sibling that precedes this <p>
    heading = p.find_element(By.XPATH, './preceding-sibling::h2[1]').text
    data[heading] = data[heading] + ' ' + p.text
[{'title':x,'content':y} for x,y in data.items()]
Output
[{'title': 'Title1', 'content': ' text I want text I want'},
{'title': 'Title2', 'content': ' text I want text I want text I want'},
{'title': 'Title3', 'content': ' text I want text I want'}]
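To try the grouping logic without starting a browser, the same idea can be sketched with Python's standard-library html.parser instead of Selenium. This swaps in a different technique and is not part of the answer above: the class name SectionParser and its attributes are made up for illustration, and it assumes the <h2> and <p> elements are not nested inside each other. It collects sections in a list rather than a dict, so two headings with the same text would stay separate, and it joins the paragraphs with ' '.join, so there is no leading space.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Hypothetical helper: pair each <h2> title with the <p> text after it."""

    def __init__(self):
        super().__init__()
        self.sections = []   # one {'title': ..., 'content': [...]} per <h2>
        self._tag = None     # 'h2' or 'p' while inside one, else None
        self._buffer = []    # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside an <h2> or <p> (nested <span> included)
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self.sections.append({"title": text, "content": []})
        elif self.sections:
            # A <p> belongs to the most recent <h2>; a <p> before any <h2> is skipped
            self.sections[-1]["content"].append(text)
        self._tag = None


html = '''
<h2><span>Title1</span></h2>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title2</span></h2>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2><span>Title3</span></h2>
<p>text I want</p>
<p>text I want</p>
'''

parser = SectionParser()
parser.feed(html)
result = [
    {"title": s["title"], "content": " ".join(s["content"])}
    for s in parser.sections
]
print(result)
```

With Selenium, the same string could be obtained from driver.page_source and passed to parser.feed().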