I have the HTML code below:
<div >
<div >
<div >
I need to extract the ID of each product from the class attribute using Beautiful Soup (31121, 31301 and 28416 are the IDs). How can I do that?
CodePudding user response:
- Select all the divs that have a class name starting with post-.
- Iterate over the class names of each div and keep the one that starts with post-.
- Add the post ID (the part after the hyphen) to the list.
import re
html_attr='''
<div >
<div >
<div >'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_attr, 'html.parser')
div_list = soup.find_all('div', {"class": re.compile("^post-")})
id_list = []
for div in div_list:
    # Keep the first class that starts with "post-" and take the part after the hyphen
    post_id = [name.split('-')[1] for name in div['class'] if name.startswith('post-')][0]
    id_list.append(post_id)
print(id_list)
Output
['31121', '31301', '28416']
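The steps above can also be wrapped in a small function, shown here as a minimal sketch. The class values in the sample markup (post-<id> product type-product) are assumed for illustration, and the names SAMPLE_HTML, POST_CLASS and extract_post_ids are hypothetical. The pattern is stricter than startswith: it only accepts a class of the form post- followed by digits.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: these class values are assumed for illustration.
SAMPLE_HTML = '''
<div class="post-31121 product type-product">A</div>
<div class="post-31301 product type-product">B</div>
<div class="post-28416 product type-product">C</div>
'''

# Matches a single class name such as "post-31121" and captures the digits.
POST_CLASS = re.compile(r"^post-(\d+)$")

def extract_post_ids(html):
    """Return the numeric part of the first post-<digits> class of each div."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    # class_ with a compiled regex is tested against each class name of the tag
    for div in soup.find_all("div", class_=POST_CLASS):
        for name in div.get("class", []):
            match = POST_CLASS.match(name)
            if match:
                ids.append(match.group(1))
                break  # one ID per div
    return ids

print(extract_post_ids(SAMPLE_HTML))
```

Using a capture group avoids the [0] index on a filtered list, so a div without a matching class is skipped instead of raising IndexError.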
CodePudding user response:
Iterate over your selection, extract each element's class attribute, iterate over its classes, and pick the class that starts with post-:
[c.split('-')[-1] for e in soup.select('div.type-product') for c in e['class'] if c.startswith('post-')]
or
[c.split('-')[-1] for e in soup.select('div[class*="post-"]') for c in e['class'] if c.startswith('post-')]
Example
html = '''
<div >
<div >
<div >
'''
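from bs4 import BeautifulSoup
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')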
[c.split('-')[-1] for e in soup.select('div.type-product') for c in e['class'] if c.startswith('post-')]
Output
['31121', '31301', '28416']
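One point worth noting about the second selector: [class*="post-"] is a substring match on the whole attribute value, so it can also select elements whose classes merely contain post-. The sketch below illustrates why the startswith filter is still needed. The markup, including the repost-7 class, is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second div only contains "post-" inside "repost-7".
html = '''
<div class="post-31121 product type-product"></div>
<div class="repost-7 widget"></div>
'''

soup = BeautifulSoup(html, "html.parser")

# The substring selector matches both divs.
matched = soup.select('div[class*="post-"]')

# The startswith filter keeps only real post-<id> classes.
ids = [c.split('-')[-1] for e in matched for c in e['class'] if c.startswith('post-')]

print(len(matched), ids)
```

If every product div carries a stable class such as type-product, selecting on that class is the narrower choice.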