Does anyone know how to extract the text from each span in a p tag using BeautifulSoup? I'm trying to figure this out in Python. I'm using a Craigslist car listing.
This is what I was able to accomplish so far:
#retrieve post spans
spans = soup.find_all(class_='attrgroup')
print(spans[1].prettify())
Ideally, I'm trying to create a dictionary. Example:
dict = {
"condition": "good",
"cylinders": "8 cylinders",
"drive": "4wd",
# etc.
}
OUTPUT
<p class="attrgroup"><span>condition:<b>good</b></span><br/><span>cylinders:<b>8 cylinders</b></span><br/><span>drive:<b>4wd</b></span><br/><span>fuel:<b>gas</b></span><br/><span>odometer:<b>138000</b></span><br/><span>paint color:<b>blue</b></span><br/><span>size:<b>full-size</b></span><br/><span>title status:<b>clean</b></span><br/><span>transmission:<b>automatic</b></span><br/><span>type:<b>pickup</b></span><br/></p>
CodePudding user response:
Try this:
from bs4 import BeautifulSoup
sample_html = """
<p class="attrgroup">
<span>
condition:
<b>
good
</b>
</span>
<br/>
<span>
cylinders:
<b>
8 cylinders
</b>
</span>
<br/>
<span>
drive:
<b>
4wd
</b>
</span>
<br/>
<span>
fuel:
<b>
gas
</b>
</span>
<br/>
<span>
odometer:
<b>
138000
</b>
</span>
<br/>
<span>
paint color:
<b>
blue
</b>
</span>
<br/>
<span>
size:
<b>
full-size
</b>
</span>
<br/>
<span>
title status:
<b>
clean
</b>
</span>
<br/>
<span>
transmission:
<b>
automatic
</b>
</span>
<br/>
<span>
type:
<b>
pickup
</b>
</span>
<br/>
</p>
"""
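# Split each span's text on the first colon only, so a value containing ":" stays intact.
your_text = [
i.get_text(strip=True).split(":", 1) for i
in BeautifulSoup(sample_html, 'html.parser').select("span")
]
# Each [label, value] pair becomes one dictionary entry.
print({k: v for k, v in your_text})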
Output:
{'condition': 'good', 'cylinders': '8 cylinders', 'drive': '4wd', 'fuel': 'gas', 'odometer': '138000', 'paint color': 'blue', 'size': 'full-size', 'title status': 'clean', 'transmission': 'automatic', 'type': 'pickup'}
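One caveat: a span with no colon in its text would make the unpacking in the dict comprehension above raise a ValueError. A minimal sketch of a more tolerant variant, using str.partition and skipping such spans, might look like this. The sample markup, the "some heading" span, the "note" attribute, and the function name parse_attrs are all made up for illustration and are not taken from the real listing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first span (no colon) and the "note" span
# (colon inside the value) are assumptions added to show the edge cases.
html = (
    '<p class="attrgroup">'
    "<span>some heading</span><br/>"
    "<span>condition: <b>good</b></span><br/>"
    "<span>note:<b>call after 5:00</b></span><br/>"
    "</p>"
)

def parse_attrs(markup):
    """Return a {label: value} dict for every span that contains a colon."""
    soup = BeautifulSoup(markup, "html.parser")
    attrs = {}
    for span in soup.select("span"):
        # partition splits on the first colon only; sep is "" if none was found
        key, sep, value = span.get_text(strip=True).partition(":")
        if sep:
            attrs[key.strip()] = value.strip()
    return attrs

result = parse_attrs(html)
print(result)  # {'condition': 'good', 'note': 'call after 5:00'}
```

Spans without a colon are silently dropped here; whether that is the right behaviour depends on what the page actually contains.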
CodePudding user response:
You could use stripped_strings if the pattern is always the same, i.e. each span holds exactly one label followed by one value.
Example:
from bs4 import BeautifulSoup
html='''<p class="attrgroup"><span>condition:<b>good</b></span><br/><span>cylinders:<b>8 cylinders</b></span><br/><span>drive:<b>4wd</b></span><br/><span>fuel:<b>gas</b></span><br/><span>odometer:<b>138000</b></span><br/><span>paint color:<b>blue</b></span><br/><span>size:<b>full-size</b></span><br/><span>title status:<b>clean</b></span><br/><span>transmission:<b>automatic</b></span><br/><span>type:<b>pickup</b></span><br/></p>'''
# name the parser explicitly so bs4 does not have to guess one
soup=BeautifulSoup(html, 'html.parser')
# each span yields two strings (label, value), which dict() reads as a key/value pair
dict(s.stripped_strings for s in soup.select('.attrgroup span'))
Output
{'condition:': 'good',
'cylinders:': '8 cylinders',
'drive:': '4wd',
'fuel:': 'gas',
'odometer:': '138000',
'paint color:': 'blue',
'size:': 'full-size',
'title status:': 'clean',
'transmission:': 'automatic',
'type:': 'pickup'}
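Note that the keys in this output keep their trailing colon. If you want keys like "condition" instead, here is a rough sketch that strips it with rstrip and also skips any span that does not yield exactly two strings. The shortened markup is just a cut-down copy of the example above, and the variable name attrs is arbitrary.

```python
from bs4 import BeautifulSoup

# Cut-down copy of the listing markup, for illustration only
html = (
    '<p class="attrgroup">'
    "<span>condition:<b>good</b></span><br/>"
    "<span>drive:<b>4wd</b></span><br/>"
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

attrs = {}
for span in soup.select(".attrgroup span"):
    parts = list(span.stripped_strings)
    # only accept spans that follow the label/value pattern
    if len(parts) == 2:
        label, value = parts
        attrs[label.rstrip(":")] = value

print(attrs)  # {'condition': 'good', 'drive': '4wd'}
```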