Extract text from a span without an id in BeautifulSoup

Does anyone know how to extract the text from each span in a p tag using BeautifulSoup? I'm trying to figure this out in Python, using a Craigslist car listing.

This is what I was able to accomplish so far:

#retrieve post spans
spans = soup.find_all(class_='attrgroup')
print(spans[1].prettify())

Ideally, I'm trying to create a dictionary. Example:

attrs = {
  "condition": "good",
  "cylinders": "8 cylinders",
  "drive": "4wd",
  # etc.
}

OUTPUT

<p class="attrgroup"><span>condition:<b>good</b></span><br/><span>cylinders:<b>8 cylinders</b></span><br/><span>drive:<b>4wd</b></span><br/><span>fuel:<b>gas</b></span><br/><span>odometer:<b>138000</b></span><br/><span>paint color:<b>blue</b></span><br/><span>size:<b>full-size</b></span><br/><span>title status:<b>clean</b></span><br/><span>transmission:<b>automatic</b></span><br/><span>type:<b>pickup</b></span><br/></p>

CodePudding user response：

Try this:

from bs4 import BeautifulSoup

sample_html = """
<p class="attrgroup">
       <span>
        condition:
        <b>
         good
        </b>
       </span>
       <br/>
       <span>
        cylinders:
        <b>
         8 cylinders
        </b>
       </span>
       <br/>
       <span>
        drive:
        <b>
         4wd
        </b>
       </span>
       <br/>
       <span>
        fuel:
        <b>
         gas
        </b>
       </span>
       <br/>
       <span>
        odometer:
        <b>
         138000
        </b>
       </span>
       <br/>
       <span>
        paint color:
        <b>
         blue
        </b>
       </span>
       <br/>
       <span>
        size:
        <b>
         full-size
        </b>
       </span>
       <br/>
       <span>
        title status:
        <b>
         clean
        </b>
       </span>
       <br/>
       <span>
        transmission:
        <b>
         automatic
        </b>
       </span>
       <br/>
       <span>
        type:
        <b>
         pickup
        </b>
       </span>
       <br/>
      </p>
"""

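# Each span's text looks like "condition:good"; split on the first colon only,
# so a value that itself contains a colon stays intact.
your_text = [
    i.getText(strip=True).split(":", 1) for i
    in BeautifulSoup(sample_html, 'html.parser').select("span")
]
print({k: v for k, v in your_text})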

Output:

{'condition': 'good', 'cylinders': '8 cylinders', 'drive': '4wd', 'fuel': 'gas', 'odometer': '138000', 'paint color': 'blue', 'size': 'full-size', 'title status': 'clean', 'transmission': 'automatic', 'type': 'pickup'}
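The dict comprehension above expects every span to split into exactly two parts. If a listing also contains spans with no colon (an assumption about other listings, not something shown in the question), the unpacking would raise a ValueError. A minimal sketch that skips such spans, using a hypothetical helper name parse_attrs and made-up sample HTML:

```python
from bs4 import BeautifulSoup


def parse_attrs(html):
    """Return a dict of label -> value for every 'label:<b>value</b>' span."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for span in soup.select("span"):
        # Join the span's text pieces with a space, e.g. "condition: good".
        text = span.get_text(" ", strip=True)
        # partition() splits on the first colon only and never raises.
        key, sep, value = text.partition(":")
        if not sep:
            continue  # no colon in this span, so it is not a label/value pair
        result[key.strip()] = value.strip()
    return result


# Hypothetical sample: the first span has no colon and is skipped.
sample = (
    '<p class="attrgroup"><span>2005 ford f-150</span><br/>'
    "<span>condition:<b>good</b></span><br/>"
    "<span>drive:<b>4wd</b></span><br/></p>"
)
print(parse_attrs(sample))
```

Whether skipping is the right behaviour depends on the page; you might prefer to store those spans under a separate key instead.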

CodePudding user response：

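You could use stripped_strings if the pattern is always the same, i.e. each span contains exactly two strings: the label and the value in <b>. Note that the keys keep their trailing colon.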

Example

from bs4 import BeautifulSoup
html = '''<p class="attrgroup"><span>condition:<b>good</b></span><br/><span>cylinders:<b>8 cylinders</b></span><br/><span>drive:<b>4wd</b></span><br/><span>fuel:<b>gas</b></span><br/><span>odometer:<b>138000</b></span><br/><span>paint color:<b>blue</b></span><br/><span>size:<b>full-size</b></span><br/><span>title status:<b>clean</b></span><br/><span>transmission:<b>automatic</b></span><br/><span>type:<b>pickup</b></span><br/></p>'''

# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Each span yields two stripped strings, which dict() takes as a (key, value) pair
dict(s.stripped_strings for s in soup.select('.attrgroup span'))

Output

{'condition:': 'good',
 'cylinders:': '8 cylinders',
 'drive:': '4wd',
 'fuel:': 'gas',
 'odometer:': '138000',
 'paint color:': 'blue',
 'size:': 'full-size',
 'title status:': 'clean',
 'transmission:': 'automatic',
 'type:': 'pickup'}
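The keys in the output above keep their trailing colon. If you want keys like the ones in the question, one option is to unpack each pair and strip the colon yourself. A minimal sketch on a shortened version of the HTML (the variable names attrs and parts are only illustrative):

```python
from bs4 import BeautifulSoup

html = (
    '<p class="attrgroup"><span>condition:<b>good</b></span><br/>'
    "<span>drive:<b>4wd</b></span><br/></p>"
)

soup = BeautifulSoup(html, "html.parser")

attrs = {}
for span in soup.select(".attrgroup span"):
    parts = list(span.stripped_strings)
    if len(parts) != 2:
        continue  # not a label/value pair, so leave it out
    key, value = parts
    # Remove the trailing colon from the label before using it as a key.
    attrs[key.rstrip(":")] = value

print(attrs)
```

The length check is there because dict() fails if any span yields more or fewer than two strings.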