Python3 - Extract the text from a bs4.element.Tag and add to a dictionary

I am scraping a website which returns a bs4.element.Tag similar to the following:

<span class="attributes-value">
        <span >four door</span>
        <span >inline 4 engine</span>
        <span >24 gallons per mile</span>
</span>

I am trying to extract just the text from this block and add it to a dictionary. All of the examples I have seen on the forum rely on some common element such as an 'id'. I am not an HTML person, so I may be using incorrect terms.

What I would like to do is get the text ("four door", "inline 4 engine", etc.) and add each string as a value in a dictionary, keyed by a pre-designated variable, car_model. For example:

cars = {'528i':['four door', 'inline 4 engine']}

I can't figure out a universal way to pull out the text, because there may be more or fewer span elements with different text. Thanks for your help!

CodePudding user response：

You can use:

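from collections import defaultdict

from bs4 import BeautifulSoup

# html_doc is the HTML string shown in the question
out = defaultdict(list)

soup = BeautifulSoup(html_doc, 'html.parser')
# Select every span nested inside the element with class "attributes-value"
for tag in soup.select(".attributes-value span"):
    out["528i"].append(tag.text)

print(dict(out))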

Prints:

{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
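Since the question starts from a bs4.element.Tag rather than a whole document, the same idea may also work directly on that tag, with no class selector at all. The following is only a sketch: the class name attributes-value is inferred from the selector above, and the names outer and car_model are placeholders for whatever the scraper actually holds.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Sample markup from the question; the class name is an assumption
# inferred from the selector used in the answer.
html_doc = """
<span class="attributes-value">
        <span>four door</span>
        <span>inline 4 engine</span>
        <span>24 gallons per mile</span>
</span>
"""

car_model = "528i"  # pre-designated key, as described in the question

soup = BeautifulSoup(html_doc, "html.parser")
outer = soup.find("span")  # stands in for the Tag the scraper returned

out = defaultdict(list)
# find_all("span") on a Tag searches only that tag's descendants,
# so the inner spans need no id or class of their own.
for inner in outer.find_all("span"):
    out[car_model].append(inner.get_text(strip=True))

print(dict(out))
```

Because the loop simply walks whatever spans are present, it should cope with more or fewer attributes per car.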

CodePudding user response：

You need to loop through all the elements matched by a selector and extract the text from each one.

A selector is a pattern that describes which elements you want. In my case, the selector is .attributes-value span, where .attributes-value matches any element whose class is attributes-value, and span matches every span nested inside that element.

The get_text() method returns the text inside a tag, including the text of any nested tags, with the markup removed. This is exactly what you need.

I also recommend the lxml parser, which is generally faster than the built-in html.parser. It is a separate package, so it must be installed first (for example with pip install lxml).
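To make the selector behaviour concrete, here is a small sketch comparing the descendant form used in this answer with the stricter child form. The nested "turbo" span is invented purely for illustration and does not appear in the question's markup.

```python
from bs4 import BeautifulSoup

# The nested "turbo" span is made up for this example.
html = """
<span class="attributes-value">
    <span>four door</span>
    <span>inline 4 engine <span>turbo</span></span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (a space): spans at any depth inside the class.
descendants = soup.select(".attributes-value span")

# Child combinator (>): only spans that are direct children of the class.
children = soup.select(".attributes-value > span")

print(len(descendants), len(children))
```

If the real page nests spans inside spans, the child form is probably the safer choice, since the descendant form would return the nested text twice (once in the parent's text and once on its own).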
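As a rough illustration of how get_text() treats whitespace, the sketch below calls it on the outer element in three ways. The class name on the outer span is assumed from the selector; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

html = """
<span class="attributes-value">
        <span>four door</span>
        <span>inline 4 engine</span>
        <span>24 gallons per mile</span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.select_one(".attributes-value")

# Plain get_text() keeps the newlines and indentation between the spans.
raw = outer.get_text()

# A separator plus strip=True trims each piece and skips the empty ones.
joined = outer.get_text(" ", strip=True)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(outer.stripped_strings)

print(repr(raw))
print(joined)
print(pieces)
```

The stripped_strings form already gives a list, so it could be assigned to the dictionary directly if no per-span processing is needed.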
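lxml is an optional dependency, so a script that names it will fail on machines where it is not installed. One possible way to handle that, sketched here with a hypothetical helper called make_soup, is to fall back to the built-in parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<span class="attributes-value"><span>four door</span></span>')
texts = [span.get_text() for span in soup.select(".attributes-value span")]
print(texts)
```

The two parsers can build slightly different trees for malformed HTML, so it is worth checking the results against both if the page is messy.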

The full code is attached below:

from bs4 import BeautifulSoup
# lxml only needs to be installed; BeautifulSoup loads it by name, so no import is required

html = '''
<span class="attributes-value">
        <span >four door</span>
        <span >inline 4 engine</span>
        <span >24 gallons per mile</span>
</span>
'''

soup = BeautifulSoup(html, 'lxml')

cars = {
    '528i': []
}

for span in soup.select(".attributes-value span"):
    cars['528i'].append(span.get_text())

print(cars)

Output:

{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
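If several car models have to be collected, the loop could be wrapped in a small function. This is only a sketch: add_attributes is a hypothetical helper, the shortened markup strings are trimmed copies of the sample, and "demo-model" with its single attribute is invented to show a different span count.

```python
from bs4 import BeautifulSoup


def add_attributes(cars, car_model, html):
    """Append the text of every matching span to cars[car_model]."""
    soup = BeautifulSoup(html, "html.parser")
    texts = [
        span.get_text(strip=True)
        for span in soup.select(".attributes-value span")
    ]
    # setdefault creates the list the first time a model is seen.
    cars.setdefault(car_model, []).extend(texts)
    return cars


cars = {}
add_attributes(
    cars,
    "528i",
    '<span class="attributes-value">'
    "<span>four door</span><span>inline 4 engine</span></span>",
)
# "demo-model" is a made-up entry with fewer spans.
add_attributes(
    cars,
    "demo-model",
    '<span class="attributes-value"><span>two door</span></span>',
)
print(cars)
```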