I am scraping a website which returns a bs4.element.Tag similar to the following:
<span class="attributes-value">
<span>four door</span>
<span>inline 4 engine</span>
<span>24 gallons per mile</span>
</span>
I am trying to extract just the text from this block and add it to a dictionary. All of the examples I have seen on the forum rely on some common element such as an 'id'. I am not an HTML guy, so I may be using incorrect terms.
What I would like to do is get the text ("four door", "inline 4 engine", etc.) and add it as values to a dictionary, with the key being a pre-designated variable, car_model.
cars = {'528i':['four door', 'inline 4 engine']}
I can't figure out a universal way to pull out the text, because there may be more or fewer span elements with different text. Thanks for your help!
CodePudding user response:
You can use:
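from collections import defaultdict
from bs4 import BeautifulSoup

# html_doc holds the HTML snippet from the question
out = defaultdict(list)
soup = BeautifulSoup(html_doc, 'html.parser')
# collect the text of every span nested inside the .attributes-value element
for tag in soup.select(".attributes-value span"):
    out["528i"].append(tag.text)
print(dict(out))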
Prints:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
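Since the question asks for the key to come from a car_model variable, the same idea could be wrapped in a small helper. This is only a sketch: the function name add_car_attributes is made up for illustration, and the HTML assumes the outer span carries the attributes-value class that the selector relies on.

```python
from collections import defaultdict
from bs4 import BeautifulSoup


def add_car_attributes(cars, car_model, html_doc):
    """Append the text of every span inside .attributes-value to cars[car_model].

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html_doc, "html.parser")
    for tag in soup.select(".attributes-value span"):
        cars[car_model].append(tag.get_text(strip=True))
    return cars


# Assumed markup: the outer span has class="attributes-value"
html_doc = """
<span class="attributes-value">
<span>four door</span>
<span>inline 4 engine</span>
<span>24 gallons per mile</span>
</span>
"""

cars = defaultdict(list)
car_model = "528i"
add_car_attributes(cars, car_model, html_doc)
print(dict(cars))
```

Because cars is a defaultdict(list), calling the helper again with a different car_model simply starts a new list under that key.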
CodePudding user response:
You need to loop through all the elements matched by a selector and extract the text from each of them.
A selector is a specific path to the element you want. In my case, the selector is .attributes-value span, where .attributes-value matches the element with that class, and span matches every span tag nested inside that element.
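One thing worth knowing about this selector: a space means "any descendant", so a span nested inside another span is matched too. If that ever matters, the child combinator > restricts the match to direct children. A small sketch, using a made-up snippet with one nested span:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: the second item contains a nested span
html = """
<span class="attributes-value">
<span>four door</span>
<span>engine: <span>inline 4</span></span>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): matches spans at any depth
descendants = [s.get_text() for s in soup.select(".attributes-value span")]

# Child combinator (>): matches only spans directly under .attributes-value
children = [s.get_text() for s in soup.select(".attributes-value > span")]

print(descendants)
print(children)
```

With flat markup like the one in the question, both selectors return the same three spans.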
The get_text() method retrieves the content between the opening and closing tags. This is exactly what you need.
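If the scraped text comes back with stray whitespace, get_text() also accepts a separator and a strip flag. A minimal sketch on a made-up one-line snippet (the markup is an assumption, chosen only to show the difference):

```python
from bs4 import BeautifulSoup

# Made-up snippet with leading spaces, a nested tag and a trailing newline
soup = BeautifulSoup("<span>  four <b>door</b>\n</span>", "html.parser")
span = soup.span

raw = span.get_text()                    # whitespace kept as in the markup
clean = span.get_text(strip=True)        # each text piece stripped, then joined with ""
spaced = span.get_text(" ", strip=True)  # each text piece stripped, then joined with " "

print(repr(raw))
print(repr(clean))
print(repr(spaced))
```

Note that strip=True alone glues the pieces together, so passing a separator is usually what you want when a tag contains nested tags.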
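I also recommend using the lxml parser, because it is generally faster than the built-in html.parser. It is a separate package, so it has to be installed first (for example with pip install lxml).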
The full code is attached below:
from bs4 import BeautifulSoup

html = '''
<span class="attributes-value">
<span>four door</span>
<span>inline 4 engine</span>
<span>24 gallons per mile</span>
</span>
'''

# the 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')

cars = {
    '528i': []
}

# append the text of each span nested inside .attributes-value
for span in soup.select(".attributes-value span"):
    cars['528i'].append(span.get_text())

print(cars)
Output:
{'528i': ['four door', 'inline 4 engine', '24 gallons per mile']}
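If you already hold the bs4.element.Tag, as the question describes, and there is no class or id to select on, one possible alternative is the tag's stripped_strings generator, which yields every non-empty piece of text under the tag regardless of how many spans there are. A sketch, where tag is built from a string only to stand in for whatever your scraper returns:

```python
from bs4 import BeautifulSoup

# Stand-in for the Tag returned by the scraper; no class attributes needed here
html = """
<span>
<span>four door</span>
<span>inline 4 engine</span>
<span>24 gallons per mile</span>
</span>
"""
tag = BeautifulSoup(html, "html.parser").span  # the outer span

car_model = "528i"
cars = {}

# stripped_strings skips whitespace-only text and strips the rest
cars[car_model] = list(tag.stripped_strings)

print(cars)
```

The trade-off is that this collects all text under the tag, so it only fits when everything inside the block is something you want to keep.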