Suppose I have this HTML:
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
I want to extract the name and age text and store them in a dictionary.
So far I have been able to get the age, but getting only the name has been difficult.
This is what I have tried so far:
from bs4 import BeautifulSoup

with open('test.html', 'r') as file:
    contents = file.read()

soup = BeautifulSoup(contents, 'html.parser')
# getText() on the outer span also returns the text of its child span
name = soup.find(class_="name").getText()
age = soup.find("span", class_="age").getText()
results = {}
results['name'] = name
results['age'] = age
print(results)
The output is {'name': '\n21\n Will Green\n ', 'age': '21'}
As you can see, the name contains stray newlines and spaces, and it also includes the text of the child element.
How can I resolve this?
Expected output: {'name': 'Will Green', 'age': '21'}
Answer:
If the structure is always the same, you could use stripped_strings and zip() it with the expected keys:
dict(zip(['age','name'],soup.select_one('span.name').stripped_strings))
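To see why this works, it may help to look at what stripped_strings yields before it is zipped. The following is a minimal sketch; the class attributes in the markup are assumed from the selectors used in the question's code.

```python
from bs4 import BeautifulSoup

# Markup assumed to match the question; class names come from the question's selectors.
html = '''
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
'''

soup = BeautifulSoup(html, 'html.parser')

# stripped_strings is a generator over every text node under the tag,
# with surrounding whitespace removed and whitespace-only nodes skipped.
parts = list(soup.select_one('span.name').stripped_strings)
print(parts)

# zip() pairs each key with the string at the same position, so the key
# order has to follow the document order: age first, then name.
result = dict(zip(['age', 'name'], parts))
print(result)
```

Note that this relies on the age always appearing before the name inside the outer span. If the order can vary, the keys would end up paired with the wrong values.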
An alternative approach could be to select the age first and then its next_sibling:
{
    'age': soup.select_one('span.age').text,
    'name': soup.select_one('span.age').next_sibling.get_text(strip=True)
}
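A slightly more defensive sketch of the same sibling idea is shown below. It assumes the markup from the question and that the name is a plain text node directly after the age span. Because that text node is a str subclass, str.strip() is enough and the code does not depend on text nodes having get_text().

```python
from bs4 import BeautifulSoup, NavigableString

# Markup assumed to match the question; class names come from the question's selectors.
html = '''
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
'''

soup = BeautifulSoup(html, 'html.parser')

age_tag = soup.select_one('span.age')
sibling = age_tag.next_sibling  # the text node that follows the age span

# Only treat the sibling as the name if it really is a text node;
# it could be None or another tag in differently shaped markup.
name = sibling.strip() if isinstance(sibling, NavigableString) else ''

result = {'age': age_tag.get_text(strip=True), 'name': name}
print(result)
```

The isinstance check is only there so the sketch falls back to an empty name instead of raising when the age span has no following text.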
Example
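from bs4 import BeautifulSoup

html = '''
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
'''

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(dict(zip(['age', 'name'], soup.select_one('span.name').stripped_strings)))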
Output
{'age': '21', 'name': 'Will Green'}
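If the order of the age and the name inside the outer span is not guaranteed, another option is to read only the text nodes that are direct children of the name span, which skips the nested age span entirely. This is a sketch under the same markup assumptions; direct_text is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup

# Markup assumed to match the question; class names come from the question's selectors.
html = '''
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
'''

def direct_text(tag):
    """Hypothetical helper: join the tag's own text nodes, ignoring child tags."""
    # recursive=False limits the search to direct children, and string=True
    # keeps only text nodes, so the nested age span is not included.
    return ''.join(tag.find_all(string=True, recursive=False)).strip()

soup = BeautifulSoup(html, 'html.parser')

result = {
    'name': direct_text(soup.select_one('span.name')),
    'age': soup.select_one('span.age').get_text(strip=True),
}
print(result)
```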