Get the parent's text and the child's text separately and store them in a dictionary

Suppose I have this html,

<span class="name">
  <span class="age">21</span>
  Will Green
</span>

I want to extract the name and age text and store them into a dictionary.

So far I have been able to get the age, but getting only the name has been difficult.

This is what I tried so far.

from bs4 import BeautifulSoup

with open('test.html', 'r') as file:
    contents = file.read()
    soup = BeautifulSoup(contents, 'html.parser')

    name = soup.find(class_="name").getText()
    age = soup.find("span", class_="age").getText()

    results = {}
    results['name'] = name
    results['age'] = age

    print(results)

The output is {'name': '\n21\n Will Green\n ', 'age': '21'}

As you can see, the name contains newline characters, extra spaces, and the text of the child element as well.

How can I resolve this?

Expected output: {'name': 'Will Green', 'age': '21'}

CodePudding user response：

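Provided the structure is always the same (the age element first, then the name text), you could use stripped_strings, which yields each text fragment with surrounding whitespace removed, and zip() the result with the expected keys:

dict(zip(['age', 'name'], soup.select_one('span.name').stripped_strings))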

An alternative approach could be to select the age element first and then take its next_sibling, which is the text node holding the name:

{
    'age': soup.select_one('span.age').text,
    'name': soup.select_one('span.age').next_sibling.get_text(strip=True)
}
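
As a self-contained sketch of the next_sibling approach: this assumes the markup carries class="name" on the outer span and class="age" on the inner one, and that the name text directly follows the age element. It uses str.strip() on the sibling, since a text node is a str subclass, rather than relying on get_text() being available on text nodes in every bs4 version.

```python
from bs4 import BeautifulSoup

# Sample markup; the class attributes are assumed from the selectors used above.
html = '''
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
'''

soup = BeautifulSoup(html, 'html.parser')

age_tag = soup.select_one('span.age')
# The node right after </span> is the text node containing the name.
name_node = age_tag.next_sibling

result = {
    'age': age_tag.get_text(strip=True),
    # Guard against the sibling being missing or being another tag.
    'name': name_node.strip() if isinstance(name_node, str) else None,
}
print(result)
```

If another tag ever sits between the age and the name, next_sibling will no longer be the name text, and 'name' falls back to None here.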

Example

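from bs4 import BeautifulSoup

html = '''
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
'''

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(dict(zip(['age', 'name'], soup.select_one('span.name').stripped_strings)))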

Output

{'age': '21', 'name': 'Will Green'}
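
The original attempt returned the age inside the name because getText() on the parent concatenates the text of all descendants. Another possible way around that, sketched below, is to collect only the text nodes that are direct children of the parent. The helper name own_text is hypothetical, and the class attributes are again assumed from the selectors used in the question.

```python
from bs4 import BeautifulSoup

# Sample markup; the class attributes are assumed from the question's selectors.
html = '''
<span class="name">
  <span class="age">21</span>
  Will Green
</span>
'''

soup = BeautifulSoup(html, 'html.parser')


def own_text(tag):
    """Join the non-empty text nodes that are direct children of `tag`."""
    # recursive=False restricts the search to direct children,
    # so the text inside the nested age span is skipped.
    parts = tag.find_all(string=True, recursive=False)
    return ' '.join(p.strip() for p in parts if p.strip())


parent = soup.select_one('span.name')
record = {
    'name': own_text(parent),
    'age': parent.select_one('span.age').get_text(strip=True),
}
print(record)
```

This variant does not depend on the order of the child element and the text, only on the name being the parent's own direct text.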