BeautifulSoup - xml - find_next: limiting to one attribute

Any help is appreciated! Using the example XML file below, I'm getting incorrect output.

Incorrect output: 
Emp_F_Name: Jill
Emp_M_Name: H
Emp_L_Name: Jones

Desired output:
Emp_F_Name: Jill
Emp_M_Name: None or NULL
Emp_L_Name: Jones

I'm not sure why find_next is searching outside the element I selected (employee).

   <?xml version="1.0" encoding="utf-8"?>
    <org value="Tech">
    <employee>
    <name>
    <family>Jones</family>
    <given>Jill</given>
    </name>
    </employee>
    <manager>
    <name>
    <family>Fisher</family>
    <given>Junior</given>
    <given>H</given>
    </name>
    </manager>
    </org>

Here's the code I'm using.

employee = soup.find("employee")

for i in employee.find_all('name'):
    fname = employee.find('given')
    print("Emp_F_Name: ", fname.get_text())
    
    mname = fname.find_next('given')
    print("Emp_M_Name: ", mname.get_text())
    
    lname = employee.find('family')
    print("Emp_L_Name: ", lname.get_text())

When I run the same code but for the manager, it seems to work.

manager = soup.find("manager")

CodePudding user response：

If the structure is almost identical for every entry, you can use find_all() to collect all the given elements under a name and then check whether there are one or two.

given = i.find_all('given')
fname = given[0]
print("Emp_F_Name: ", fname.get_text())
    
mname = given[1].get_text() if len(given) > 1 else None
print("Emp_M_Name: ", mname)

I don't think you need to iterate over employee at all, but if you do, call find() on the loop variable i rather than on employee.

Example

from bs4 import BeautifulSoup

xml='''<?xml version="1.0" encoding="utf-8"?>
    <org value="Tech">
    <employee>
    <name>
    <family>Jones</family>
    <given>Jill</given>
    </name>
    </employee>
    <manager>
    <name>
    <family>Fisher</family>
    <given>Junior</given>
    <given>H</given>
    </name>
    </manager>
    </org>'''

soup = BeautifulSoup(xml, 'lxml')

employee = soup.find("employee")

for i in employee.find_all('name'):
    given= i.find_all('given')
    fname = given[0]
    print("Emp_F_Name: ", fname.get_text())
    
    mname = given[1].get_text() if len(given) > 1 else None
    print("Emp_M_Name: ", mname)
    
    lname = i.find('family')
    print("Emp_L_Name: ", lname.get_text())

Output

Emp_F_Name:  Jill
Emp_M_Name:  None
Emp_L_Name:  Jones

Alternative

Isolate employee as a separate tree, so that find_next() has nothing beyond employee to search:

employee = BeautifulSoup(str(soup.find("employee")), 'lxml')

for i in employee.find_all('name'):
    fname = i.find('given')
    print("Emp_F_Name: ", fname.get_text())
    
    mname = fname.find_next('given').get_text() if fname.find_next('given') else None
    print("Emp_M_Name: ", mname)
    
    lname = i.find('family')
    print("Emp_L_Name: ", lname.get_text())
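
As for why the original code left employee: find_next() looks at everything that comes later in the document, not only at what is inside the element you started from. If reparsing feels heavy, find_next_sibling() might be another option, since it only looks at later siblings under the same parent. A minimal sketch, assuming the XML from the question without its declaration line; the middle_name helper and the html.parser choice are mine, not from the question:

from bs4 import BeautifulSoup

xml = """<org value="Tech">
<employee><name><family>Jones</family><given>Jill</given></name></employee>
<manager><name><family>Fisher</family><given>Junior</given><given>H</given></name></manager>
</org>"""

# html.parser is an assumption here, used only to avoid the lxml dependency
soup = BeautifulSoup(xml, "html.parser")

def middle_name(person):
    # hypothetical helper: returns the text of a second <given>, or None
    first = person.find("given")
    if first is None:
        return None
    # only siblings inside the same <name> are considered
    second = first.find_next_sibling("given")
    return second.get_text() if second else None

# find_next() keeps walking past </employee> into <manager>
emp_first = soup.find("employee").find("given")
leaked = emp_first.find_next("given")

print("Emp_M_Name: ", middle_name(soup.find("employee")))
print("Mgr_M_Name: ", middle_name(soup.find("manager")))

Here leaked ends up being a given element that belongs to manager, which is the behaviour the question ran into.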

CodePudding user response：

Using the standard library's XML parser (no external library needed):

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="UTF-8"?>
<org value="Tech">
   <employee>
      <name>
         <family>Jones</family>
         <given>Jill</given>
      </name>
   </employee>
   <manager>
      <name>
         <family>Fisher</family>
         <given>Junior</given>
         <given>H</given>
      </name>
   </manager>
</org>'''



attrs = {'Emp_F_Name':'given',
         'Emp_L_Name':'family',
         'Emp_M_Name': None}

root = ET.fromstring(xml)
name = root.find('.//name')
for k,v in attrs.items():
  print(f'{k}: {name.find(v).text if v else None}')

Output

Emp_F_Name: Jill
Emp_L_Name: Jones
Emp_M_Name: None
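
Note that the mapping above fixes Emp_M_Name to None, so it would not report a middle name even when a second given element exists. A rough sketch of how ElementTree could read it instead, using findall() on the person's own name element; the name_parts helper and its dictionary keys are made up for illustration, and the XML is the one from the question without its declaration line:

import xml.etree.ElementTree as ET

xml = """<org value="Tech">
   <employee>
      <name>
         <family>Jones</family>
         <given>Jill</given>
      </name>
   </employee>
   <manager>
      <name>
         <family>Fisher</family>
         <given>Junior</given>
         <given>H</given>
      </name>
   </manager>
</org>"""

def name_parts(person):
    # hypothetical helper: person is an <employee> or <manager> element
    name = person.find("name")
    # findall() only returns direct children of this <name>
    given = [g.text for g in name.findall("given")]
    return {
        "first": given[0] if given else None,
        "middle": given[1] if len(given) > 1 else None,
        "last": name.findtext("family"),
    }

root = ET.fromstring(xml)
emp = name_parts(root.find("employee"))
mgr = name_parts(root.find("manager"))

print(emp)
print(mgr)

Because find() and findall() on an Element search only beneath that element, the lookup cannot wander from employee into manager.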