Best means to parse through multiple subelements within XML

Time:02-01

I am attempting to use xmltodict to parse XML in the hope of eventually converting it to a more readable table format for others. I have been able to get through most of the XML, but when I come to an element with multiple subelements, I feel I am chasing my tail. My hope is to use pandas with the values I extract from the XML...

Here is a sanitized version of the XML I am attempting to parse:

  <batchConfiguration>
    <batchJob name="BATCHJOB1">
      <className>batchJob1</className>
      <schedule>Y</schedule>
      <interval>300</interval>
      <systemControlled>N</systemControlled>
    </batchJob>
    <batchJob name="BATCHJOB2">
      <params>
        <param name="QueueName1">batchQueue1</param>
      </params>
      <className>batchJob2</className>
      <startTime>02:10:00</startTime>
      <schedule>N</schedule>
      <daysOfTheWeek>YYYYYYY</daysOfTheWeek>
      <systemControlled>N</systemControlled>
    </batchJob>
    <batchJob name="BATCHJOB3">
      <params>
        <param name="ignoreErrors">Y</param>
        <param name="batchSize">1000</param>
      </params>
      <className>classyBatchJob</className>
      <schedule>Y</schedule>
      <interval>90</interval>
      <systemControlled>N</systemControlled>
    </batchJob>
  </batchConfiguration>

My thought was that I could somehow loop through the entries where there are multiple "param" elements. I can return a single "param" but am stumped when there are multiple. Here is my code to date. It contains bits and pieces where I try to figure things out as I go. The XML is read from a file...

import xmltodict as xml
import pprint

#File to parse
fileptr=open(r"FileIRead.xml")

# Show raw XML text file data
raw_file= fileptr.read()
# print(raw_file)

# Create an XML dictionary
xml_dict=xml.parse(raw_file)
pprint.pprint(xml_dict)

xml_dict1=xml.parse(raw_file)['batchConfiguration']['batchJob']
pprint.pprint(xml_dict1)
# pprint.pprint(xml_dict['batchConfiguration']['batchJob'])

# https://docs.python.org/3/tutorial/errors.html

for bJ in xml_dict1:
    bJName=bJ['@name']
    print(f"Name: {bJ['@name']}")
    print(bJName)
    try:
        print(f"Interval: {bJ['interval']}")
    except:
        print("Interval: N/A")
    try:
        print(f"Scheduled: {bJ['schedule']}")
    except:
        print("Scheduled: N/A")
    try:
        print(f"Start Time: {bJ['startTime']}")
    except:
        print("Start Time: N/A")
    try:
        print(f"End Time: {bJ['endTime']}")
    except:
        print("End Time: N/A")
    try:
        # This works fine to return only a single element. With multiple it fails.
        print(f"Params: {bJ['params']['param']['@name']} - {bJ['params']['param']['#text']}")
    except:
        print("Params: N/A")
    try:
        print(f"Classname: {bJ['className']}")
    except:
        print("Classname: N/A")
    try:
        print(f"DaysOfWeek: {bJ['daysOfTheWeek']}")
    except:
        print("DaysOfWeek: N/A")
    try:
        # Attempt to get all parameters single or multiple
        xml_dict2=xml.parse(raw_file)['params']['param']
        pprint.pprint(xml_dict2)
        for bJ1 in xml_dict2['params']['param']:
            print(f"--- {bJ1['@name']}")
    except:
        print("It no worky")

Edit: By request... The output I have been able to get is:

Name: BATCHJOB1
Classname: batchJob1
... (etc)

My end goal is to take the output and put it into column format something like this:

Name            Classname    ...
BATCHJOB1       batchJob1

"N/A" would be placed where the element does not exist or has no value.

CodePudding user response:

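xmltodict returns a dict when an element occurs only once under its parent, but a list of dicts when it occurs two or more times. That is why bJ['params']['param']['@name'] works for BATCHJOB2 (one param) but fails for BATCHJOB3 (two params). The force_list parameter of .parse lets you name keys that should always be returned as lists.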

You could use:

xml_dict1 = xml.parse(raw_file, force_list=('param',))['batchConfiguration']['batchJob']

Then:

try:
    for p in bJ['params']['param']:
        print(f"Params: {p['@name']} - {p['#text']}")
except KeyError:  # catch the specific exception; avoid a bare 'except'
    print("Params: N/A")
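
If you would rather not depend on force_list, another option is to normalize the value yourself before looping. The following is only a sketch: the helper names as_list and format_params are my own, and the three job dicts are hand-written to mimic the shape xmltodict returns for the sample XML (no params, one param as a dict, two params as a list).

```python
def as_list(value):
    """Wrap value in a list unless it already is one; None becomes []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def format_params(job):
    """Return all params of one batchJob dict as 'name=value; ...', or 'N/A'."""
    # 'params' may be missing entirely, so fall back to an empty dict.
    params = (job.get("params") or {}).get("param")
    pairs = [f"{p['@name']}={p.get('#text', '')}" for p in as_list(params)]
    return "; ".join(pairs) if pairs else "N/A"


# Hand-written stand-ins for what xmltodict yields per batchJob (assumption).
no_params = {"@name": "BATCHJOB1", "className": "batchJob1"}
one_param = {
    "@name": "BATCHJOB2",
    "params": {"param": {"@name": "QueueName1", "#text": "batchQueue1"}},
}
two_params = {
    "@name": "BATCHJOB3",
    "params": {
        "param": [
            {"@name": "ignoreErrors", "#text": "Y"},
            {"@name": "batchSize", "#text": "1000"},
        ]
    },
}

for job in (no_params, one_param, two_params):
    print(f"{job['@name']}: {format_params(job)}")
```

Using dict.get here also replaces the try/except around each lookup: a missing key simply yields the default instead of raising KeyError.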

CodePudding user response:

If I understand you correctly, this can be accomplished by using pandas.read_xml():

import pandas as pd
pd.read_xml([your_xml]).iloc[:, 0:2]  # [your_xml] is a placeholder for the path to your XML file

Output, based on your sample xml:

      name      className
0   BATCHJOB1   batchJob1
1   BATCHJOB2   batchJob2
2   BATCHJOB3   classyBatchJob
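
If you also want the nested param values and "N/A" for missing elements in the same table, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree instead of pandas.read_xml. The function name job_rows, the COLUMNS list, and the "name=value; ..." format for params are my own illustrative choices, not anything from the original post.

```python
import xml.etree.ElementTree as ET

# The sample XML from the question, inlined so the sketch is self-contained.
SAMPLE = """<batchConfiguration>
  <batchJob name="BATCHJOB1">
    <className>batchJob1</className>
    <schedule>Y</schedule>
    <interval>300</interval>
    <systemControlled>N</systemControlled>
  </batchJob>
  <batchJob name="BATCHJOB2">
    <params>
      <param name="QueueName1">batchQueue1</param>
    </params>
    <className>batchJob2</className>
    <startTime>02:10:00</startTime>
    <schedule>N</schedule>
    <daysOfTheWeek>YYYYYYY</daysOfTheWeek>
    <systemControlled>N</systemControlled>
  </batchJob>
  <batchJob name="BATCHJOB3">
    <params>
      <param name="ignoreErrors">Y</param>
      <param name="batchSize">1000</param>
    </params>
    <className>classyBatchJob</className>
    <schedule>Y</schedule>
    <interval>90</interval>
    <systemControlled>N</systemControlled>
  </batchJob>
</batchConfiguration>"""

# Child elements to turn into columns (chosen to match the question's fields).
COLUMNS = ["className", "schedule", "interval", "startTime",
           "endTime", "daysOfTheWeek", "systemControlled"]


def job_rows(xml_text):
    """Return one dict per batchJob, with 'N/A' for anything missing or empty."""
    root = ET.fromstring(xml_text)
    rows = []
    for job in root.iter("batchJob"):
        row = {"name": job.get("name", "N/A")}
        for col in COLUMNS:
            text = job.findtext(col)  # None when the element is absent
            row[col] = text.strip() if text and text.strip() else "N/A"
        # findall returns zero, one, or many matches, so no special-casing.
        params = [f"{p.get('name')}={(p.text or '').strip()}"
                  for p in job.findall("params/param")]
        row["params"] = "; ".join(params) if params else "N/A"
        rows.append(row)
    return rows


rows = job_rows(SAMPLE)
for row in rows:
    print(row)
```

Since rows is a plain list of dicts with identical keys, it can be handed to pandas.DataFrame(rows) to get the column layout described in the question.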