How to parse XML into a list of dicts?

Given the sample xml below:

<_Document>
  <_Data1> 'foo'
    <_SubData1> 'bar1' </_SubData1>
    <_SubData2> 'bar2' </_SubData2>
    <_SubData3> 'bar3' </_SubData3>
  </_Data1>
</_Document>

I want to pair each SubData value with the Data1 value in its own dictionary and append each dictionary to a list, so that the output looks something like:

[{Data1: 'foo', SubData1: 'bar1'}, {Data1: 'foo', SubData2: 'bar2'}, {Data1: 'foo', SubData3: 'bar3'}]

My code is:

from lxml import etree
import re

new_records = []
   
for child in root.iter('_Document'): #finding all children with each 'Document' string
    for top_data in child.iter(): #iterating through the entirety of each 'Document' sections tags and text. 
        
        if "Data" in top_data.tag:
            for data in top_data:
                rec = {}
                if data.text is not None and data.text.isspace() is False: #avoiding NoneTypes and empty data.
                    g = data.tag.strip("_") #cleaning up the tag
                    rec[g] = data.text.replace("\n", " ") #cleaning up the value
                 
            for b in re.finditer(r'^_SubData', data.tag): #searching through each 'SubData' contained in a given tag. 
                for subdata in data:
                    subdict = {}
                    if subdata.text is not None: #again preventing NoneTypes
                        z = subdata.tag.strip("_") #tag cleaning
                        subdict[z] = subdata.text.replace("\n", " ") #text cleaning
                    rec.update(subdict) #update the data record dictionary with the subdata
                new_records.append(rec) #appending to the list

This, unfortunately, outputs:

[{Data1: 'foo', SubData3: 'bar3'}]

It only appends the dictionary as it stands after the final update.

I've tried several variations of this, including initializing a list after the first 'if' statement in the second for loop and appending to it on each pass, but that required quite a bit of cleanup at the end to undo the nesting it caused. I've also tried initializing empty dictionaries outside the loops, so that earlier updates would be preserved, and appending those.

I'm curious whether there is some functionality of lxml that I've missed, or a more Pythonic approach to get the desired output.

CodePudding user response：

I offered what I think of as a declarative approach in another solution. If you're more comfortable explicitly defining the structure with loops, here's an imperative approach:

from xml.etree import ElementTree as ET
import pprint

new_records = []

document = ET.parse('input.xml').getroot()

for data in document:
    if data.tag.startswith('_Data'):
        data_name = data.tag[1:]  # skip leading '_'
        data_val = (data.text or '').strip()  # text is None for an empty element

        for subdata in data:  # separate name, so the outer loop variable isn't shadowed
            if subdata.tag.startswith('_SubData'):
                subdata_name = subdata.tag[1:]
                subdata_val = (subdata.text or '').strip()

                # one new dict per _SubData, appended inside the loop
                new_records.append(
                    {data_name: data_val, subdata_name: subdata_val}
                )

pprint.pprint(new_records)

The input and output are the same as in my other solution.
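
As for why the question's version ends up with a single record: as far as I can tell, it appends once, after the inner loop has finished, instead of once per child. Here is a stripped-down sketch of that difference. It uses plain tuples instead of XML, and the variable names are mine, so treat it as a simplified model of the pattern and not the asker's exact code:

```python
parent_name, parent_val = 'Data1', 'foo'
children = [('SubData1', 'bar1'), ('SubData2', 'bar2'), ('SubData3', 'bar3')]

# Simplified model of the question's pattern: rec is rebuilt on every pass,
# but append() runs once, after the loop, so only the last rec is kept.
appended_after_loop = []
for name, val in children:
    rec = {parent_name: parent_val}
    rec[name] = val
appended_after_loop.append(rec)

# Pattern used in the loops above: append inside the loop, one dict per child.
appended_in_loop = []
for name, val in children:
    appended_in_loop.append({parent_name: parent_val, name: val})

print(appended_after_loop)    # [{'Data1': 'foo', 'SubData3': 'bar3'}]
print(len(appended_in_loop))  # 3
```

Moving the append into the innermost loop, and building the dict there, is the whole fix.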

CodePudding user response：

You can do this with Python's built-in ElementTree module and its iterparse() function, which walks an XML tree and yields an (event, element) pair for every step through the tree. We listen for when it starts parsing an element, and if it's _Data... or _SubData... we act.

This is a declarative approach, and it relies on _SubData only ever being a child of _Data, that is, on your very small and simple sample being exactly representative of what you're actually dealing with.

You'll need to manage a little state for the _Data elements, but that's it:

from xml.etree import ElementTree as ET
import pprint

new_records = []
data_name = None
data_val = None

for event, elem in ET.iterparse('input.xml', ['start']):
    tag_name = elem.tag.lstrip('_')  # drop any leading '_'

    if event == 'start' and tag_name.startswith('Data'):
        data_name = tag_name
        data_val = elem.text.strip()

    if event == 'start' and tag_name.startswith('SubData'):
        subdata_name = tag_name
        subdata_val = elem.text.strip()
        record = {
            data_name: data_val, subdata_name: subdata_val
        }
        new_records.append(record)

pprint.pprint(new_records)

I modified your sample; this is my input.xml:

<_Document>
  <_Data1>foo
    <_SubData1>bar1</_SubData1>
    <_SubData2>bar2</_SubData2>
    <_SubData3>bar3</_SubData3>
  </_Data1>
  <_Data2>FOO
    <_SubData1>BAR1</_SubData1>
    <_SubData2>BAR2</_SubData2>
    <_SubData3>BAR3</_SubData3>
  </_Data2>
</_Document>

When I run my script on that input, I get:

[{'Data1': 'foo', 'SubData1': 'bar1'},
 {'Data1': 'foo', 'SubData2': 'bar2'},
 {'Data1': 'foo', 'SubData3': 'bar3'},
 {'Data2': 'FOO', 'SubData1': 'BAR1'},
 {'Data2': 'FOO', 'SubData2': 'BAR2'},
 {'Data2': 'FOO', 'SubData3': 'BAR3'}]
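
One caveat: the ElementTree documentation only guarantees an element's attributes at a 'start' event; the contents of text and tail are undefined at that point. Reading elem.text on 'start' works on a small file like this one, where the whole document has usually been read before the first event is yielded, but I wouldn't count on it for large inputs. Here is a sketch of a variant that remembers the current _Data element on 'start' and reads text only on the 'end' of each _SubData, by which time both texts have been parsed. The function name and the in-memory sample are my own, for illustration:

```python
import io
from xml.etree import ElementTree as ET

SAMPLE = b"""\
<_Document>
  <_Data1>foo
    <_SubData1>bar1</_SubData1>
    <_SubData2>bar2</_SubData2>
  </_Data1>
  <_Data2>FOO
    <_SubData1>BAR1</_SubData1>
  </_Data2>
</_Document>
"""


def records_from_stream(source):
    """Build one record per _SubData, reading text only at 'end' events."""
    records = []
    current_data = None  # the _Data element we are currently inside

    for event, elem in ET.iterparse(source, events=('start', 'end')):
        name = elem.tag.lstrip('_')
        if event == 'start' and name.startswith('Data'):
            current_data = elem
        elif (event == 'end' and name.startswith('SubData')
              and current_data is not None):
            records.append({
                current_data.tag.lstrip('_'): (current_data.text or '').strip(),
                name: (elem.text or '').strip(),
            })
    return records


print(records_from_stream(io.BytesIO(SAMPLE)))
```

It makes the same structural assumption as before: every _SubData sits directly inside a _Data element.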

CodePudding user response：

Consider a list comprehension that merges two one-item dictionaries with ** unpacking:

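# root is the parsed _Document element
new_records = [
    {
        **{data.tag.replace('_', ''): data.text.strip().replace("'", "")},
        **{subdata.tag.replace('_', ''): subdata.text.strip().replace("'", "")}
    }
    for data in root.iterfind('*')      # each _Data child of _Document
    for subdata in data.iterfind('*')   # each _SubData child of that _Data
]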

new_records
[{'Data1': 'foo', 'SubData1': 'bar1'},
 {'Data1': 'foo', 'SubData2': 'bar2'},
 {'Data1': 'foo', 'SubData3': 'bar3'}]
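
The snippet above assumes root already holds the parsed _Document element. Here is a self-contained sketch of the same idea, assuming the standard-library ElementTree and the question's sample with its literal quotes. The clean helper is a name I made up for illustration:

```python
from xml.etree import ElementTree as ET

SAMPLE = """\
<_Document>
  <_Data1> 'foo'
    <_SubData1> 'bar1' </_SubData1>
    <_SubData2> 'bar2' </_SubData2>
    <_SubData3> 'bar3' </_SubData3>
  </_Data1>
</_Document>
"""


def clean(text):
    """Trim whitespace, then the literal quotes used in the question's sample."""
    return (text or '').strip().strip("'")


root = ET.fromstring(SAMPLE)

new_records = [
    {data.tag.lstrip('_'): clean(data.text), sub.tag.lstrip('_'): clean(sub.text)}
    for data in root.iterfind('*')   # each _Data child of _Document
    for sub in data.iterfind('*')    # each _SubData child of that _Data
]

print(new_records)
```

A single dict literal with two keys gives the same result as merging two one-item dicts, since the two tag names differ.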