Parsing subfields in XML and merging with matching columns

This is a follow-up to an earlier question that got lost among the many other topics on this forum. Maybe I presented the question in too complicated a way. Since then I have improved and simplified the approach. To sum up: I'd like to extract data from subfields in multiple XML files and attach them to a new DataFrame at matching positions.

This is a sample XML-1:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
  <Qfl>1808</Qfl>
  <fOVE>13.7</fOVE>
  <NetoVolumen>613</NetoVolumen>
  <Hv>104.2</Hv>
  <energenti>
    <energent>
      <sifra>energy_e</sifra>
      <naziv>EE [kWh]</naziv>
      <vrednost>238981</vrednost>
    </energent>
    <energent>
      <sifra>energy_to</sifra>
      <naziv>Do</naziv>
      <vrednost>16359</vrednost>
    </energent>
  <rei>
    <zavetrovanost>2</zavetrovanost>
    <cone>
      <cona>
        <cona_id>1</cona_id>
        <cc_si_cona>1110000</cc_si_cona>
        <visina_cone>2.7</visina_cone>
        <dolzina_cone>14</dolzina_cone>
      </cona>
      <cona>
        <cona_id>2</cona_id>
        <cc_si_cona>120000</cc_si_cona>
      </cona>
  </rei>
</reiXmlPrenos>

This is a sample XML-2:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
  <Qfl>1808</Qfl>
  <fOVE>13.7</fOVE>
  <NetoVolumen>613</NetoVolumen>
  <Hv>104.2</Hv>
  <energenti>
    <energent>
      <sifra>energy_e</sifra>
      <naziv>EE [kWh]</naziv>
      <vrednost>424242</vrednost>
    </energent>
    <energent>
      <sifra>energy_en</sifra>
      <naziv>Do</naziv>
      <vrednost>29</vrednost>
    </energent>
  <rei>
    <zavetrovanost>2</zavetrovanost>
    <cone>
      <cona>
        <cona_id>1</cona_id>
        <cc_si_cona>1110000</cc_si_cona>
        <visina_cone>2.7</visina_cone>
        <dolzina_cone>14</dolzina_cone>
      </cona>
      <cona>
        <cona_id>2</cona_id>
        <cc_si_cona>120000</cc_si_cona>
      </cona>
  </rei>
</reiXmlPrenos>

My code:

import xml.etree.ElementTree as ETree
import pandas as pd

xmldata = r"C:\...\S1.xml"
prstree = ETree.parse(xmldata)
root = prstree.getroot()


# print(root)
store_items = []
all_items = []

for storeno in root.iter('energent'):
    
    cona_sifra = storeno.find('sifra').text
    cona_vrednost = storeno.find('vrednost').text
    store_items = [cona_sifra, cona_vrednost]
    all_items.append(store_items)

xmlToDf = pd.DataFrame(all_items, columns=[
'sifra', 'vrednost'])

print(xmlToDf.to_string(index=False))

This results in:

    sifra        vrednost
 energy_e         238981
energy_to          16359

That is fine for one example, but I have 1,000 XML files, and I would like to 1) have all results in one row per XML file and 2) differentiate between the different 'sifra' codes.

The codes can be, e.g., energy_e, energy_en, energy_to.

So ideally the final df would look like this:

xml      energy_e   energy_en   energy_to
xml-1    238981     0           16359
xml-2    424242     29          0

Can it be done?

CodePudding user response：

If I understand the situation correctly, this can be done, but because of the complexity I would use lxml here instead of ElementTree.

I'll try to annotate the code a bit, but you'll really have to read up on this.

By the way, the two XML files you posted are not well formed (the closing tags for <energenti> and <cone> are missing), but assuming that is fixed, try this:

import pandas as pd
from lxml import etree

xmls = [xml_1, xml_2]
# note: for simplicity, xml_1 and xml_2 stand for the well-formed versions of the XML strings in your question; you'll have to use actual file names and paths
energies = ["xml", "energy_e", "energy_en", "energy_to", "whatever"]
# I just made up some names - you'll have to use actual names, of course; the first one is for the file identifier - see below
rows = []
for idx, xml in enumerate(xmls, start=1):
    row = []
    file_id = f"xml-{idx}"
    # this creates the file identifier
    row.append(file_id)
    root = etree.XML(xml.encode())
    # in real life, you'll have to use the parse() method

    for energy in energies[1:]:
        # the '[1:]' is used to skip the first "energy"; it's only used as the file identifier
        target = root.xpath(f'//energent[./sifra[.="{energy}"]]/vrednost/text()')
        # note the use of f-strings
        row.append(target[0] if target else "0")
        # take the first matching value, or "0" when the code is absent from this file
    rows.append(row)

print(pd.DataFrame(rows, columns=energies))

Output:

    xml    energy_e energy_en energy_to  whatever
0  xml-1   238981         0     16359        0
1  xml-2   424242        29         0        0
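
For the "real life" case with files on disk, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses the standard-library ElementTree instead of lxml (to avoid the extra dependency), the helper names energy_row and sample_xml are made up, and the two temporary files contain a trimmed, well-formed stand-in for the XML in the question.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

CODES = ["energy_e", "energy_en", "energy_to"]  # sifra codes named in the question


def energy_row(path, codes=CODES):
    """Hypothetical helper: build one dict per XML file, defaulting every code to 0."""
    root = ET.parse(path).getroot()
    row = {"xml": Path(path).stem}
    row.update({code: 0 for code in codes})
    for energent in root.iter("energent"):
        code = energent.findtext("sifra")
        value = energent.findtext("vrednost")
        if code in codes and value is not None:
            row[code] = int(value)
    return row


def sample_xml(pairs):
    """Hypothetical helper: a minimal well-formed stand-in for the question's XML."""
    items = "".join(
        f"<energent><sifra>{code}</sifra><vrednost>{value}</vrednost></energent>"
        for code, value in pairs
    )
    return f"<reiXmlPrenos><energenti>{items}</energenti></reiXmlPrenos>"


# Assumption: the files live in one directory; a temporary one is used here.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "xml-1": [("energy_e", 238981), ("energy_to", 16359)],
        "xml-2": [("energy_e", 424242), ("energy_en", 29)],
    }
    for name, pairs in samples.items():
        Path(tmp, f"{name}.xml").write_text(sample_xml(pairs), encoding="utf-8")

    # Parse while the files still exist; sorting keeps the row order stable.
    rows = [energy_row(p) for p in sorted(Path(tmp).glob("*.xml"))]

for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should then give one row per file with one column per code.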

CodePudding user response：

Simply use pandas.read_xml (available since pandas 1.3), since the part of the XML you need is a flat part of the document:

energy_df = pd.read_xml("Input.xml", xpath=".//energent")                  # IF lxml INSTALLED

energy_df = pd.read_xml("Input.xml", xpath=".//energent", parser="etree")  # IF lxml NOT INSTALLED

And to bind across many XML files, simply build a list of data frames from a list of XML file paths, adding a column for source file, and then run pandas.concat to row bind all into a single data frame:

xml_files = [...]

energy_dfs = [
    pd.read_xml(f, xpath=".//energent", parser="etree").assign(source=f) for f in xml_files
]

energy_long_df = pd.concat(energy_dfs, ignore_index=True)

To get your desired output, you can then pivot the sifra values into columns with pivot_table:

energy_wide_df = energy_long_df.pivot_table(
    values="vrednost", index="source", columns="sifra", aggfunc="sum"
)
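
A code that is missing from a file ends up as NaN after the pivot, while the desired output in the question shows 0. One possible way to handle that, sketched here on a small in-memory frame that stands in for energy_long_df (the values are copied from the two sample files, and the source labels are assumed), is the fill_value argument of pivot_table:

```python
import pandas as pd

# Stand-in for the concatenated long frame; the "source" labels are assumed.
energy_long_df = pd.DataFrame(
    {
        "sifra": ["energy_e", "energy_to", "energy_e", "energy_en"],
        "vrednost": [238981, 16359, 424242, 29],
        "source": ["xml-1", "xml-1", "xml-2", "xml-2"],
    }
)

energy_wide_df = (
    energy_long_df.pivot_table(
        values="vrednost",
        index="source",
        columns="sifra",
        aggfunc="sum",
        fill_value=0,  # codes absent from a file become 0 instead of NaN
    )
    .rename_axis(columns=None)  # drop the "sifra" label from the column axis
    .reset_index()              # turn "source" back into a regular column
)

print(energy_wide_df)
```

The reset_index call is optional; it only makes the file identifier an ordinary column, as in the layout the question asks for.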