Convert XML to DataFrame using a Python script

I'm trying to convert the XML data below to a DataFrame.

<?xml version="1.0" encoding="utf-8"?>
<TEST>
    <Node1L1>1</Node1L1>
    <Node2L1>FP</Node2L1>
    <SUBL1>
        <M>
            <PAR>
                <NAME>A</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>1,2,3,4,5,6</VAL>
            </PAR>
            <PAR>
                <NAME>B</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>10,20,30,40,50,60</VAL>
            </PAR>
            <PAR>
                <NAME>C</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>11,22,33,44,55,66</VAL>
            </PAR>
            <PAR>
                <NAME>D</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>a,b,c,d,e,f</VAL>
            </PAR>
            <PAR>
                <NAME>E</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>aa,bb,cc,dd,ee,ff</VAL>
            </PAR>
        </M>
        <M>
            <PAR>
                <NAME>A_test</NAME>
                <VAL>0.0,0.1,0.2,0.3,0.4,0.5</VAL>
            </PAR>
        </M>            
    </SUBL1>
</TEST>

I need to extract only the PAR child nodes of the first M tag, and only those named A, C, and E.

This is just a sample; the real file is large, with many PAR tags spread across 2 M tags. I was able to convert the XML using the code below, but it also picks up the PAR tags from the second M tag.

df = pd.read_xml(path2file, xpath="//*[local-name()='PAR']")

I'm trying to refine the xpath= string so that only the first M tag's data is extracted into the DataFrame. If there is an alternative method, please let me know. I would also like to leave out the empty DESC column.

CodePudding user response：

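Here is an approach using xml.etree.ElementTree: take the first M element, keep the PAR children whose NAME is A, C, or E, and skip child elements with no text (such as DESC).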

import xml.etree.ElementTree as ET
import pandas as pd


xml = '''<?xml version="1.0" encoding="utf-8"?>
<TEST>
    <Node1L1>1</Node1L1>
    <Node2L1>FP</Node2L1>
    <SUBL1>
        <M>
            <PAR>
                <NAME>A</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>1,2,3,4,5,6</VAL>
            </PAR>
            <PAR>
                <NAME>B</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>10,20,30,40,50,60</VAL>
            </PAR>
            <PAR>
                <NAME>C</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>11,22,33,44,55,66</VAL>
            </PAR>
            <PAR>
                <NAME>D</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>a,b,c,d,e,f</VAL>
            </PAR>
            <PAR>
                <NAME>E</NAME>
                <TYPE>f</TYPE>
                <DESC />
                <VAL>aa,bb,cc,dd,ee,ff</VAL>
            </PAR>
        </M>
        <M>
            <PAR>
                <NAME>A_test</NAME>
                <VAL>0.0,0.1,0.2,0.3,0.4,0.5</VAL>
            </PAR>
        </M>            
    </SUBL1>
</TEST>'''

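root = ET.fromstring(xml)
# find() returns the first match only, so this lists the PAR children of the first M
pars = list(root.find('.//M'))
wanted = {'A', 'C', 'E'}
# Keep the wanted PARs; skipping children without text drops the empty DESC element
data = [[p.text for p in par if p.text] for par in pars if par.find('NAME').text in wanted]

df = pd.DataFrame(data, columns=['NAME', 'TYPE', 'VAL'])
print(df)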

output

  NAME TYPE                VAL
0    A    f        1,2,3,4,5,6
1    C    f  11,22,33,44,55,66
2    E    f  aa,bb,cc,dd,ee,ff
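
The list comprehension above depends on every kept PAR having exactly NAME, TYPE, and VAL in that order. A variation that keys each value by its tag name avoids the hard-coded column list. This is only a minimal sketch: SAMPLE is cut-down illustrative data and first_m_records is a hypothetical helper name, neither comes from the question.

```python
import xml.etree.ElementTree as ET

# Cut-down illustrative data (an assumption, not the asker's real file).
# The second M deliberately reuses the name "A" to show it is ignored.
SAMPLE = """<TEST><SUBL1>
<M>
<PAR><NAME>A</NAME><TYPE>f</TYPE><DESC /><VAL>1,2,3</VAL></PAR>
<PAR><NAME>B</NAME><TYPE>f</TYPE><DESC /><VAL>10,20,30</VAL></PAR>
<PAR><NAME>C</NAME><TYPE>f</TYPE><DESC /><VAL>11,22,33</VAL></PAR>
</M>
<M>
<PAR><NAME>A</NAME><VAL>0.0,0.1,0.2</VAL></PAR>
</M>
</SUBL1></TEST>"""


def first_m_records(xml_text, wanted):
    """Return one dict per wanted PAR in the first M element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    first_m = root.find(".//M")  # find() stops at the first matching element
    if first_m is None:
        return []
    records = []
    for par in first_m.findall("PAR"):
        if par.findtext("NAME") in wanted:
            # Children without text (e.g. <DESC />) are left out of the record
            records.append({child.tag: child.text for child in par if child.text})
    return records


records = first_m_records(SAMPLE, {"A", "C"})
```

Passing the result to pd.DataFrame(records) should then build the frame with column names taken from the tags.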

CodePudding user response：

According to the pandas.read_xml() docs, lxml is the default parser.

Unfortunately, lxml (like the built-in xml.etree.ElementTree) doesn't support XPath 2.0, so the generic solution .//M[1]/PAR[NAME=('A','C','E')] won't work. But we can use an XPath 1.0 alternative: .//M[1]/PAR[NAME='A' or NAME='C' or NAME='E'].

The final code would be:

df = pd.read_xml(
    path2file,
    xpath=".//M[1]/PAR[NAME='A' or NAME='C' or NAME='E']"
)

P.S. I haven't tested it, so ping me if it doesn't work for some reason.
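
The XPath alone does not address the empty DESC column the question also asks about. One possible way to handle it, sketched under the assumption that pandas and lxml are installed, is to drop columns that are empty in every selected row after reading. SAMPLE is cut-down illustrative data and read_first_m is a hypothetical helper name; the data is passed as bytes through io.BytesIO in place of path2file.

```python
import io

import pandas as pd

# Cut-down illustrative data (an assumption, not the asker's real file)
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<TEST><SUBL1>
<M>
<PAR><NAME>A</NAME><TYPE>f</TYPE><DESC /><VAL>1,2,3</VAL></PAR>
<PAR><NAME>B</NAME><TYPE>f</TYPE><DESC /><VAL>10,20,30</VAL></PAR>
<PAR><NAME>C</NAME><TYPE>f</TYPE><DESC /><VAL>11,22,33</VAL></PAR>
<PAR><NAME>E</NAME><TYPE>f</TYPE><DESC /><VAL>aa,bb,cc</VAL></PAR>
</M>
<M>
<PAR><NAME>A_test</NAME><VAL>0.0,0.1,0.2</VAL></PAR>
</M>
</SUBL1></TEST>"""


def read_first_m(source):
    """Read the A, C, E rows of the first M and drop all-empty columns (hypothetical helper)."""
    df = pd.read_xml(
        source,
        xpath=".//M[1]/PAR[NAME='A' or NAME='C' or NAME='E']",
        parser="lxml",  # the 'or' predicate needs lxml's XPath 1.0 support
    )
    # DESC has no text in any selected row, so it is entirely missing and gets dropped
    return df.dropna(axis=1, how="all")


df = read_first_m(io.BytesIO(SAMPLE))
```

Note that dropna(axis=1, how="all") removes any column with no values at all, not just DESC, so df.drop(columns="DESC") may be the safer choice if other columns can legitimately be empty.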