I'm trying to convert the XML data below to a pandas DataFrame.
<?xml version="1.0" encoding="utf-8"?>
<TEST>
<Node1L1>1</Node1L1>
<Node2L1>FP</Node2L1>
<SUBL1>
<M>
<PAR>
<NAME>A</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>1,2,3,4,5,6</VAL>
</PAR>
<PAR>
<NAME>B</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>10,20,30,40,50,60</VAL>
</PAR>
<PAR>
<NAME>C</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>11,22,33,44,55,66</VAL>
</PAR>
<PAR>
<NAME>D</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>a,b,c,d,e,f</VAL>
</PAR>
<PAR>
<NAME>E</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>aa,bb,cc,dd,ee,ff</VAL>
</PAR>
</M>
<M>
<PAR>
<NAME>A_test</NAME>
<VAL>0.0,0.1,0.2,0.3,0.4,0.5</VAL>
</PAR>
</M>
</SUBL1>
</TEST>
I need to extract only the PAR child nodes of the first M tag, and only those with the names A, C and E.
This is just a sample; the real file is large, with lots of PAR tags spread across 2 M tags. I was able to do the XML conversion using the code below, but it also picks up the PAR tags of the second M tag.
df = pd.read_xml(path2file, xpath="//*[local-name()='PAR']")
I'm trying to find a way to improve the xpath= string so that it extracts only the first M tag's data into a dataframe.
If there is an alternative method, please let me know. I would also like to avoid the empty DESC column.
CodePudding user response:
See below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="utf-8"?>
<TEST>
<Node1L1>1</Node1L1>
<Node2L1>FP</Node2L1>
<SUBL1>
<M>
<PAR>
<NAME>A</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>1,2,3,4,5,6</VAL>
</PAR>
<PAR>
<NAME>B</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>10,20,30,40,50,60</VAL>
</PAR>
<PAR>
<NAME>C</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>11,22,33,44,55,66</VAL>
</PAR>
<PAR>
<NAME>D</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>a,b,c,d,e,f</VAL>
</PAR>
<PAR>
<NAME>E</NAME>
<TYPE>f</TYPE>
<DESC />
<VAL>aa,bb,cc,dd,ee,ff</VAL>
</PAR>
</M>
<M>
<PAR>
<NAME>A_test</NAME>
<VAL>0.0,0.1,0.2,0.3,0.4,0.5</VAL>
</PAR>
</M>
</SUBL1>
</TEST>'''
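root = ET.fromstring(xml)
# find() returns only the first match, so this is the first <M>; list() yields its <PAR> children
pars = list(root.find('.//M'))
# keep the PARs named A, C or E; "if p.text" drops empty children such as <DESC />
data = [[p.text for p in par if p.text] for par in pars if par.findtext('NAME') in ['A', 'C', 'E']]
df = pd.DataFrame(data, columns=['NAME', 'TYPE', 'VAL'])
print(df)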
output
  NAME TYPE                VAL
0    A    f        1,2,3,4,5,6
1    C    f  11,22,33,44,55,66
2    E    f  aa,bb,cc,dd,ee,ff
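A variation on the same idea, in case the PAR nodes in the real file do not all have the same children: keying each value by its tag name avoids hard-coding the column list. This is only a sketch on a trimmed copy of the sample; the names wanted and records are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question (assumed to have the same layout as the real file).
xml = """<TEST>
  <SUBL1>
    <M>
      <PAR><NAME>A</NAME><TYPE>f</TYPE><DESC /><VAL>1,2,3,4,5,6</VAL></PAR>
      <PAR><NAME>B</NAME><TYPE>f</TYPE><DESC /><VAL>10,20,30,40,50,60</VAL></PAR>
      <PAR><NAME>C</NAME><TYPE>f</TYPE><DESC /><VAL>11,22,33,44,55,66</VAL></PAR>
    </M>
    <M>
      <PAR><NAME>A_test</NAME><VAL>0.0,0.1,0.2,0.3,0.4,0.5</VAL></PAR>
    </M>
  </SUBL1>
</TEST>"""

wanted = {"A", "C", "E"}  # names to keep; "E" is simply absent from this trimmed sample

root = ET.fromstring(xml)
first_m = root.find(".//M")  # find() returns only the first match in document order

records = []
for par in first_m.findall("PAR"):
    if par.findtext("NAME") in wanted:
        # Key each value by its tag; children without text (e.g. <DESC />) are skipped.
        records.append({child.tag: child.text for child in par if child.text})

print(records)
```

Passing records to pd.DataFrame(records) should then give one column per tag that actually carried text, so the empty DESC column never appears.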
CodePudding user response:
From the docs of pandas.read_xml() you can see that lxml is used as the default parser.
Unfortunately, lxml (like the built-in xml.etree.ElementTree) doesn't support XPath 2.0, so the generic solution .//M[1]/PAR[NAME=('A','C','E')] won't work. But we can use an XPath 1.0 alternative: .//M[1]/PAR[NAME='A' or NAME='C' or NAME='E'].
The final code would be:
df = pd.read_xml(
path2file,
xpath=".//M[1]/PAR[NAME='A' or NAME='C' or NAME='E']"
)
P.S. I haven't tested it, so ping me if it doesn't work for some reason.
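For a longer list of names, the or-chain can be built from a Python list, and the all-empty DESC column can be dropped after parsing. A rough sketch on a trimmed copy of the sample, assuming lxml is installed (it is the default parser) and that the names contain no quote characters; the variables names, predicate and xpath are just for illustration.

```python
import io

import pandas as pd

# Trimmed copy of the sample from the question.
xml = """<TEST>
  <SUBL1>
    <M>
      <PAR><NAME>A</NAME><TYPE>f</TYPE><DESC /><VAL>1,2,3,4,5,6</VAL></PAR>
      <PAR><NAME>B</NAME><TYPE>f</TYPE><DESC /><VAL>10,20,30,40,50,60</VAL></PAR>
      <PAR><NAME>C</NAME><TYPE>f</TYPE><DESC /><VAL>11,22,33,44,55,66</VAL></PAR>
      <PAR><NAME>E</NAME><TYPE>f</TYPE><DESC /><VAL>aa,bb,cc,dd,ee,ff</VAL></PAR>
    </M>
    <M>
      <PAR><NAME>A_test</NAME><VAL>0.0,0.1,0.2,0.3,0.4,0.5</VAL></PAR>
    </M>
  </SUBL1>
</TEST>"""

names = ["A", "C", "E"]

# Build the XPath 1.0 predicate: NAME='A' or NAME='C' or NAME='E'
# (assumes the names contain no quote characters)
predicate = " or ".join(f"NAME='{n}'" for n in names)
xpath = f".//M[1]/PAR[{predicate}]"

# StringIO stands in for the file path used in the question.
df = pd.read_xml(io.StringIO(xml), xpath=xpath)

# <DESC /> is empty in every row, so it parses as an all-NaN column; drop such columns.
df = df.dropna(axis=1, how="all")
print(df)
```

Note that dropna(axis=1, how="all") removes any column that is empty in every selected row, not just DESC; df.drop(columns="DESC") would be the more targeted choice if only that column should go.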