I'm trying to extract data from several thousand XML files and combine it into a single DataFrame.
The code I have so far handles a single XML file:
from lxml import etree
import pandas as pd
serial = ["S1.xml"]
content = serial.encode('utf-8')
doc = etree.XML(content)
targets = doc.xpath('/reiXmlPrenos')
data = []
for target in targets:
    data.append(target.xpath("./@A")[0])
    data.append(target.xpath("./@z")[0])
columns = ['A', 'Z']
pd.DataFrame([data],columns=columns)
The XML file looks like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<Qf>255340</Qf>
<Qp>597451</Qp>
<CO2>126660</CO2>
<A>2362.8</A>
<Ht>0.336</Ht>
<f0>0.59</f0>
<z>0.105891</z>
</reiXmlPrenos>
I'd like the final DataFrame to look like this:
            A         z
S1.xml   2362  0.105891
S2.xml    ...       ...
...
The error I'm getting is:
line 16, in <module>
content = serial.encode('utf-8')
AttributeError: 'list' object has no attribute 'encode'
Can you please point out the mistake I'm making, and then expand the code so that it loads all XML files in the same folder?
CodePudding user response:
from lxml import etree
import pandas as pd
serial = ["tmp.xml", "S2.xml"]
columns = ["file",'A', 'Z']
all_data = []
for item in serial:
    data = []
    data.append(item)
    with open(item, 'r') as file:
        content = file.read().encode('utf-8')
        doc = etree.XML(content)
        # add a predicate to make sure both A and z exist
        targets = doc.xpath('/reiXmlPrenos[A and z]')
        for target in targets:
            data.append(target.xpath("./A")[0].text)
            data.append(target.xpath("./z")[0].text)
    all_data.append(data)
df = pd.DataFrame(all_data,columns=columns)
print(df)
Result
      file       A         Z
0  tmp.xml  2362.8  0.105891
1   S2.xml  2362.8  0.105891
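As for the original error: `serial` is a list of file names, so it has no `.encode()` method, and encoding a file name would only give you the name, not the file's contents. The XPath `./@A` also looks for an attribute, whereas `A` and `z` are child elements in the sample file. Here is a minimal sketch of that difference, using an inline copy of the sample XML. The `SAMPLE` constant and the fallback to the standard-library `xml.etree.ElementTree` are assumptions added for illustration so the snippet also runs without lxml.

```python
try:
    from lxml import etree  # the parser used in the answer above
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

# Inline copy of the sample file from the question, trimmed to two fields.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXmlPrenos>
<A>2362.8</A>
<z>0.105891</z>
</reiXmlPrenos>"""

root = etree.XML(SAMPLE)  # parse the document bytes, not a list of names

# A and z are child elements, so an attribute lookup finds nothing...
attr_a = root.get("A")
# ...while an element lookup returns the text content.
elem_a = root.findtext("A")
elem_z = root.findtext("z")

print(attr_a, elem_a, elem_z)
```

Note that the values come back as strings; convert them with `float()` if you need numbers.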
CodePudding user response:
Using only Pandas (lxml under the hood):
import pandas as pd
# file S1 same as S2, for demonstration
serial = ["S1.xml", "S2.xml"]
# To save memory, build the DataFrames lazily in a generator, then concatenate them once.
df = pd.concat((pd.read_xml(file, xpath='//reiXmlPrenos')[['A', 'z']] for file in serial))
# Add the file names as a column (this assumes one row per file), then use it as the index.
df['serial'] = serial
df = df.set_index('serial')
print(df)
             A         z
serial
S1.xml  2362.8  0.105891
S2.xml  2362.8  0.105891
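The file list here is hard-coded, while the question asks for every XML file in a folder. One way to build `serial` is with `pathlib`, sketched below. The helper name `list_xml_files` and its `folder` parameter are hypothetical, chosen only for this example.

```python
from pathlib import Path


def list_xml_files(folder="."):
    """Return the paths of all *.xml files directly inside `folder`, sorted by name."""
    # glob("*.xml") does not recurse into subfolders.
    return sorted(Path(folder).glob("*.xml"))
```

You could then replace the hard-coded list with something like `serial = [str(p) for p in list_xml_files(".")]`. If the folder contains no XML files the list is empty, and `pd.concat` raises a `ValueError` on an empty sequence, so it may be worth checking for that first.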
CodePudding user response:
To load an XML file with lxml, call lxml.etree.ElementTree with the file name. The file is parsed immediately and the resulting tree object holds the data:
import lxml.etree
tree = lxml.etree.ElementTree(file='myfile.xml')
To access the data, use the tree's methods. For example, getroot() returns the root element of the XML file:
root = tree.getroot()
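Building on that, here is a minimal sketch of how `getroot()` can feed the table asked for in the question: it parses each file, reads the `A` and `z` child elements, and collects one row per file. The function name `collect_rows`, its parameters, and the standard-library fallback import are assumptions for illustration.

```python
from pathlib import Path

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree


def collect_rows(paths, fields=("A", "z")):
    """Return one dict per XML file with the requested child element values."""
    rows = []
    for path in map(Path, paths):
        tree = etree.ElementTree(file=str(path))
        root = tree.getroot()
        row = {"file": path.name}
        for field in fields:
            text = root.findtext(field)  # None if the element is missing
            row[field] = float(text) if text is not None else None
        rows.append(row)
    return rows
```

The resulting list can be passed to pandas, for example `pd.DataFrame(collect_rows(["S1.xml", "S2.xml"])).set_index("file")`, to get one row per file.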