I have 100,000 XML files that look like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
<program>BB</program>
<nazivStavbe>Test build</nazivStavbe>
<X>101000</X>
<Y>462000</Y>
<QNH>24788</QNH>
<QNC>9698</QNC>
<Qf>255340</Qf>
<Qp>597451</Qp>
<CO2>126660</CO2>
<An>1010.7</An>
<Vc>3980</Vc>
<A>2362.8</A>
<Ht>0.336</Ht>
<f0>0.59</f0>
...
</reiXml>
I want to extract around 10 numbers from each file, e.g. An, Vc, ..., but I have a problem because the XML files don't seem to have a root name. I looked at other cases on this forum, but I can't seem to replicate their solutions (e.g. link).
So I basically have two problems: 1) how to read multiple XML files, and 2) how to extract certain values from them and put them in one txt file with 100,000 rows :(
The final result would be something like:
           An      Vc
XMLfile1   1010.7  3980
XMLfile2   ...     ...
XMLfile3   ...     ...
CodePudding user response:
You can try BeautifulSoup to parse the XML files (its "xml" parser requires lxml to be installed):
xml_doc = """\
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
<program>BB</program>
<nazivStavbe>Test build</nazivStavbe>
<X>101000</X>
<Y>462000</Y>
<QNH>24788</QNH>
<QNC>9698</QNC>
<Qf>255340</Qf>
<Qp>597451</Qp>
<CO2>126660</CO2>
<An>1010.7</An>
<Vc>3980</Vc>
<A>2362.8</A>
<Ht>0.336</Ht>
<f0>0.59</f0>
</reiXml>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_doc, "xml")
print(soup.An.text)
print(soup.Vc.text)
Prints:
1010.7
3980
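As a possible alternative that avoids extra dependencies, here is a minimal sketch using the standard library's xml.etree.ElementTree. It is a swapped-in technique, not part of the answer above. Note that reiXml is itself the root element of the sample, so the tags can be looked up as direct children of the root. The helper name extract_fields and the shortened sample document are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document, as bytes so the declared
# encoding is handled by the parser.
xml_doc = b"""<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
<program>BB</program>
<An>1010.7</An>
<Vc>3980</Vc>
</reiXml>"""


def extract_fields(xml_bytes, tags):
    """Return {tag: text} for each requested direct child of the root.

    Hypothetical helper; a missing tag maps to None.
    """
    root = ET.fromstring(xml_bytes)  # root is the <reiXml> element
    return {tag: root.findtext(tag) for tag in tags}


values = extract_fields(xml_doc, ["An", "Vc"])
print(values)
```

The values come back as strings; convert them with float() if you need numbers.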
EDIT: To create a dataframe:
import pandas as pd
from bs4 import BeautifulSoup
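files = ["file1.xml"]  # ...other files
all_data = []
for file in files:
    # Open in binary mode so the parser honors the encoding declared in the XML
    with open(file, "rb") as f_in:
        soup = BeautifulSoup(f_in.read(), "xml")
    # Collect one row per file with the values of interest
    all_data.append({"file": file, "An": soup.An.text, "Vc": soup.Vc.text})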
df = pd.DataFrame(all_data).set_index("file")
df.index.name = None
print(df)
Prints:
               An    Vc
file1.xml  1010.7  3980
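For the full task in the question (many files in, one txt file out), a rough sketch could look like the following. It uses only the standard library (glob, csv, xml.etree.ElementTree) instead of BeautifulSoup and pandas, and streams rows to disk rather than collecting them in memory. The function names, the "*.xml" pattern, the tab delimiter, and the TAGS list are assumptions for illustration; adjust them to your files.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET

TAGS = ["An", "Vc"]  # assumption: extend with the other tags you need


def extract_row(path, tags=TAGS):
    """Return [file name, value1, value2, ...] for one XML file."""
    root = ET.parse(path).getroot()
    # findtext returns the default ("") when a tag is missing
    return [os.path.basename(path)] + [root.findtext(t, default="") for t in tags]


def write_table(xml_dir, out_path, tags=TAGS):
    """Write one tab-separated row per *.xml file in xml_dir; return the file count."""
    paths = sorted(glob.glob(os.path.join(xml_dir, "*.xml")))
    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out, delimiter="\t", lineterminator="\n")
        writer.writerow(["file"] + list(tags))  # header row
        for path in paths:
            writer.writerow(extract_row(path, tags))
    return len(paths)
```

Calling write_table("xml_files", "result.txt"), where both names are placeholders, would produce a header line followed by one line per XML file. A malformed file raises xml.etree.ElementTree.ParseError, so with 100,000 inputs you may want to catch that and log the offending path.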