Python XML Parsing without root v2

Time:10-20

I have 100,000 XML files that look like this:

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
  <program>BB</program>
  <nazivStavbe>Test build</nazivStavbe>
  <X>101000</X>
  <Y>462000</Y>
  <QNH>24788</QNH>
  <QNC>9698</QNC>
  <Qf>255340</Qf>
  <Qp>597451</Qp>
  <CO2>126660</CO2>
  <An>1010.7</An>
  <Vc>3980</Vc>
  <A>2362.8</A>
  <Ht>0.336</Ht>
  <f0>0.59</f0>
...
</reiXml>

I want to extract around 10 numbers from each file, e.g. An, Vc, ..., but I am having trouble because the XML files don't seem to have a named root element. I looked at similar questions on this forum, but I can't replicate their solutions (e.g. link).

So I basically have two problems: 1) how to read multiple XML files, and 2) how to extract certain values from each one and write them to a single txt file with 100,000 rows :(

The final result would be something like:

          An     Vc
XMLfile1 1010.7  3980
XMLfile2 ...     ...
XMLfile3 ...     ...

CodePudding user response:

You can try BeautifulSoup to parse the XML files (its "xml" parser requires the lxml package to be installed):

xml_doc = """\
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
  <program>BB</program>
  <nazivStavbe>Test build</nazivStavbe>
  <X>101000</X>
  <Y>462000</Y>
  <QNH>24788</QNH>
  <QNC>9698</QNC>
  <Qf>255340</Qf>
  <Qp>597451</Qp>
  <CO2>126660</CO2>
  <An>1010.7</An>
  <Vc>3980</Vc>
  <A>2362.8</A>
  <Ht>0.336</Ht>
  <f0>0.59</f0>
</reiXml>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_doc, "xml")

print(soup.An.text)
print(soup.Vc.text)

Prints:

1010.7
3980
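
As an aside, the sample file does have a root element: it is <reiXml>, and the values are its direct children. So the standard library's xml.etree.ElementTree may be enough here. The following is only a minimal sketch, assuming every file has the same flat layout as the sample; the shortened document and the variable names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document, as bytes so that the
# encoding declaration in the XML prolog is honoured by the parser.
xml_bytes = b"""<?xml version="1.0" encoding="utf-8" standalone="no"?>
<reiXml>
  <program>BB</program>
  <An>1010.7</An>
  <Vc>3980</Vc>
</reiXml>"""

root = ET.fromstring(xml_bytes)   # root is the <reiXml> element
an = root.findtext("An")          # text of the direct child <An>
vc = root.findtext("Vc")          # text of the direct child <Vc>
missing = root.findtext("Qx")     # None when the tag is absent

print(root.tag, an, vc, missing)
```

findtext returns None instead of raising when a tag is missing, which can be handy if some of the 100,000 files are incomplete.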

EDIT: To create a dataframe:

import pandas as pd
from bs4 import BeautifulSoup

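files = ["file1.xml"]  # add the paths of the other XML files here

all_data = []
for file in files:
    # The files declare UTF-8, so read them with that encoding.
    with open(file, "r", encoding="utf-8") as f_in:
        soup = BeautifulSoup(f_in.read(), "xml")
    # Collect one row per file; .text returns the tag content as a string.
    all_data.append({"file": file, "An": soup.An.text, "Vc": soup.Vc.text})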

df = pd.DataFrame(all_data).set_index("file")
df.index.name = None
print(df)

Prints:

               An    Vc
file1.xml  1010.7  3980
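
Since the question asks for a single txt file with one row per XML file, here is a rough sketch of how the two steps (collecting the files, writing the rows) could be combined using only the standard library. The function names extract_row and write_table, the TAGS list, and the tab-separated layout are assumptions made for illustration, not something from the question; it also assumes all files sit in one directory and end in .xml:

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

TAGS = ["An", "Vc"]  # hypothetical selection; extend with the other tags you need


def extract_row(xml_path, tags=TAGS):
    """Return [file name, value, value, ...] for one XML file."""
    root = ET.parse(xml_path).getroot()
    # A missing tag becomes an empty string instead of raising an error.
    return [Path(xml_path).name] + [root.findtext(tag, default="") for tag in tags]


def write_table(xml_dir, out_path, tags=TAGS):
    """Write one tab-separated row per *.xml file in xml_dir; return the row count."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out, delimiter="\t")
        writer.writerow(["file"] + list(tags))  # header row
        for xml_path in sorted(Path(xml_dir).glob("*.xml")):
            writer.writerow(extract_row(xml_path, tags))
            count += 1
    return count
```

Calling write_table("path/to/xml_dir", "result.txt") would then produce the header plus one row per file. Rows are written as they are parsed, so the whole data set never has to be held in memory.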