Parse large XML in Python


I have a very large XML file (about 100 MB) with multiple elements similar to the one in this example:

    <aixm:DesignatedPoint gml:id="ID_197095_1650420151927_74256">
        <gml:identifier codeSpace="urn:uuid:">084e1bb6-94f7-450f-a88e-44eb465cd5a6</gml:identifier>
            <aixm:DesignatedPointTimeSlice gml:id="ID_197095_1650420151927_74257">
                    <gml:TimePeriod gml:id="ID_197095_1650420151927_74258">
                        <gml:endPosition indeterminatePosition="unknown"/>
                    <gml:TimePeriod gml:id="ID_197095_1650420151927_74259">
                        <gml:endPosition indeterminatePosition="unknown"/>
                    <aixm:Point gml:id="ID_197095_1650420151927_74260">
                        <gml:pos srsName="urn:ogc:def:crs:EPSG::4326">40.87555555555556 21.358055555555556</gml:pos>
                    <adrext:DesignatedPointExtension gml:id="ID_197095_1650420151927_74261">
                            <adrext:PointUsage gml:id="ID_197095_1650420151927_74262">
                                    <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74263">
                                        <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                                        <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>
                            <adrext:PointUsage gml:id="ID_197095_1650420151927_74264">
                                    <adrext:AirspaceBorderCrossingObject gml:id="ID_197095_1650420151927_74265">
                                        <adrext:exitedAirspace xlink:href="urn:uuid:78447f69-9671-41c5-a7b7-bdd82c60e978"/>
                                        <adrext:enteredAirspace xlink:href="urn:uuid:afb35b5b-6626-43ff-9d92-875bbd882c05"/>

The ultimate goal is to parse the data from this very big XML file into a pandas DataFrame.

So far I cannot capture the data I am looking for. I only manage to capture the data from the very last element in the file.

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

ab = {'aixm':'http://www.aixm.aero/schema/5.1.1', 'adrext':'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR', 'gml':'http://www.opengis.net/gml/3.2'}
for point in root.findall('.//aixm:DesignatedPointTimeSlice', ab):
    designator = point.find('.//aixm:designator', ab)
    d = point.find('.//{http://www.aixm.aero/schema/5.1.1}type', ab)
for pos in point.findall('.//gml:pos', ab):
    print(designator.text, pos.text, d.text)

The print statement returns the data I want but, as mentioned, only for the very last element of the file, whereas I would like the result for all of them:

ZIFSA 54.02111111111111 27.823888888888888 ICAO

Could you please advise on the path I should follow? Thank you very much.

CodePudding user response:

Assuming all three needed nodes (aixm:designator, aixm:type, and gml:pos) are always present, consider parsing the parent nodes, aixm:DesignatedPointTimeSlice and aixm:Point, separately and then joining them. Finally, select the three columns needed.

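import pandas as pd

# Namespace prefixes used in the XPath expressions (same map as in the question)
ab = {
    'aixm': 'http://www.aixm.aero/schema/5.1.1',
    'adrext': 'http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR',
    'gml': 'http://www.opengis.net/gml/3.2',
}

time_slice_df = pd.read_xml(
    'file.xml', xpath=".//aixm:DesignatedPointTimeSlice", namespaces=ab
).add_prefix("time_slice_")

point_df = pd.read_xml(
    'file.xml', xpath=".//aixm:Point", namespaces=ab
).add_prefix("point_")

# Join on row position, then keep only the three columns needed
time_slice_df = (
    time_slice_df.join(point_df)
    .reindex(
        ["time_slice_designator", "time_slice_type", "point_pos"],
        axis="columns",
    )
)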
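
As for why the loop in the question prints only one row: the `for pos in ...` loop is not indented under the `for point in ...` loop, so it runs once, after the outer loop has finished, when `point`, `designator` and `d` still hold the values from the last time slice. Below is a minimal sketch of the nested version using only ElementTree. The inline sample XML (including the `aixm:timeSlice` and `aixm:location` wrappers) and the `extract_rows` helper are made up for illustration, on the assumption that each time slice contains `aixm:designator`, `aixm:type` and `gml:pos` as the question's code implies.

```python
import xml.etree.ElementTree as ET

NS = {
    "aixm": "http://www.aixm.aero/schema/5.1.1",
    "gml": "http://www.opengis.net/gml/3.2",
}

# Made-up two-point sample standing in for file.xml (structure is an assumption)
SAMPLE = """<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPoint>
    <aixm:timeSlice>
      <aixm:DesignatedPointTimeSlice>
        <aixm:designator>AAAAA</aixm:designator>
        <aixm:type>ICAO</aixm:type>
        <aixm:location><aixm:Point><gml:pos>1.0 2.0</gml:pos></aixm:Point></aixm:location>
      </aixm:DesignatedPointTimeSlice>
    </aixm:timeSlice>
  </aixm:DesignatedPoint>
  <aixm:DesignatedPoint>
    <aixm:timeSlice>
      <aixm:DesignatedPointTimeSlice>
        <aixm:designator>BBBBB</aixm:designator>
        <aixm:type>ICAO</aixm:type>
        <aixm:location><aixm:Point><gml:pos>3.0 4.0</gml:pos></aixm:Point></aixm:location>
      </aixm:DesignatedPointTimeSlice>
    </aixm:timeSlice>
  </aixm:DesignatedPoint>
</root>"""


def extract_rows(root):
    """Hypothetical helper: one dict per gml:pos inside each time slice."""
    rows = []
    for point in root.findall(".//aixm:DesignatedPointTimeSlice", NS):
        designator = point.findtext(".//aixm:designator", namespaces=NS)
        point_type = point.findtext(".//aixm:type", namespaces=NS)
        # The inner loop is nested, so it runs for every time slice,
        # not just once after the outer loop has ended.
        for pos in point.findall(".//gml:pos", NS):
            rows.append(
                {"designator": designator, "type": point_type, "pos": pos.text}
            )
    return rows


# For the real file: extract_rows(ET.parse('file.xml').getroot())
rows = extract_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Passing the collected list to `pd.DataFrame(rows)` should then give one DataFrame row per point.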

CodePudding user response:

I just answered a similar question here.

The process I used:

  1. create a dict from the xml file (courtesy firelion.cis)
  2. explore the dict to find where the variables are
  3. make a template df and then collect the data
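
The three steps above can be sketched roughly as follows. The helper the answer refers to is not shown here, so `element_to_dict` is a hypothetical stand-in written with the standard library. It drops attributes and namespace prefixes, and the sample XML structure is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up two-point sample standing in for file.xml (structure is an assumption)
SAMPLE = """<root xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
      xmlns:gml="http://www.opengis.net/gml/3.2">
  <aixm:DesignatedPoint>
    <aixm:timeSlice>
      <aixm:DesignatedPointTimeSlice>
        <aixm:designator>AAAAA</aixm:designator>
        <aixm:type>ICAO</aixm:type>
        <aixm:location><aixm:Point><gml:pos>1.0 2.0</gml:pos></aixm:Point></aixm:location>
      </aixm:DesignatedPointTimeSlice>
    </aixm:timeSlice>
  </aixm:DesignatedPoint>
  <aixm:DesignatedPoint>
    <aixm:timeSlice>
      <aixm:DesignatedPointTimeSlice>
        <aixm:designator>BBBBB</aixm:designator>
        <aixm:type>ICAO</aixm:type>
        <aixm:location><aixm:Point><gml:pos>3.0 4.0</gml:pos></aixm:Point></aixm:location>
      </aixm:DesignatedPointTimeSlice>
    </aixm:timeSlice>
  </aixm:DesignatedPoint>
</root>"""


def local_name(tag):
    """Strip the '{namespace}' part ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(elem):
    """Hypothetical stand-in: turn an element into nested dicts, lists and strings."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        key = local_name(child.tag)
        value = element_to_dict(child)
        if key in result:
            # Repeated tag: gather the values in a list
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
    return result


# Step 1: create a dict from the XML
data = element_to_dict(ET.fromstring(SAMPLE))

# Step 2: explore the dict to see where the variables live
print(list(data))                        # top-level keys
print(list(data["DesignatedPoint"][0]))  # keys of the first point

# Step 3: fix the target columns, then collect one row per point
columns = ["designator", "type", "pos"]
rows = []
for point in data["DesignatedPoint"]:
    ts = point["timeSlice"]["DesignatedPointTimeSlice"]
    rows.append(
        {
            "designator": ts["designator"],
            "type": ts["type"],
            "pos": ts["location"]["Point"]["pos"],
        }
    )
```

`pd.DataFrame(rows, columns=columns)` would then produce the DataFrame. Note that with this stand-in a tag that occurs only once is stored as a plain dict rather than a list, so real data may need a check before iterating.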