I have a rather big xml file with multiple different elements, similar to the one bellow:
<adrmsg:ADRMessage xmlns:adrmsg="http://www.eurocontrol.int/cfmu/b2b/ADRMessage"
xmlns:gml="http://www.opengis.net/gml/3.2" gml:id="ID_197112_1650420171084_1"
xmlns:adrext="http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR"
xmlns:aixm="http://www.aixm.aero/schema/5.1.1"
xmlns:xlink="http://www.w3.org/1999/xlink">
<adrmsg:hasMember>
<aixm:Airspace gml:id="ID_197112_1650420171084_93332">
<gml:identifier codeSpace="urn:uuid:">3271922d-6b7a-4953-a6ff-599b17ab785e</gml:identifier>
<aixm:timeSlice>
<aixm:AirspaceTimeSlice gml:id="ID_197112_1650420171084_93333">
<gml:validTime>
<gml:TimePeriod gml:id="ID_197112_1650420171084_93334">
<gml:beginPosition>2021-10-07T00:00:00</gml:beginPosition>
<gml:endPosition indeterminatePosition="unknown"/>
</gml:TimePeriod>
</gml:validTime>
<aixm:interpretation>BASELINE</aixm:interpretation>
<aixm:featureLifetime>
<gml:TimePeriod gml:id="ID_197112_1650420171084_93335">
<gml:beginPosition>2021-10-07T00:00:00</gml:beginPosition>
<gml:endPosition indeterminatePosition="unknown"/>
</gml:TimePeriod>
</aixm:featureLifetime>
<aixm:type>RAS</aixm:type>
<aixm:designator>EDGGNFRA</aixm:designator>
<aixm:name>EDGG NON FRA</aixm:name>
<aixm:designatorICAO>NO</aixm:designatorICAO>
<aixm:geometryComponent>
<aixm:AirspaceGeometryComponent gml:id="ID_197112_1650420171084_93336">
<aixm:operation>BASE</aixm:operation>
<aixm:theAirspaceVolume>
<aixm:AirspaceVolume gml:id="ID_197112_1650420171084_93337">
<aixm:upperLimit uom="FL">265</aixm:upperLimit>
<aixm:upperLimitReference>STD</aixm:upperLimitReference>
<aixm:lowerLimit uom="FL">245</aixm:lowerLimit>
<aixm:lowerLimitReference>STD</aixm:lowerLimitReference>
<aixm:contributorAirspace>
<aixm:AirspaceVolumeDependency gml:id="ID_197112_1650420171084_93338">
<aixm:dependency>HORZ_PROJECTION</aixm:dependency>
<aixm:theAirspace xlink:href="urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb"/>
</aixm:AirspaceVolumeDependency>
</aixm:contributorAirspace>
</aixm:AirspaceVolume>
</aixm:theAirspaceVolume>
</aixm:AirspaceGeometryComponent>
</aixm:geometryComponent>
<aixm:geometryComponent>
<aixm:AirspaceGeometryComponent gml:id="ID_197112_1650420171084_93339">
<aixm:operation>UNION</aixm:operation>
<aixm:theAirspaceVolume>
<aixm:AirspaceVolume gml:id="ID_197112_1650420171084_93340">
<aixm:upperLimit uom="FL">255</aixm:upperLimit>
<aixm:upperLimitReference>STD</aixm:upperLimitReference>
<aixm:lowerLimit uom="FL">245</aixm:lowerLimit>
<aixm:lowerLimitReference>STD</aixm:lowerLimitReference>
<aixm:contributorAirspace>
<aixm:AirspaceVolumeDependency gml:id="ID_197112_1650420171084_93341">
<aixm:dependency>HORZ_PROJECTION</aixm:dependency>
<aixm:theAirspace xlink:href="urn:uuid:dcd8301c-de12-4e6c-992f-fd8de781ab58"/>
</aixm:AirspaceVolumeDependency>
</aixm:contributorAirspace>
</aixm:AirspaceVolume>
</aixm:theAirspaceVolume>
</aixm:AirspaceGeometryComponent>
</aixm:geometryComponent>
<aixm:extension>
<adrext:AirspaceExtension gml:id="ID_197112_1650420171084_93342">
<adrext:usage>OPERATIONAL</adrext:usage>
</adrext:AirspaceExtension>
</aixm:extension>
</aixm:AirspaceTimeSlice>
</aixm:timeSlice>
</aixm:Airspace>
</adrmsg:hasMember>
.... many other <adrmsg:hasMember>
</adrmsg:ADRMessage>
I only added one of those elements the namespaces .
My attempt of code :
import xml.etree.ElementTree as ET
import pandas as pd
ab = {"adrmsg":"http://www.eurocontrol.int/cfmu/b2b/ADRMessage",
"gml":"http://www.opengis.net/gml/3.2",
"adrext":"http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR",
"aixm": "http://www.aixm.aero/schema/5.1.1",
"xlink":"http://www.w3.org/1999/xlink",
"id":"http://www.opengis.net/gml/3.2",
"href":"http://www.w3.org/1999/xlink"
}
root_node = ET.parse('Airspace.xml').getroot()
pipare = []
verate = []
for tag in root_node.findall(".//aixm:Airspace" , ab):
value = tag.find("gml:identifier", ab)
for char in tag.findall(".//aixm:AirspaceTimeSlice", ab):
for per in char.findall(".//aixm:type",ab):
for ir in char.findall(".//aixm:name",ab):
for epa in char.findall(".//aixm:designator", ab):
for op in char.findall(".//aixm:theAirspace[@xlink:href]", ab):
pipare = [value.text, char.attrib,per.text,ir.text,epa.text,op.attrib]
verate.append(pipare)
xml_todf = pd.DataFrame(verate, columns=['uuid','id','type','name','designator','contributorAirspace'])
As you could probably see, I am in a very 'rough' way trying to parse that XML, extract the elements that I am interested in and finally put them into a pandas DataFrame.
When I am 'capturing' the .text the data extracted is what I want, but when it comes about capturing the attributes, the result is not only the values but also the namespaces...I dont know what to do to solve this. Let me share how the pandas DataFrame displays that data:
uuid | id | type | name | designator | contributorAirspace |
---|---|---|---|---|---|
3271922d-6b7a-4953-a6ff-599b17ab785e | {'{http://www.opengis.net/gml/3.2}id': 'ID_197112_1650420171084_93333'} | RAS | EDGG NON FRA | EDGGNFRA | {'{http://www.w3.org/1999/xlink}href': 'urn:uuid:5831b5a2-4861-4bf5-ae99-d31413234cdb'} |
3271922d-6b7a-4953-a6ff-599b17ab785e | {'{http://www.opengis.net/gml/3.2}id': 'ID_197112_1650420171084_93333'} | RAS | EDGG NON FRA | EDGGNFRA | {'{http://www.w3.org/1999/xlink}href': 'urn:uuid:dcd8301c-de12-4e6c-992f-fd8de781ab58'} |
I would like to have ideally something like this:
uuid | id | type | name | designator | contributorAirspace |
---|---|---|---|---|---|
3271922d-6b7a-4953-a6ff-599b17ab785e | 'ID_197112_1650420171084_93333'} | RAS | EDGG NON FRA | EDGGNFRA | 5831b5a2-4861-4bf5-ae99-d31413234cdb , dcd8301c-de12-4e6c-992f-fd8de781ab58 |
but I would be very grateful if somebody could help me reach this point:
uuid | id | type | name | designator | contributorAirspace |
---|---|---|---|---|---|
3271922d-6b7a-4953-a6ff-599b17ab785e | 'ID_197112_1650420171084_93333'} | RAS | EDGG NON FRA | EDGGNFRA | 5831b5a2-4861-4bf5-ae99-d31413234cdb |
3271922d-6b7a-4953-a6ff-599b17ab785e | 'ID_197112_1650420171084_93333'} | RAS | EDGG NON FRA | EDGGNFRA | dcd8301c-de12-4e6c-992f-fd8de781ab58 |
Thanks for your help
CodePudding user response:
Python elementtree requires to address attribute with the namespace by its qualified name (i.e. namespace attribute name). When referring char.attrib
or op.attrib
a dictionary containing all element attributes with their values is retrieved. Here is an example of attribute value retrieval:
import xml.etree.ElementTree as ET
import pandas as pd
from collections import defaultdict
ab = {"adrmsg":"http://www.eurocontrol.int/cfmu/b2b/ADRMessage",
"gml":"http://www.opengis.net/gml/3.2",
"adrext":"http://www.aixm.aero/schema/5.1.1/extensions/EUR/ADR",
"aixm": "http://www.aixm.aero/schema/5.1.1",
"xlink":"http://www.w3.org/1999/xlink",
"id":"http://www.opengis.net/gml/3.2",
"href":"http://www.w3.org/1999/xlink"
}
# parse XML
root_node = ET.fromstring(xml)
# create dictionary to store parsed data
data = defaultdict(list)
for tag in root_node.findall(".//aixm:Airspace" , ab):
value = tag.find("gml:identifier", ab)
for char in tag.findall(".//aixm:AirspaceTimeSlice", ab):
for per in char.findall(".//aixm:type",ab):
for ir in char.findall(".//aixm:name",ab):
for epa in char.findall(".//aixm:designator", ab):
for op in char.findall(".//aixm:theAirspace[@xlink:href]", ab):
data['uuid'].append(value.text)
data['id'].append(char.attrib['{http://www.opengis.net/gml/3.2}id'])
data['type'].append(per.text)
data['name'].append(ir.text)
data['designator'].append(epa.text)
#data['contributorAirspace'].append(op.attrib['{http://www.w3.org/1999/xlink}href'])
df = pd.DataFrame(data)
Note the expressions char.attrib['{http://www.opengis.net/gml/3.2}id']
and op.attrib['{http://www.w3.org/1999/xlink}href']
. They address attrbutes using qualified names and retrieve attrbute values.
Also this example uses defaultdict instead of two lists, but that's matter of taste.