XML, ElementTree - Extract attributes and match them to an ID

Hello everyone and greetings from Germany!

I'm rather new to Python and I have a question concerning XML files. My data looks something like this (there are a lot of elements in this file, each with a unique way ID):

    <way id="4260867" visible="true" version="12" changeset="71461925"
      timestamp="2019-06-20T21:42:40Z" user="L___I" uid="7649834">
      <nd ref="25550395"/>
      <nd ref="25550396"/>
      <tag k="bicycle" v="no"/>
      <tag k="bridge" v="yes"/>
      <tag k="foot" v="no"/>
      <tag k="hazmat" v="designated"/>
      <tag k="highway" v="motorway_link"/>
      <tag k="maxspeed" v="none"/>
      <tag k="motorcar" v="yes"/>
      <tag k="oneway" v="yes"/>
      <tag k="placement" v="middle_of:1"/>
      <tag k="source:maxspeed" v="DE:motorway"/>
    </way>
    <way id="312407268" visible="true" version="9" changeset="116383142"
      timestamp="2022-01-20T12:11:26Z" user="m_p_13" uid="2465271">
      <nd ref="7792523927"/>
      <nd ref="25393142"/>
      <nd ref="5583629192"/>
      <nd ref="25393143"/>
      <tag k="bdouble" v="yes"/>
      <tag k="bicycle" v="no"/>
      <tag k="foot" v="yes"/>
      <tag k="highway" v="secondary"/>
      <tag k="horse" v="yes"/>
      <tag k="lanes" v="2"/>
      <tag k="maxspeed" v="60"/>
      <tag k="motorcar" v="yes"/>
      <tag k="name" v="Messe-Allee"/>
      <tag k="name:etymology:wikidata" v="Q57305"/>
      <tag k="oneway" v="yes"/>
      <tag k="ref" v="K 6529"/>
      <tag k="shoulder" v="no"/>
      <tag k="surface" v="asphalt"/>
    </way>
    <way id="106141287" visible="true" version="3" changeset="101880267"
      timestamp="2021-03-28T16:10:05Z" user="user_2954791" uid="2954791">
      <nd ref="913936737"/>
      <nd ref="1222080363"/>
      <tag k="bicycle" v="designated"/>
      <tag k="cycleway" v="crossing"/>
      <tag k="smoothness" v="intermediate"/>
      <tag k="surface" v="paving_stones"/>
      <tag k="traffic_sign" v="DE:241"/>
    </way>

What I want to do is extract every way ID and match it with its "nd ref" attributes (the node IDs; how many there are differs from way to way) and, if the way contains a particular tag (e.g. "maxspeed"), that tag's value.

So in the end it should look something like this:

    (id, node_ids, maxspeed)
    (4260867, (25550395, 25550396), None)
    (106141287, (913936737, 1222080363), NaN)

I started working with ElementTree and was able to extract the IDs. I can also print out all the attributes of the tag elements via

    for way in root.findall('way'):
        for i in way.findall('tag'):
            print(i.attrib)

But I'm not able to get those values into the form that I want.

I hope I can get some help! Also, if someone has a better way to organize the data than tuples, I would appreciate that! I don't know whether it is important, but I work with PyCharm.

Thank you in advance!

CodePudding user response：

If I understand you correctly, you are probably looking for something like the below. I chose to run it through pandas, just to demonstrate the structure, but obviously you can do something else if you so choose.

    import xml.etree.ElementTree as ET
    import pandas as pd

    ways = """[your xml above, wrapped in a root element]"""
    doc = ET.fromstring(ways)
    rows = []
    cols = ["id", "node_ids", "maxspeed"]
    for way in doc.findall('.//way'):
        way_id = way.attrib['id']
        # every <nd> element under the way contributes one node reference
        node_ids = [nd.attrib['ref'] for nd in way.findall('.//nd')]
        # maxspeed is optional, so fall back to None when the tag is missing
        tag = way.find(".//tag[@k='maxspeed']")
        maxspeed = tag.attrib['v'] if tag is not None else None
        rows.append([way_id, node_ids, maxspeed])
    df = pd.DataFrame(rows, columns=cols)
    print(df)

Output:

              id                                      node_ids  maxspeed
    0    4260867                          [25550395, 25550396]      none
    1  312407268  [7792523927, 25393142, 5583629192, 25393143]        60
    2  106141287                       [913936737, 1222080363]      None
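
If you would rather stay close to the tuples from your question and skip pandas, something along these lines might work. It is only a sketch under a few assumptions: the `<osm>` root element, the `Way` record type, and the `extract_ways` helper are names I made up for illustration, and the sample is a shortened copy of your data. A `NamedTuple` still behaves like a plain tuple but lets you access the fields by name.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional, Tuple

# Shortened sample from the question, wrapped in a root element
# (the <osm> root name is an assumption).
SAMPLE = """<osm>
  <way id="4260867">
    <nd ref="25550395"/>
    <nd ref="25550396"/>
    <tag k="highway" v="motorway_link"/>
    <tag k="maxspeed" v="none"/>
  </way>
  <way id="106141287">
    <nd ref="913936737"/>
    <nd ref="1222080363"/>
    <tag k="surface" v="paving_stones"/>
  </way>
</osm>"""


class Way(NamedTuple):  # hypothetical record type, not part of any library
    id: int
    node_ids: Tuple[int, ...]
    maxspeed: Optional[str]


def extract_ways(root: ET.Element, key: str = "maxspeed") -> list:
    """Build one Way per <way> element; the tag value is None if the key is absent."""
    ways = []
    for way in root.iter("way"):
        node_ids = tuple(int(nd.get("ref")) for nd in way.findall("nd"))
        tag = way.find(f"tag[@k='{key}']")
        value = tag.get("v") if tag is not None else None
        ways.append(Way(int(way.get("id")), node_ids, value))
    return ways


ways = extract_ways(ET.fromstring(SAMPLE))
by_id = {w.id: w for w in ways}  # look up a way by its ID
for w in ways:
    print(tuple(w))
```

Note that the first way yields the string `'none'` (that is what the file contains for its maxspeed tag), while the second yields a real `None` because it has no maxspeed tag at all. For a file on disk you would presumably start from `ET.parse(path).getroot()` instead of `ET.fromstring`.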

Note: this would be somewhat simpler if you use lxml instead of ElementTree, since lxml supports full XPath 1.0 expressions and can select attribute values directly.