How to open XML with many branches?

There may be many identical tags in one branch. How can I save all of them in a DataFrame?

I tried the following code, but repeated tags such as RowData are overwritten by later data. My aim is to keep all of the data.

import pandas as pd
from xml.etree import ElementTree

path=str('data.xml')

with open(path, mode="r", encoding="utf-8") as f:
    xml_file = f.read()

items_delete=['<ObjectRelation>','</ObjectRelation>','<List>','</List>','<RowData>','</RowData>','<Kind>','</Kind>']

for item in items_delete:
    xml_file=xml_file.replace(item, '')

df = pd.read_xml(xml_file)

Example of initial data:

<ItemList>
    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>
                <Kind>
                    <KindId>7312050201</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju</KindName>
                </Kind>
                <Nr>1</Nr>
                <EstablishDate>1997-02-24</EstablishDate>
                <Area>0.0127</Area>
                <Measure>ha</Measure>
            </RowData>
            <RowData>
                <Kind>
                    <KindId>7312040200</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju</KindName>
                </Kind>
                <Nr>3</Nr>
                <EstablishDate>1996-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>

    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>

                <Kind>
                    <KindId>7312060100</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi</KindName>
                </Kind>
                <Nr>5</Nr>
                <EstablishDate>1997-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>
</ItemList>

CodePudding user response：

You can find each element and then delete it. Because this is XML, you need to find the parent before you can remove the child. The code below sketches the idea, with comments explaining each step. Note that removing an element also removes everything nested inside it. Hope this helps!

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')

items_delete = ['ObjectRelation', 'List', 'RowData', 'Kind']
for item in items_delete:
    # './/<tag>/..' selects every parent that has a <tag> child
    for parent in tree.findall(f'.//{item}/..'):
        # findall (not find) so that repeated children such as RowData are all removed
        for child in parent.findall(f'./{item}'):
            parent.remove(child)  # also drops everything nested inside the child
tree.write('output.xml', encoding='utf-8')
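
If the goal is to keep the data rather than delete the wrapper elements, another option with the same ElementTree module is to walk the tree and build one row per RowData. The following is only a minimal sketch, assuming the structure shown in the question. leaf_values is a hypothetical helper name, and the sample is shortened with placeholder KindName values.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened sample with the same nesting as the question's data.
# The KindName values are placeholders, not the real names.
xml_doc = """\
<ItemList>
    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>
                <Kind>
                    <KindId>7312050201</KindId>
                    <KindName>first kind</KindName>
                </Kind>
                <Nr>1</Nr>
                <Area>0.0127</Area>
            </RowData>
            <RowData>
                <Kind>
                    <KindId>7312040200</KindId>
                    <KindName>second kind</KindName>
                </Kind>
                <Nr>3</Nr>
            </RowData>
        </List>
    </ItemData>
</ItemList>"""


def leaf_values(element):
    """Map tag name -> stripped text for every descendant without children."""
    return {
        node.tag: (node.text or "").strip()
        for node in element.iter()
        if len(node) == 0
    }


root = ET.fromstring(xml_doc)  # for a file: ET.parse("data.xml").getroot()

rows = []
for item in root.iter("ItemData"):
    # Values shared by every RowData inside this ItemData.
    relation = item.find("ObjectRelation")
    shared = leaf_values(relation) if relation is not None else {}
    for row_data in item.iter("RowData"):
        # One dict per RowData, so repeated tags never overwrite each other.
        rows.append({**shared, **leaf_values(row_data)})

df = pd.DataFrame(rows)
print(df)
```

Tags that are missing from a RowData (Area in the second row here) simply become NaN in the DataFrame.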

CodePudding user response：

You can try parsing the document with BeautifulSoup:

import pandas as pd
from bs4 import BeautifulSoup

xml_doc = """\
<ItemList>
    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>
                <Kind>
                    <KindId>7312050201</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju</KindName>
                </Kind>
                <Nr>1</Nr>
                <EstablishDate>1997-02-24</EstablishDate>
                <Area>0.0127</Area>
                <Measure>ha</Measure>
            </RowData>
            <RowData>
                <Kind>
                    <KindId>7312040200</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju</KindName>
                </Kind>
                <Nr>3</Nr>
                <EstablishDate>1996-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>

    <ItemData>
        <ObjectRelation>
            <ObjectCadastreNr>01000180062</ObjectCadastreNr>
            <ObjectType>PARCEL</ObjectType>
        </ObjectRelation>
        <List>
            <RowData>

                <Kind>
                    <KindId>7312060100</KindId>
                    <KindName>ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi</KindName>
                </Kind>
                <Nr>5</Nr>
                <EstablishDate>1997-01-13</EstablishDate>
            </RowData>
        </List>
    </ItemData>
</ItemList>"""

soup = BeautifulSoup(xml_doc, "xml")

all_data = []
for data in soup.select("RowData"):
    d = {}
    d["ObjectCadastreNr"] = data.find_previous("ObjectCadastreNr").text.strip()
    d["ObjectType"] = data.find_previous("ObjectType").text.strip()

    for t in data.find_all(string=True):  # "string" replaces the deprecated "text" argument
        if t.strip() == "":
            continue
        d[t.parent.name] = t.strip()

    all_data.append(d)

df = pd.DataFrame(all_data)
print(df)

Prints:

  ObjectCadastreNr ObjectType      KindId                                                                                       KindName Nr EstablishDate    Area Measure
0      01000180062     PARCEL  7312050201                     ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju  1    1997-02-24  0.0127      ha
1      01000180062     PARCEL  7312040200          ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju  3    1996-01-13     NaN     NaN
2      01000180062     PARCEL  7312060100  ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi  5    1997-01-13     NaN     NaN
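
Every value collected this way is text, so the columns hold strings. If numeric or date types are needed, they can be converted afterwards. This is a small sketch that assumes the column names from the output above and rebuilds two of its rows by hand, with KindId, KindName and a few other columns left out for brevity.

```python
import pandas as pd

# Two rows copied from the printed output above, with fewer columns.
df = pd.DataFrame(
    [
        {"ObjectCadastreNr": "01000180062", "Nr": "1", "EstablishDate": "1997-02-24", "Area": "0.0127"},
        {"ObjectCadastreNr": "01000180062", "Nr": "3", "EstablishDate": "1996-01-13"},
    ]
)

df["Nr"] = pd.to_numeric(df["Nr"])
df["Area"] = pd.to_numeric(df["Area"])  # missing values stay NaN
df["EstablishDate"] = pd.to_datetime(df["EstablishDate"], format="%Y-%m-%d")
# ObjectCadastreNr is left as text so that its leading zero is preserved.

print(df.dtypes)
```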