There may be many identical tags in one branch. How can I save all of them to a DataFrame?
I tried the following code, but repeated tags such as RowData are overwritten by later data. My aim is to keep all of the data.
import pandas as pd
from xml.etree import ElementTree

path = 'data.xml'
with open(path, mode="r", encoding="utf-8") as f:
    xml_file = f.read()

# strip the wrapper tags so that their children become direct children of ItemData
items_delete = ['<ObjectRelation>', '</ObjectRelation>', '<List>', '</List>', '<RowData>', '</RowData>', '<Kind>', '</Kind>']
for item in items_delete:
    xml_file = xml_file.replace(item, '')

df = pd.read_xml(xml_file)
Example of initial data:
<ItemList>
  <ItemData>
    <ObjectRelation>
      <ObjectCadastreNr>01000180062</ObjectCadastreNr>
      <ObjectType>PARCEL</ObjectType>
    </ObjectRelation>
    <List>
      <RowData>
        <Kind>
          <KindId>7312050201</KindId>
          <KindName>ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju</KindName>
        </Kind>
        <Nr>1</Nr>
        <EstablishDate>1997-02-24</EstablishDate>
        <Area>0.0127</Area>
        <Measure>ha</Measure>
      </RowData>
      <RowData>
        <Kind>
          <KindId>7312040200</KindId>
          <KindName>ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju</KindName>
        </Kind>
        <Nr>3</Nr>
        <EstablishDate>1996-01-13</EstablishDate>
      </RowData>
    </List>
  </ItemData>
  <ItemData>
    <ObjectRelation>
      <ObjectCadastreNr>01000180062</ObjectCadastreNr>
      <ObjectType>PARCEL</ObjectType>
    </ObjectRelation>
    <List>
      <RowData>
        <Kind>
          <KindId>7312060100</KindId>
          <KindName>ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi</KindName>
        </Kind>
        <Nr>5</Nr>
        <EstablishDate>1997-01-13</EstablishDate>
      </RowData>
    </List>
  </ItemData>
</ItemList>
CodePudding user response:
You can find the element and then delete it. Because this is XML, you need to find the parent before you can remove the child. The following shows the idea of how this would work; comments are added in the code. Hope this helps!
import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
items_delete = ['ObjectRelation', 'List', 'RowData', 'Kind']
#items_delete = ['ObjectRelation']
for item in items_delete:
    # './/{item}/..' selects every parent that has a child named item
    for parent in tree.findall(f'.//{item}/..'):
        # remove every matching child, not only the first one
        for child in parent.findall(item):
            parent.remove(child)  # the child is removed together with its subtree
tree.write('output.xml')
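A small sketch of what the parent lookup and removal above does, run on a made-up miniature document rather than the full data.xml (the sample string is an assumption for illustration only). It may be worth noting that remove() drops an element together with everything nested inside it, so values such as Nr are no longer in the tree afterwards.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, not the question's full file.
root = ET.fromstring(
    "<ItemData><List><RowData><Nr>1</Nr></RowData>"
    "<RowData><Nr>3</Nr></RowData></List></ItemData>"
)

# './/RowData/..' yields each element that has at least one RowData child.
parents = root.findall(".//RowData/..")
print([p.tag for p in parents])

for parent in parents:
    for child in parent.findall("RowData"):
        parent.remove(child)  # drops the child and its whole subtree

# The nested Nr values are gone along with their RowData parents.
print(ET.tostring(root, encoding="unicode"))
```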
CodePudding user response:
You can try parsing the document with BeautifulSoup:
import pandas as pd
from bs4 import BeautifulSoup
xml_doc = """\
<ItemList>
<ItemData>
<ObjectRelation>
<ObjectCadastreNr>01000180062</ObjectCadastreNr>
<ObjectType>PARCEL</ObjectType>
</ObjectRelation>
<List>
<RowData>
<Kind>
<KindId>7312050201</KindId>
<KindName>ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju</KindName>
</Kind>
<Nr>1</Nr>
<EstablishDate>1997-02-24</EstablishDate>
<Area>0.0127</Area>
<Measure>ha</Measure>
</RowData>
<RowData>
<Kind>
<KindId>7312040200</KindId>
<KindName>ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju</KindName>
</Kind>
<Nr>3</Nr>
<EstablishDate>1996-01-13</EstablishDate>
</RowData>
</List>
</ItemData>
<ItemData>
<ObjectRelation>
<ObjectCadastreNr>01000180062</ObjectCadastreNr>
<ObjectType>PARCEL</ObjectType>
</ObjectRelation>
<List>
<RowData>
<Kind>
<KindId>7312060100</KindId>
<KindName>ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi</KindName>
</Kind>
<Nr>5</Nr>
<EstablishDate>1997-01-13</EstablishDate>
</RowData>
</List>
</ItemData>
</ItemList>"""
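soup = BeautifulSoup(xml_doc, "xml")
all_data = []
# build one row per RowData, so repeated tags are never overwritten
for data in soup.select("RowData"):
    d = {}
    # the nearest preceding ObjectRelation fields belong to the same ItemData
    d["ObjectCadastreNr"] = data.find_previous("ObjectCadastreNr").text.strip()
    d["ObjectType"] = data.find_previous("ObjectType").text.strip()
    # every non-empty text node becomes a column named after its parent tag
    for t in data.find_all(string=True):
        if t.strip() == "":
            continue
        d[t.parent.name] = t.strip()
    all_data.append(d)
df = pd.DataFrame(all_data)
print(df)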
Prints:
ObjectCadastreNr ObjectType KindId KindName Nr EstablishDate Area Measure
0 01000180062 PARCEL 7312050201 ekspluatācijas aizsargjoslas teritorija gar elektrisko tīklu kabeļu līniju 1 1997-02-24 0.0127 ha
1 01000180062 PARCEL 7312040200 ekspluatācijas aizsargjoslas teritorija gar elektronisko sakaru tīklu gaisvadu līniju 3 1996-01-13 NaN NaN
2 01000180062 PARCEL 7312060100 ekspluatācijas aizsargjoslas teritorija gar pazemes siltumvadu, siltumapgādes iekārtu un būvi 5 1997-01-13 NaN NaN
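If installing BeautifulSoup is not an option, the same one-row-per-RowData idea can probably be sketched with only the standard library's ElementTree plus pandas. This is a minimal sketch, not a drop-in solution: the sample document is shortened and its KindName values ("name A", "name B") are placeholders, and it assumes every ItemData has exactly one ObjectRelation.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened, hypothetical sample in the same shape as the question's data.
xml_doc = """\
<ItemList>
  <ItemData>
    <ObjectRelation>
      <ObjectCadastreNr>01000180062</ObjectCadastreNr>
      <ObjectType>PARCEL</ObjectType>
    </ObjectRelation>
    <List>
      <RowData>
        <Kind><KindId>7312050201</KindId><KindName>name A</KindName></Kind>
        <Nr>1</Nr>
        <Area>0.0127</Area>
      </RowData>
      <RowData>
        <Kind><KindId>7312040200</KindId><KindName>name B</KindName></Kind>
        <Nr>3</Nr>
      </RowData>
    </List>
  </ItemData>
</ItemList>"""

root = ET.fromstring(xml_doc)
rows = []
for item in root.iter("ItemData"):
    # Fields shared by every RowData of this ItemData.
    shared = {el.tag: el.text.strip() for el in item.find("ObjectRelation")}
    for row_data in item.iter("RowData"):
        row = dict(shared)
        # iter() also walks nested elements, so Kind's children are included;
        # elements without real text (RowData, Kind) are skipped.
        for el in row_data.iter():
            if el.text and el.text.strip():
                row[el.tag] = el.text.strip()
        rows.append(row)

# Tags missing from a RowData (here Area in the second one) become NaN.
df = pd.DataFrame(rows)
print(df)
```

All values stay strings here; converting columns such as Area or EstablishDate to numeric or date types would be a separate step.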