I am trying to read all elements of one tag type from an XML file into a pandas DataFrame, so that I can rearrange them a little and export the result to an Excel file.
Everything works fine, except that some tags are missing attributes. Whenever I try to collect those into a list, I get a KeyError, which I can't work around with a defaultdict.
I put all the tags into a list that looks like this:
import pandas as pd
import re
import parse
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
import glob, os
import collections
# The tags as they appear when printed (this line is not literal Python syntax)
list = [<TAG attr_1="A" attr_2="1" attr_3="01"/>,<TAG attr_1="B" attr_3="02"/>,<TAG attr_1="C" attr_2="3" attr_3="03"/>]
data = []
for result in list:
    try:
        data.append({
            'column1': result['attr_1'],
            'column2': result['attr_2'],
            'column3': result['attr_3'],
        })
    except KeyError as message:
        print(message)
Since the second tag is missing attr_2, I get a KeyError and the script prints the message:
'attr_2'
Does anybody have an idea how to handle this?
Since list is not a dictionary, and each element only behaves like one inside the for loop, it seems I cannot use defaultdict.
CodePudding user response:
I believe the code below is what you are looking for. The idea is to treat each entry in xml_list
as an XML document, parse it, and read its attributes, falling back to None for any that are missing.
import xml.etree.ElementTree as ET
xml_list = ['<TAG attr_1="A" attr_2="1" attr_3="01"/>',
            '<TAG attr_1="B" attr_3="02"/>',
            '<TAG attr_1="C" attr_2="3" attr_3="03"/>']
result = []
for xml in xml_list:
    root = ET.fromstring(xml)
    entry = {}
    for i in range(1,4):
        entry[f'attr_{i}'] = root.attrib.get(f'attr_{i}',None)
    result.append(entry)
print(result)
Output:
[{'attr_1': 'A', 'attr_2': '1', 'attr_3': '01'}, {'attr_1': 'B', 'attr_2': None, 'attr_3': '02'}, {'attr_1': 'C', 'attr_2': '3', 'attr_3': '03'}]
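If you want the column names from the question (column1, column2, column3) instead of the raw attribute names, the same .get() idea can be driven by a mapping. This is only a sketch: COLUMN_MAP and tag_to_row are names I made up for illustration, and the sample strings are the ones from above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML attribute names to the desired column names.
COLUMN_MAP = {"attr_1": "column1", "attr_2": "column2", "attr_3": "column3"}

xml_list = [
    '<TAG attr_1="A" attr_2="1" attr_3="01"/>',
    '<TAG attr_1="B" attr_3="02"/>',
    '<TAG attr_1="C" attr_2="3" attr_3="03"/>',
]

def tag_to_row(xml_text):
    """Parse one tag and return a row dict; missing attributes become None."""
    attrib = ET.fromstring(xml_text).attrib
    # dict.get returns None instead of raising KeyError for a missing key.
    return {column: attrib.get(name) for name, column in COLUMN_MAP.items()}

rows = [tag_to_row(xml_text) for xml_text in xml_list]
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows). If you already have BeautifulSoup tags, as in the question, their .get() method should work the same way, e.g. result.get('attr_2') in place of result['attr_2'].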
CodePudding user response:
Thanks, I managed to solve it by using pandas' built-in XML reader, which has been available since version 1.3.0.
It is called pd.read_xml. Pretty cool, and it makes this much easier.
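For anyone finding this later, here is roughly what that can look like. It is a minimal sketch, not my exact script: the ROOT wrapper element and the inline sample document are assumptions for illustration, and parser="etree" is chosen only so that lxml is not required.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample document; the <ROOT> wrapper is an assumption,
# the <TAG> elements are the ones from the question.
xml_text = """<ROOT>
  <TAG attr_1="A" attr_2="1" attr_3="01"/>
  <TAG attr_1="B" attr_3="02"/>
  <TAG attr_1="C" attr_2="3" attr_3="03"/>
</ROOT>"""

# Each <TAG> element becomes one row and each attribute one column;
# an attribute that is missing on a tag ends up as a missing value (NaN).
df = pd.read_xml(StringIO(xml_text), xpath=".//TAG", parser="etree")
print(df)
```

Note that pandas infers column types here, so values such as "01" may come back as numbers rather than strings; it is worth checking df.dtypes before exporting with df.to_excel.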