Finding Exception for KeyErrors for a bs4.element.ResultSet out of an XML-file that I put into a Pan

Time:03-01

I am trying to collect all elements of one tag type from an XML file into a Pandas dataframe, so that I can rearrange them a little and export the result to an Excel file.

Everything works fine, except that some tags are missing attributes. Whenever I try to put those into a list, I get a KeyError that I can't work around with a defaultdict.

So I put all the tags into a list that looks like this:

import pandas as pd
import re
import parse
import requests
from bs4 import BeautifulSoup
from collections import defaultdict
import glob, os
import collections

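# Stand-in for the real bs4.element.ResultSet (html.parser lowercases tag names, hence 'tag').
# The original looks like: [<TAG attr_1="A" attr_2="1" attr_3="01"/>, <TAG attr_1="B" attr_3="02"/>, <TAG attr_1="C" attr_2="3" attr_3="03"/>]
list = BeautifulSoup('<TAG attr_1="A" attr_2="1" attr_3="01"/><TAG attr_1="B" attr_3="02"/><TAG attr_1="C" attr_2="3" attr_3="03"/>', 'html.parser').find_all('tag')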
data = []

for result in list:
    try:
        data.append({
            'column1': result['attr_1'],
            'column2': result['attr_2'],
            'column3': result['attr_3'],
        })
    except KeyError as message:
        print(message)

Since the 2nd tag is missing attr_2, I get a KeyError, the whole row is skipped, and the script prints the message:

'attr_2'

Does anybody have an idea how to handle this?

Since list is not a dictionary (each entry only becomes one inside the for loop), it seems I cannot use defaultdict.

CodePudding user response:

I believe the code below is what you are looking for. The idea is to treat each entry in xml_list as its own XML document, parse it, and read the attributes with .get, which returns None instead of raising a KeyError when an attribute is missing.

import xml.etree.ElementTree as ET

xml_list = ['<TAG attr_1="A" attr_2="1" attr_3="01"/>',
            '<TAG attr_1="B" attr_3="02"/>',
            '<TAG attr_1="C" attr_2="3" attr_3="03"/>']

result = []
for xml in xml_list:
    # Parse each string as its own small XML document
    root = ET.fromstring(xml)
    entry = {}
    for i in range(1, 4):
        # .get() returns None when the attribute is missing, so no KeyError is raised
        entry[f'attr_{i}'] = root.attrib.get(f'attr_{i}', None)
    result.append(entry)

print(result)

output

[{'attr_1': 'A', 'attr_2': '1', 'attr_3': '01'}, {'attr_1': 'B', 'attr_2': None, 'attr_3': '02'}, {'attr_1': 'C', 'attr_2': '3', 'attr_3': '03'}]
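
Since the tags in the question already come from BeautifulSoup, the same .get idea can probably be applied to them directly, without re-parsing each one. This is only a minimal sketch: the sample document is made up, and html.parser is assumed here to avoid extra dependencies (it lowercases tag names, hence find_all("tag")). The column names follow the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for the real XML file
xml = (
    '<root>'
    '<tag attr_1="A" attr_2="1" attr_3="01"/>'
    '<tag attr_1="B" attr_3="02"/>'
    '<tag attr_1="C" attr_2="3" attr_3="03"/>'
    '</root>'
)
tags = BeautifulSoup(xml, "html.parser").find_all("tag")

# Tag.get returns None (or a supplied default) instead of raising KeyError
data = [
    {
        "column1": tag.get("attr_1"),
        "column2": tag.get("attr_2"),
        "column3": tag.get("attr_3"),
    }
    for tag in tags
]
print(data)
```

The resulting list of dicts can then be passed to pd.DataFrame(data) as before; the row for the second tag is kept, with None in column2.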

CodePudding user response:

Thanks, I managed to solve it with the much more convenient XML function in pandas, which is available since version 1.3.0.

It is called pd.read_xml! Pretty cool, and it makes this much easier.
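
For anyone landing here, a minimal sketch of what that might look like. The sample XML, the ROOT wrapper element, and the xpath are assumptions for illustration, not the actual file. parser="etree" is chosen so that lxml is not required.

```python
import io

import pandas as pd

# Hypothetical sample; with a real file you would pass its path instead
xml = """<ROOT>
  <TAG attr_1="A" attr_2="1" attr_3="01"/>
  <TAG attr_1="B" attr_3="02"/>
  <TAG attr_1="C" attr_2="3" attr_3="03"/>
</ROOT>"""

# Each matched TAG becomes one row; attributes become columns,
# and a missing attribute ends up as NaN instead of raising an error
df = pd.read_xml(io.StringIO(xml), xpath=".//TAG", parser="etree")
print(df)
```

Note that pandas infers column types here, so values such as "01" may come back as numbers rather than strings. From there, df.to_excel can write the Excel file, provided an Excel engine such as openpyxl is installed.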
