How to avoid running into IndexError: list index out of range error if an element is nonexistent whi

I have the following code that parses an XML file into a pandas DataFrame. The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<Entries>
    <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
    </EntrySynopsisDetail_1_0>
    <EntrySynopsisDetail_1_0>
        <EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
        <CategoryOfEntry>MAT</CategoryOfEntry>
    </EntrySynopsisDetail_1_0>
</Entries>

And my code is below:

from bs4 import BeautifulSoup
import pandas as pd 

fd = open("file_120123.xml",'r')
data = fd.read()

Bs_data = BeautifulSoup(data,'xml')

ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try: 
   Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
   Cat = ''

CatDict = {
    "ENG":"English",
    "MAT" :"Mathematics"
}

dataDf = []
for i in range(0,len(ID)):
      if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
       
      rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
      dataDf.append(rows)
    
df = pd.DataFrame(dataDf, columns =['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')

As you can see, the code reads an XML file called 'file_120123.xml' using the BeautifulSoup library and collects each of the elements present in the file. One of the elements, CategoryOfEntry, holds a key, and I have created a dictionary listing all possible keys. Not every parent has that element. I want to look up each extracted key in the dictionary and replace it with the corresponding value.

With this code, I get IndexError: list index out of range at Cat[i], on the line if (Cat[i] == CatDict):. Any insights on how to resolve this?

CodePudding user response：

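If you just want to avoid raising the error, add a conditional break. Note that this only stops the loop early: categories are still paired with IDs by position, so they can land on the wrong row when an entry has no category or more than one.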

for i in range(len(ID)):
    if not i < len(Cat): break  ## <-- break loop if length of Cat is exceeded

    code = Cat[i].get_text()
    # look the code up in CatDict; fall back to the raw code if it is not listed
    category = CatDict.get(code, code)

    rows = [ID[i].get_text(), Title[i].get_text(), category]
    dataDf.append(rows)
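
If the categories also need to stay on the right row, one option is to loop over the parent entries and search for the category inside each one, so a missing element gives an empty result instead of shifting the indexes. The following is only a sketch of that idea. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, the third entry (ID 999) is a made-up example of a parent without a category, and joining multiple categories with a comma is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET

# Sample data: the second entry has two categories; the third entry
# (hypothetical, not in the original file) has none.
xml_text = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>999</EntryID>
    <EntryTitle>Hypothetical entry without a category</EntryTitle>
  </EntrySynopsisDetail_1_0>
</Entries>"""

CatDict = {"ENG": "English", "MAT": "Mathematics"}

rows = []
root = ET.fromstring(xml_text)
for entry in root.iter("EntrySynopsisDetail_1_0"):
    # findall only looks at this entry's children, so an entry without
    # a category simply produces an empty list.
    codes = [(c.text or "").strip() for c in entry.findall("CategoryOfEntry")]
    # Translate each code, keeping the raw code if it is not in CatDict.
    names = [CatDict.get(code, code) for code in codes]
    rows.append([
        entry.findtext("EntryID"),
        entry.findtext("EntryTitle"),
        ", ".join(names),  # assumption: several categories share one cell
    ])

for row in rows:
    print(row)
```

With BeautifulSoup the same approach would be Bs_data.find_all('EntrySynopsisDetail_1_0') followed by entry.find_all('CategoryOfEntry') on each result.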

CodePudding user response：

First, as to why lxml is better than BeautifulSoup for XML, the answer is simple: the best way to query XML is with XPath. lxml supports XPath (though only version 1.0; for more complex XML and queries you will need XPath 2.0 to 3.1 and a library like elementpath). BeautifulSoup doesn't support XPath, though it does have excellent support for CSS selectors, which work better with HTML.
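
To give a feel for what an XPath query buys you, here is a small sketch that selects only the entries containing a given category. It deliberately uses the standard library's xml.etree.ElementTree, which understands a limited XPath subset, so that it runs without extra packages. With lxml, the equivalent call should be root.xpath("//EntrySynopsisDetail_1_0[CategoryOfEntry='MAT']"). The trimmed sample XML is an assumption made for brevity.

```python
import xml.etree.ElementTree as ET

xml_text = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""

root = ET.fromstring(xml_text)

# The predicate keeps only entries that have a CategoryOfEntry child
# whose text is exactly "MAT".
mat_entries = root.findall(".//EntrySynopsisDetail_1_0[CategoryOfEntry='MAT']")
mat_ids = [entry.findtext("EntryID") for entry in mat_entries]
print(mat_ids)
```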

Having said all that, in your particular case you probably don't need to call lxml directly either (pandas.read_xml uses it as its default parser): only pandas and a one-liner! Though you haven't shown your expected output, my guess is that you expect the output below. Note that your sample XML probably contains an error: the second <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:

entries = """<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
  
<EntrySynopsisDetail_1_0>
        <EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>"""

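from io import StringIO
import pandas as pd

# Recent pandas versions deprecate passing literal XML text, so wrap it in StringIO
pd.read_xml(StringIO(entries), xpath="//EntrySynopsisDetail_1_0")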

Output:

   EntryID                                        EntryTitle CategoryOfEntry
0   262148  Establishment of the Graduate Internship Program             ENG
1  2667654                         Call for Mobility Program             MAT
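
The question also asks for the category codes to be replaced by their dictionary values. One way to do that on the resulting DataFrame is Series.map, sketched below. The DataFrame here is built by hand as a stand-in for the pd.read_xml result, and the new column name Category is an assumption borrowed from the question's code.

```python
import pandas as pd

# Stand-in for the DataFrame returned by pd.read_xml above.
df = pd.DataFrame({
    "EntryID": [262148, 2667654],
    "EntryTitle": [
        "Establishment of the Graduate Internship Program",
        "Call for Mobility Program",
    ],
    "CategoryOfEntry": ["ENG", "MAT"],
})

CatDict = {"ENG": "English", "MAT": "Mathematics"}

# map() yields NaN for codes missing from CatDict;
# fillna() then falls back to the original code.
df["Category"] = df["CategoryOfEntry"].map(CatDict).fillna(df["CategoryOfEntry"])
print(df[["EntryID", "Category"]])
```

Entries with no category at all would show up as NaN in CategoryOfEntry and stay NaN after the mapping.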