I have the following code that parses an XML file to produce a pandas DataFrame. The XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>
And my code is below:
from bs4 import BeautifulSoup
import pandas as pd

fd = open("file_120123.xml", 'r')
data = fd.read()
Bs_data = BeautifulSoup(data, 'xml')
ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try:
    Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
    Cat = ''
CatDict = {
    "ENG": "English",
    "MAT": "Mathematics"
}
dataDf = []
for i in range(0, len(ID)):
    if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
    rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
    dataDf.append(rows)
df = pd.DataFrame(dataDf, columns=['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')
As you can see, the code reads an XML file called 'file_120123.xml' using the BeautifulSoup library and extracts each of the elements present in the file. One of the elements is a key, and I have created a dictionary listing all possible keys. Not all parents have that element. I want to look up each extracted key in the dictionary and replace it with the corresponding value.
With this code, I get the error IndexError: list index out of range on Cat[i], on the line if (Cat[i] == CatDict):. Any insights on how to resolve this?
CodePudding user response:
If you just want to avoid raising the error, add a conditional break:
for i in range(0, len(ID)):
    if not i < len(Cat): break  ## <-- break loop if length of Cat is exceeded
    if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
    rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
    dataDf.append(rows)
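Note that the break only hides the error: every row after the point where Cat runs out is dropped, and the comparison of Cat[i] with the whole dictionary never succeeds, so no code is ever replaced. The underlying cause is that find_all returns three independent flat lists, so Cat[i] stops lining up with ID[i] as soon as an entry has zero or several CategoryOfEntry children. Below is a minimal sketch of a per-entry loop instead. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup only to stay dependency-free; the third entry (with no category) and the names SAMPLE, CAT_NAMES and parse_entries are made up for illustration, and joining several categories with "; " is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the third entry has no CategoryOfEntry at all.
SAMPLE = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>1</EntryID>
    <EntryTitle>Entry without a category</EntryTitle>
  </EntrySynopsisDetail_1_0>
</Entries>"""

CAT_NAMES = {"ENG": "English", "MAT": "Mathematics"}


def parse_entries(xml_text):
    """Return one [id, title, category] row per entry element."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("EntrySynopsisDetail_1_0"):
        # findtext returns the default when the child element is missing
        entry_id = entry.findtext("EntryID", default="")
        title = entry.findtext("EntryTitle", default="")
        # Collect only this entry's own categories; the list may be empty
        codes = [c.text for c in entry.findall("CategoryOfEntry") if c.text]
        # Codes that are not in the dictionary are kept unchanged
        names = [CAT_NAMES.get(code, code) for code in codes]
        rows.append([entry_id, title, "; ".join(names)])
    return rows


rows = parse_entries(SAMPLE)
for row in rows:
    print(row)
```

The rows can then be passed to pd.DataFrame(rows, columns=['ID', 'Title', 'Category']) as in the question. With BeautifulSoup the same shape applies: loop over Bs_data.find_all('EntrySynopsisDetail_1_0') and call find / find_all on each entry rather than on the whole document.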
CodePudding user response:
First, as to why lxml is better than BeautifulSoup for XML, the answer is simple: the best way to query XML is with XPath. lxml supports XPath (though only version 1.0; for more complex XML and queries you will need XPath 2.0 to 3.1 and a library like elementpath). BeautifulSoup doesn't support XPath, though it does have excellent support for CSS selectors, which work better with HTML.
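To illustrate the XPath point, here is a small sketch with lxml, assuming it is installed. The document is the sample from the question; the two queries are only illustrative choices, not something the question asked for.

```python
from lxml import etree

# Sample document from the question (bytes, because it carries an encoding declaration)
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""

root = etree.fromstring(xml_bytes)

# Titles of every entry that has at least one MAT category
mat_titles = root.xpath(
    '//EntrySynopsisDetail_1_0[CategoryOfEntry="MAT"]/EntryTitle/text()'
)

# Number of CategoryOfEntry children per entry; XPath 1.0 count() yields a float
counts = [
    entry.xpath("count(CategoryOfEntry)")
    for entry in root.xpath("//EntrySynopsisDetail_1_0")
]

print(mat_titles)
print(counts)
```

The predicate in square brackets filters parents by their children in a single expression, which is the kind of query that takes an explicit loop with find_all.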
Having said all that, in your particular case you probably don't need to call lxml directly either (pandas.read_xml uses it as its default parser): pandas and a one-liner are enough. You haven't shown your expected output, but my guess is that you expect the output below. Note that your sample XML probably contains an error: the 2nd <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:
from io import StringIO
import pandas as pd

entries = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""
# Wrap the string in StringIO: passing literal XML text to read_xml is deprecated in recent pandas
pd.read_xml(StringIO(entries), xpath="//EntrySynopsisDetail_1_0")
Output:
   EntryID                                        EntryTitle CategoryOfEntry
0   262148  Establishment of the Graduate Internship Program             ENG
1  2667654                         Call for Mobility Program             MAT
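To get from the category codes to the names the question asks for, one option is Series.map with the dictionary from the question. This is a sketch under a few assumptions: every entry has at most one CategoryOfEntry (as in the trimmed sample above), the new column name Category is my own choice, and parser="etree" with a relative XPath is used only so the example does not need lxml.

```python
from io import StringIO

import pandas as pd

entries = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""

# Dictionary from the question
CatDict = {"ENG": "English", "MAT": "Mathematics"}

# The standard-library parser needs a relative path, hence the leading dot
df = pd.read_xml(
    StringIO(entries), xpath=".//EntrySynopsisDetail_1_0", parser="etree"
)

# Look up each code in CatDict; codes that are absent from it become NaN
df["Category"] = df["CategoryOfEntry"].map(CatDict)

print(df[["EntryID", "EntryTitle", "Category"]])
```

From there, df.to_csv('120123.csv', index=False) writes the result to the file named in the question.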