Is there a way to use pandas.read_xml() with out a URI/URL for namespaces?-CodePudding

In my XML file [studentinfo.xml] some tags have namespace prefixes, is there a way to loop through the xml file and parse tag content [all sibling and child tags] without defining the URI/URL for namespace?

If you have another way of parsing the xml file not using pandas I am open to any and all solutions.

<?xml version="1.0" encoding="UTF-8"?>
<stu:StudentBreakdown>
<stu:Studentdata>
    <stu:StudentScreening>
        <st:name>Sam Davies</st:name>
        <st:age>15</st:age>
        <st:hair>Black</st:hair>
        <st:eyes>Blue</st:eyes>
        <st:grade>10</st:grade>
        <st:teacher>Draco Malfoy</st:teacher>
        <st:dorm>Innovation Hall</st:dorm>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Cassie Stone</st:name>
        <st:age>14</st:age>
        <st:hair>Science</st:hair>
        <st:grade>9</st:grade>
        <st:teacher>Luna Lovegood</st:teacher>
    </stu:StudentScreening>
    <stu:StudentScreening>
        <st:name>Derek Brandon</st:name>
        <st:age>17</st:age>
        <st:eyes>green</st:eyes>
        <st:teacher>Ron Weasley</st:teacher>
        <st:dorm>Hogtie Manor</st:dorm>
    </stu:StudentScreening>
</stu:Studentdata>
</stu:StudentBreakdown>

below is my code:

import pandas as pd
from bs4 import BeautifulSoup
with open('studentinfo.xml', 'r') as f:
    file = f.read()  

def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    all_items = soup.find_all('info')
    items_length = len(all_items)
    for index, info in enumerate(all_items):
        StudentName = info.find('<st:name>').text
        Age = info.find('<st:age>').text
        Hair = info.find('<st:hair>').text
        Eyes = info.find('<st:eyes>').text
        Grade = info.find('<st:grade>').text
        Teacher = info.find('<st:teacher>').text
        Dorm = info.find('<st:dorm>').text
      row = {
            'StudentName': StudentName,
            'Age': Age,
            'Hair': Hair,
            'Eyes': Eyes,
            'Grade': Grade,
            'Teacher': Teacher,
            'Dorm': Dorm
        }
        
        df1 = df1.append(row, ingore_index=True)
        print(f'Appending row %s of %s' %(index 1, items_length))
    
    return df1

Desired Output:

	Name	age	hair	eyes	grade	teacher	dorm
0	Sam Davies	15	Black	Blue	10	Draco Malfoy	Innovation Hall
1	Cassie Stone	14	Science	N/A	9	Luna Lovegood	N/A
2	Derek Brandon	17	N/A	green	N/A	Ron Weasley	Hogtie Manor

CodePudding user response：

You were about 90% there. I just fixed up a couple of things:

all_items : to find StudentScreening instead of info
info.find() statements : dealt with missing values
pd.concat() : instead of df1.append()
called the function parse_xml at the end

Here is the code:

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# Read in the XML file
with open('studentinfo.xml', 'r') as f:
    file = f.read()  

def parse_xml(file):
    soup = BeautifulSoup(file, 'xml')
    df1 = pd.DataFrame(columns=['StudentName', 'Age', 'Hair', 'Eyes', 'Grade', 'Teacher', 'Dorm'])
    all_items = soup.find_all('StudentScreening')
    for index, info in enumerate(all_items):

        row = {
            'StudentName': info.find('name').text if info.find('name') else np.nan,
            'Age': info.find('age').text if info.find('age') else np.nan,
            'Hair': info.find('hair').text if info.find('hair') else np.nan, 
            'Eyes': info.find('eyes').text if info.find('eyes') else np.nan,
            'Grade': info.find('grade').text if info.find('grade') else np.nan,
            'Teacher': info.find('teacher').text if info.find('teacher') else np.nan,
            'Dorm': info.find('dorm').text if info.find('dorm') else np.nan
        }
        
        df1 = pd.concat([df1, pd.Series(row).to_frame().T], ignore_index=True)
        
    
    return df1  


print(parse_xml(file))

OUTPUT:

     StudentName Age     Hair   Eyes Grade        Teacher             Dorm
0     Sam Davies  15    Black   Blue    10   Draco Malfoy  Innovation Hall
1   Cassie Stone  14  Science    NaN     9  Luna Lovegood              NaN
2  Derek Brandon  17      NaN  green   NaN    Ron Weasley     Hogtie Manor

CodePudding user response：

Here is another answer using an ElementTree:

Import the XML file
Remove the namespace prefixes using regex sub() method
Convert the XML document into an ElementTree
Iterate through the relevant nodes to extract the information needed
Convert to a dataframe with the expected output

Here is the code:

import xml.etree.ElementTree as ET
import re
import pandas as pd

# import the file
with open('studentinfo.xml', 'r') as f:
    file = f.read()  
    
# remove namespace prefixes    
file = re.sub(r'stu?:', '', file)

# Extract the XML into an ElementTree
root = ET.ElementTree(ET.fromstring(file)).getroot()

# translation between XML tags and column names
column_names = {'name': 'StudentName', 
                'age': 'Age', 
                'hair': 'Hair', 
                'eyes': 'Eyes', 
                'grade': 'Grade', 
                'teacher': 'Teacher', 
                'dorm': 'Dorm'}

# Extract the relevant information from the ElementTree
results = []
temp_dict = {}
for node in root.iter():
    if node.tag in column_names.keys():
        temp_dict[column_names[node.tag]] = node.text 
        
    elif len(temp_dict) > 0:
        results.append(temp_dict.copy())
        temp_dict.clear()

results.append(temp_dict.copy())    

# Create a dataframe from the extracted information
df = pd.DataFrame(results)
print(df)

OUTPUT:

     StudentName Age     Hair   Eyes Grade        Teacher             Dorm
0     Sam Davies  15    Black   Blue    10   Draco Malfoy  Innovation Hall
1   Cassie Stone  14  Science    NaN     9  Luna Lovegood              NaN
2  Derek Brandon  17      NaN  green   NaN    Ron Weasley     Hogtie Manor