I am able to parse a large XML file; since I am having memory issues I am using a SAX parser. I am using XMLGenerator to split the XML and then parsing the chunks again. My question: is there a way to parse a large XML part by part, so that, for example, once the first 10,000 records are parsed I can load them into a CSV file or a dataframe? That way I would avoid re-parsing the same chunks.
import xml.sax
from collections import defaultdict
import pandas as pd
from sqlalchemy import create_engine

class EmployeeData(xml.sax.ContentHandler):
    def __init__(self):
        self.employee_dict = defaultdict(list)

    def startElement(self, tag, attr):
        # Remember the current tag and capture the id attribute of each Employee
        self.tag = tag
        if tag == 'Employee':
            self.employee_dict['Employee_ID'].append(attr['id'])

    def characters(self, content):
        # Buffer the text of the element we are currently inside
        if content.strip():
            if self.tag == 'FName': self.FName = content
            elif self.tag == 'LName': self.LName = content
            elif self.tag == 'City': self.City = content

    def endElement(self, tag):
        # On the closing tag, append the buffered value to the dict of lists
        if tag == 'FName': self.employee_dict['FName'].append(self.FName)
        elif tag == 'LName': self.employee_dict['LName'].append(self.LName)
        elif tag == 'City': self.employee_dict['City'].append(self.City)

handler = EmployeeData()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse('employee_xml.xml')

EmployeeDetails = parser.getContentHandler()
EmployeeData_out = EmployeeDetails.employee_dict
df = pd.DataFrame(EmployeeData_out).set_index('Employee_ID')

# For example I am writing to a CSV file here; in reality I will be loading the data into a database table.
# I want to load the data incrementally by parsing a certain number of records at a time, for example 10,000 records at a time.
##con_eng = create_engine('oracle://[user]:[pass]@[host]:[port]/[schema]', echo=False)
##df.to_sql(name='target_table', con=con_eng, if_exists='append', index=False)
df.to_csv('employee_details.csv', sep=',', encoding='utf-8')
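Roughly what I have in mind is flushing the accumulated records from inside the handler every N employees instead of holding everything in memory. This is only a sketch of the idea, not working code I have verified: ChunkedEmployeeData, BATCH_SIZE and flush_batch are illustrative names I made up, and the CSV append could just as well be a df.to_sql(..., if_exists='append') call.

import xml.sax
from collections import defaultdict
import pandas as pd

BATCH_SIZE = 10000  # illustrative: flush after this many Employee elements

class ChunkedEmployeeData(xml.sax.ContentHandler):
    def __init__(self):
        self.employee_dict = defaultdict(list)
        self.count = 0
        self.header_written = False

    def startElement(self, tag, attr):
        self.tag = tag
        if tag == 'Employee':
            self.employee_dict['Employee_ID'].append(attr['id'])

    def characters(self, content):
        if content.strip() and self.tag in ('FName', 'LName', 'City'):
            setattr(self, self.tag, content)

    def endElement(self, tag):
        if tag in ('FName', 'LName', 'City'):
            self.employee_dict[tag].append(getattr(self, tag))
        elif tag == 'Employee':
            self.count += 1
            if self.count % BATCH_SIZE == 0:
                self.flush_batch()

    def endDocument(self):
        # Write whatever is left over after the last full batch
        if self.employee_dict:
            self.flush_batch()

    def flush_batch(self):
        # Build a frame from the current batch and append it to the output,
        # then reset the dict so memory use stays bounded
        df = pd.DataFrame(self.employee_dict).set_index('Employee_ID')
        df.to_csv('employee_details.csv', mode='a', header=not self.header_written)
        self.header_written = True
        self.employee_dict = defaultdict(list)

parser = xml.sax.make_parser()
parser.setContentHandler(ChunkedEmployeeData())
parser.parse('employee_xml.xml')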
Sample XML
<?xml version="1.0" ?>
<Employees>
  <Employee id="1">
    <FName>SAM</FName>
    <LName>MARK</LName>
    <City>NewJersy</City>
  </Employee>
  <Employee id="2">
    <FName>RAJ</FName>
    <LName>KAMAL</LName>
    <City>NewYork</City>
  </Employee>
  <Employee id="3">
    <FName>Brain</FName>
    <LName>wood</LName>
    <City>Buffalo</City>
  </Employee>
:
:
:
:
:
  <Employee id="1000000">
    <FName>Mark</FName>
    <LName>wood</LName>
    <City>NewJersy</City>
  </Employee>
</Employees>
CodePudding user response:
Given your XML is pretty shallow, consider the new v1.5 iterparse support for large XML in pandas.read_xml, available with either the lxml (default) or etree parser.
employees_df = (
    pd.read_xml(
        "Input.xml",
        iterparse={"Employee": ["id", "FName", "LName", "City"]},
        names=["Employee_ID", "FName", "LName", "City"],
        parser="etree"
    ).set_index("Employee_ID")
)
(In testing, this parsed Wikipedia's 12 GB daily article dump XML into a 3.5-million-row data frame in 5 minutes on a laptop with 8 GB of RAM. See the output in the docs.)
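If the final target is the database table from your commented-out code, DataFrame.to_sql can also write the result in batches through its chunksize parameter, so the insert itself happens incrementally. A sketch only, reusing the connection-string placeholders and target_table name from your question:

from sqlalchemy import create_engine

con_eng = create_engine('oracle://[user]:[pass]@[host]:[port]/[schema]', echo=False)

# chunksize writes rows in batches of 10,000 instead of one large insert;
# index=True keeps the Employee_ID index as a column in the table
employees_df.to_sql(name='target_table', con=con_eng, if_exists='append',
                    index=True, chunksize=10000)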
CodePudding user response:
Try the following PowerShell:
using assembly System.Xml
using assembly System.Xml.Linq

$FILENAME = "c:\temp\test.xml"

$reader = [System.Xml.XmlReader]::Create($FILENAME)

while ($reader.EOF -eq $False)
{
    if ($reader.Name -ne "Employee")
    {
        $reader.ReadToFollowing("Employee")
    }
    if ($reader.EOF -eq $False)
    {
        # Materialize only the current Employee element, then move the reader past it
        $employee = [System.Xml.Linq.XElement]::ReadFrom($reader)

        $id = $employee.Attribute("id").Value
        $fname = $employee.Element("FName").Value
        $lname = $employee.Element("LName").Value
        $city = $employee.Element("City").Value

        Write-Output "id = $id first = $fname last = $lname city = $city"
    }
}
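For reference, the same record-at-a-time streaming can be sketched in Python with the standard library's xml.etree.ElementTree.iterparse (this is an aside, not part of the PowerShell approach above; the file name is taken from the question):

import xml.etree.ElementTree as ET

# Pull-parse one <Employee> at a time; clearing processed elements from the
# root keeps memory use flat even for very large files.
context = ET.iterparse('employee_xml.xml', events=('start', 'end'))
_, root = next(context)                      # grab the root element first

for event, elem in context:
    if event == 'end' and elem.tag == 'Employee':
        emp_id = elem.get('id')
        fname = elem.findtext('FName')
        lname = elem.findtext('LName')
        city = elem.findtext('City')
        print(f"id = {emp_id} first = {fname} last = {lname} city = {city}")
        root.clear()                         # drop already-processed elements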