I am able to parse a large XML file; since I am having memory issues I am using a SAX parser. I am using XMLGenerator to split the XML and then parsing the chunks again. My question: is there a way to parse a large XML part by part, so that, for example, once the first 10,000 records are parsed I can load them into a CSV file or a dataframe? That way I would avoid re-parsing the same chunks.
import xml.sax
from collections import defaultdict
import pandas as pd
from sqlalchemy import create_engine

class EmployeeData(xml.sax.ContentHandler):
    def __init__(self):
        self.employee_dict = defaultdict(list)

    def startElement(self, tag, attr):
        # Remember the current tag and capture the id attribute of each Employee
        self.tag = tag
        if tag == 'Employee':
            self.employee_dict['Employee_ID'].append(attr['id'])

    def characters(self, content):
        # Buffer the text of the element we are currently inside
        if content.strip():
            if self.tag == 'FName': self.FName = content
            elif self.tag == 'LName': self.LName = content
            elif self.tag == 'City': self.City = content

    def endElement(self, tag):
        # On the closing tag, append the buffered value to the dict of lists
        if tag == 'FName': self.employee_dict['FName'].append(self.FName)
        elif tag == 'LName': self.employee_dict['LName'].append(self.LName)
        elif tag == 'City': self.employee_dict['City'].append(self.City)

handler = EmployeeData()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse('employee_xml.xml')

EmployeeDetails = parser.getContentHandler()
EmployeeData_out = EmployeeDetails.employee_dict
df = pd.DataFrame(EmployeeData_out).set_index('Employee_ID')

# For example I am writing to a CSV file here; in reality I will be loading the data into a database table.
# I want to load the data incrementally by parsing a certain number of records at a time, for example 10,000 records at a time.
##con_eng = create_engine('oracle://[user]:[pass]@[host]:[port]/[schema]', echo=False)
##df.to_sql(name='target_table', con=con_eng, if_exists='append', index=False)
df.to_csv('employee_details.csv', sep=',', encoding='utf-8')
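Roughly what I have in mind is flushing the accumulated records from inside the handler every N employees instead of holding everything in memory. This is only a sketch of the idea, not working code I have verified: ChunkedEmployeeData, BATCH_SIZE and flush_batch are illustrative names I made up, and the CSV append could just as well be a df.to_sql(..., if_exists='append') call.

import xml.sax
from collections import defaultdict
import pandas as pd

BATCH_SIZE = 10000  # illustrative: flush after this many Employee elements

class ChunkedEmployeeData(xml.sax.ContentHandler):
    def __init__(self):
        self.employee_dict = defaultdict(list)
        self.count = 0
        self.header_written = False

    def startElement(self, tag, attr):
        self.tag = tag
        if tag == 'Employee':
            self.employee_dict['Employee_ID'].append(attr['id'])

    def characters(self, content):
        if content.strip() and self.tag in ('FName', 'LName', 'City'):
            setattr(self, self.tag, content)

    def endElement(self, tag):
        if tag in ('FName', 'LName', 'City'):
            self.employee_dict[tag].append(getattr(self, tag))
        elif tag == 'Employee':
            self.count += 1
            if self.count % BATCH_SIZE == 0:
                self.flush_batch()

    def endDocument(self):
        # Write whatever is left over after the last full batch
        if self.employee_dict:
            self.flush_batch()

    def flush_batch(self):
        # Build a frame from the current batch and append it to the output,
        # then reset the dict so memory use stays bounded
        df = pd.DataFrame(self.employee_dict).set_index('Employee_ID')
        df.to_csv('employee_details.csv', mode='a', header=not self.header_written)
        self.header_written = True
        self.employee_dict = defaultdict(list)

parser = xml.sax.make_parser()
parser.setContentHandler(ChunkedEmployeeData())
parser.parse('employee_xml.xml')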
Sample XML
<?xml version="1.0" ?>
<Employees>
  <Employee id="1">
    <FName>SAM</FName>
    <LName>MARK</LName>
    <City>NewJersy</City>
  </Employee>
  <Employee id="2">
    <FName>RAJ</FName>
    <LName>KAMAL</LName>
    <City>NewYork</City>
  </Employee>
  <Employee id="3">
    <FName>Brain</FName>
    <LName>wood</LName>
    <City>Buffalo</City>
  </Employee>
:
:
:
:
:
  <Employee id="1000000">
    <FName>Mark</FName>
    <LName>wood</LName>
    <City>NewJersy</City>
  </Employee>
</Employees>
CodePudding user response:
Given your XML is pretty shallow, consider the new v1.5 iterparse support for large XML in pandas.read_xml, available with either the lxml (default) or etree parser.
employees_df = (
    pd.read_xml(
        "Input.xml",
        iterparse={"Employee": ["id", "FName", "LName", "City"]},
        names=["Employee_ID", "FName", "LName", "City"],
        parser="etree"
    ).set_index("Employee_ID")
)
(In testing, this parsed Wikipedia's 12 GB daily article dump XML into a 3.5-million-row data frame in 5 minutes on a laptop with 8 GB of RAM. See the output in the docs.)
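If the final target is the database table from your commented-out code, DataFrame.to_sql can also write the result in batches through its chunksize parameter, so the insert itself happens incrementally. A sketch only, reusing the connection-string placeholders and target_table name from your question:

from sqlalchemy import create_engine

con_eng = create_engine('oracle://[user]:[pass]@[host]:[port]/[schema]', echo=False)

# chunksize writes rows in batches of 10,000 instead of one large insert;
# index=True keeps the Employee_ID index as a column in the table
employees_df.to_sql(name='target_table', con=con_eng, if_exists='append',
                    index=True, chunksize=10000)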
CodePudding user response:
Try the following PowerShell:
using assembly System.Xml
using assembly System.Xml.Linq

$FILENAME = "c:\temp\test.xml"

$reader = [System.Xml.XmlReader]::Create($FILENAME)

while ($reader.EOF -eq $False)
{
    if ($reader.Name -ne "Employee")
    {
        $reader.ReadToFollowing("Employee")
    }
    if ($reader.EOF -eq $False)
    {
        # Materialize only the current Employee element, then move the reader past it
        $employee = [System.Xml.Linq.XElement]::ReadFrom($reader)

        $id = $employee.Attribute("id").Value
        $fname = $employee.Element("FName").Value
        $lname = $employee.Element("LName").Value
        $city = $employee.Element("City").Value

        Write-Output "id = $id first = $fname last = $lname city = $city"
    }
}
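For reference, the same record-at-a-time streaming can be sketched in Python with the standard library's xml.etree.ElementTree.iterparse (this is an aside, not part of the PowerShell approach above; the file name is taken from the question):

import xml.etree.ElementTree as ET

# Pull-parse one <Employee> at a time; clearing processed elements from the
# root keeps memory use flat even for very large files.
context = ET.iterparse('employee_xml.xml', events=('start', 'end'))
_, root = next(context)                      # grab the root element first

for event, elem in context:
    if event == 'end' and elem.tag == 'Employee':
        emp_id = elem.get('id')
        fname = elem.findtext('FName')
        lname = elem.findtext('LName')
        city = elem.findtext('City')
        print(f"id = {emp_id} first = {fname} last = {lname} city = {city}")
        root.clear()                         # drop already-processed elements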