My problem is parsing an XML file whose records start directly under <ENVELOPE>.
from bs4 import BeautifulSoup

Filename = input("Enter file name to be imported: ")
imp_ext = ".xml"
imp_file = Filename + imp_ext

# The file is in UTF-16BE format
with open(imp_file, encoding='UTF-16') as fp:
    soup = BeautifulSoup(fp, 'xml')
Soup has this data:
<?xml version="1.0" encoding="UTF-8"?>
<ENVELOPE>
<DSPACCNAME>
<DSPDISPNAME>206375</DSPDISPNAME>
</DSPACCNAME>
<DSPSTKINFO>
<DSPSTKOUT>
<DSPOUTQTY>1 EA</DSPOUTQTY>
<DSPOUTRATE>715.00</DSPOUTRATE>
<DSPNETTCRAMTA>715.00</DSPNETTCRAMTA>
<DSPCRAMTA>715.00</DSPCRAMTA>
<DSPCONSAMT>-358.62</DSPCONSAMT>
<DSPGPAMT>356.38</DSPGPAMT>
<DSPGPPERC>49.84 %</DSPGPPERC>
</DSPSTKOUT>
<DSPSTKCL>
<DSPCLQTY>3 EA</DSPCLQTY>
<DSPCLRATE>358.62</DSPCLRATE>
<DSPCLAMTA>-1075.87</DSPCLAMTA>
</DSPSTKCL>
</DSPSTKINFO>
<SSBATCHNAME>
<SSBATCH />
<SSGODOWN>Ware -House (Mankoli-Bhiwandi)</SSGODOWN>
</SSBATCHNAME>
<DSPSTKINFO>
<DSPSTKOUT>
<DSPOUTQTY>1 EA</DSPOUTQTY>
<DSPOUTRATE>715.00</DSPOUTRATE>
<DSPNETTCRAMTA>715.00</DSPNETTCRAMTA>
<DSPCRAMTA>715.00</DSPCRAMTA>
<DSPCONSAMT>-358.62</DSPCONSAMT>
<DSPGPAMT>356.38</DSPGPAMT>
<DSPGPPERC>49.84 %</DSPGPPERC>
</DSPSTKOUT>
<DSPSTKCL>
<DSPCLQTY>3 EA</DSPCLQTY>
<DSPCLRATE>358.62</DSPCLRATE>
<DSPCLAMTA>-1075.87</DSPCLAMTA>
</DSPSTKCL>
</DSPSTKINFO>
</ENVELOPE>
I am then trying to extract data from the XML file. This is what I tried:
for a in soup.findAll('DSPACCNAME'):
    for b in soup.findAll('DSPSTKINFO'):
        print(a.DSPDISPNAME)
        print(b.DSPCLQTY)
        print(b.DSPCLRATE)
        print(b.DSPCLAMTA)
I am getting output that is something like this:
206375
1 EA
715.00
715.00
715.00
-358.62
356.38
49.84 %
The issue is that there is no parent element that marks the boundary of each record. I am trying to extract the data in CSV format. The data comes from Tally; it is the Stock Summary report, to be exact. I have no idea how to proceed. The data also contains empty values, and those need to be captured as they are.
CodePudding user response:
I highly recommend not using BeautifulSoup, and in general do not try to parse the XML yourself.
Instead, use a tool that's built to handle XML and relationships that are not strictly parent-child. Specifically, it looks like you need to handle the sibling relationship:
- DSPACCNAME
  - display name
- DSPSTKINFO
  - ... rest of the data you care about
XSLT and XPath can handle this easily, especially finding the previous DSPACCNAME element from any DSPSTKINFO element.
This is not a complete example, and I don't know what you expect the final output to be, but it should show how this approach can solve your problem:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- For CSV, choose flat text -->
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Print CSV header -->
    <xsl:text>DispName,StkOut_OutQty,StkOut_OutRate,StkOut_NetTCramta
</xsl:text>
    <!-- Start processing rows -->
    <xsl:apply-templates select="ENVELOPE/DSPSTKINFO"/>
  </xsl:template>
  <!-- It looks like DSPSTKINFO will be your "row" -->
  <xsl:template match="ENVELOPE/DSPSTKINFO">
    <!-- Get the nearest previous DSPDISPNAME; [1] picks the closest sibling on this reverse axis -->
    <xsl:value-of select="preceding-sibling::DSPACCNAME[1]/DSPDISPNAME"/>
    <xsl:text>,</xsl:text>
    <!-- Get the "row data" -->
    <xsl:value-of select="DSPSTKOUT/DSPOUTQTY"/>
    <xsl:text>,</xsl:text>
    <xsl:value-of select="DSPSTKOUT/DSPOUTRATE"/>
    <xsl:text>,</xsl:text>
    <xsl:value-of select="DSPSTKOUT/DSPNETTCRAMTA"/>
    <!-- Print a newline to "finish" this row -->
    <xsl:text>
</xsl:text>
  </xsl:template>
</xsl:stylesheet>
When I run that against your sample XML, using the open-source XSLT processor xsltproc:
xsltproc main.xsl input.xml
I get:
DispName,StkOut_OutQty,StkOut_OutRate,StkOut_NetTCramta
206375,1 EA,715.00,715.00
206375,1 EA,715.00,715.00
If you need to run this from Python, the 3rd-party lxml module has an XSLT class so you can run the transform in code.
Otherwise, if you must process the XML by hand in Python, look at the XML Parser example from the docs. You'll see that you need to do things for yourself, like:
- set up event handlers for when a start tag is processed
- recognize its name
- (not shown, but...) use some state to know where in the tree/structure you are when an event is called
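The three points above can be sketched with the standard library's xml.parsers.expat module. This is only an illustration, not a full solution: the function name leaf_paths and the trimmed SAMPLE document are made up for this example, and a stack of open tag names is just one possible way to keep the state.

```python
import xml.parsers.expat

# Trimmed-down stand-in for the real export (an assumption for this sketch).
SAMPLE = """<ENVELOPE>
<DSPACCNAME><DSPDISPNAME>206375</DSPDISPNAME></DSPACCNAME>
<DSPSTKINFO>
<DSPSTKOUT><DSPOUTQTY>1 EA</DSPOUTQTY></DSPSTKOUT>
<DSPSTKCL><DSPCLRATE></DSPCLRATE></DSPSTKCL>
</DSPSTKINFO>
</ENVELOPE>"""


def leaf_paths(xml_text):
    """Return (path, text) for each leaf element, in document order."""
    stack = []       # names of the currently open elements (our "state")
    text = []        # character data seen since the last start or end tag
    leaves = []
    is_leaf = False  # True until the current element opens a child or closes

    def start_element(name, attrs):
        nonlocal is_leaf
        stack.append(name)
        text.clear()
        is_leaf = True

    def char_data(data):
        text.append(data)

    def end_element(name):
        nonlocal is_leaf
        if is_leaf:
            # The stack tells us where in the tree this value lives.
            leaves.append(("/".join(stack), "".join(text).strip()))
        is_leaf = False
        text.clear()
        stack.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data
    parser.Parse(xml_text, True)
    return leaves


if __name__ == "__main__":
    for path, value in leaf_paths(SAMPLE):
        print(path, repr(value))
```

Empty elements such as DSPCLRATE in the sample still show up, with an empty string as their value, so nothing is silently dropped.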
CodePudding user response:
Here is my proposed solution using the SAX parser. It performs very well, but shaping its output into the result you need takes a bit of effort.
# https://stackoverflow.com/questions/70437073/xml-with-multiple-tags
from collections import OrderedDict
from xml.sax.handler import ContentHandler
import xml.sax
import sys

class CustomHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        tmp = ["DSPACCNAME", "DSPSTKOUT", "DSPSTKCL"]  # extensible with further tags
        self.tags = OrderedDict()
        for t in tmp:
            self.tags.setdefault(t, False)

    def startElement(self, name, attrs):
        # Entering a tracked tag: switch it on and print a header
        if name in self.tags:
            self.tags[name] = True
            sys.stdout.write("\n%s\n%s\n" % (name.strip(), "=" * 15))

    def characters(self, content):
        # Print text only while inside a tracked tag
        for v in self.tags.values():
            if v:
                sys.stdout.write("%3s" % content.strip())

    def endElement(self, name):
        # Leaving a tracked tag: switch it off again
        if name in self.tags:
            self.tags[name] = False

parser = xml.sax.make_parser()
handler = CustomHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")
RESULTS:
DSPACCNAME
===============
206375
DSPSTKOUT
===============
1 EA 715.00 715.00 715.00 -358.62 356.38 49.84 %
DSPSTKCL
===============
3 EA 358.62 -1075.87
DSPSTKOUT
===============
1 EA 715.00 715.00 715.00 -358.62 356.38 49.84 %
DSPSTKCL
===============
3 EA 358.62 -1075.87
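To get from this printed output to the CSV the question asks for, the handler could collect values into one row per DSPSTKINFO instead of writing them to stdout. The following is a minimal sketch under some assumptions: the class name CsvRowHandler, the function xml_to_csv, the FIELDS column selection and the shortened SAMPLE document are all hypothetical, and each row is assumed to belong to the most recent DSPDISPNAME seen before it.

```python
import csv
import io
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical column selection; extend it with whichever tags you need.
FIELDS = ["DSPDISPNAME", "DSPOUTQTY", "DSPOUTRATE",
          "DSPCLQTY", "DSPCLRATE", "DSPCLAMTA"]

# Shortened stand-in for the real export (an assumption for this sketch).
SAMPLE = """<ENVELOPE>
<DSPACCNAME><DSPDISPNAME>206375</DSPDISPNAME></DSPACCNAME>
<DSPSTKINFO>
<DSPSTKOUT><DSPOUTQTY>1 EA</DSPOUTQTY><DSPOUTRATE>715.00</DSPOUTRATE></DSPSTKOUT>
<DSPSTKCL><DSPCLQTY>3 EA</DSPCLQTY><DSPCLRATE>358.62</DSPCLRATE><DSPCLAMTA></DSPCLAMTA></DSPSTKCL>
</DSPSTKINFO>
</ENVELOPE>"""


class CsvRowHandler(ContentHandler):
    def __init__(self, fields):
        super().__init__()
        self.fields = fields
        self.rows = []
        self.account = ""    # last display name seen
        self.row = None      # row being filled while inside DSPSTKINFO
        self.current = None  # wanted field currently being read
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "DSPSTKINFO":
            # Start every row with empty strings so missing or empty
            # values are still written as empty CSV cells.
            self.row = dict.fromkeys(self.fields, "")
            self.row["DSPDISPNAME"] = self.account
        if name in self.fields:
            self.current = name
            self.buffer = []

    def characters(self, content):
        if self.current is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.current:
            value = "".join(self.buffer).strip()
            if name == "DSPDISPNAME":
                self.account = value
            elif self.row is not None:
                self.row[name] = value
            self.current = None
        if name == "DSPSTKINFO":
            self.rows.append(self.row)
            self.row = None


def xml_to_csv(xml_text, fields=FIELDS):
    """Parse the XML text and return the collected rows as CSV text."""
    handler = CsvRowHandler(fields)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(handler.rows)
    return out.getvalue()


if __name__ == "__main__":
    print(xml_to_csv(SAMPLE), end="")
```

For a real export you would read the file with the right encoding first and pass the resulting text to xml_to_csv, or call parser.parse on the file as in the code above.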