To build a knowledge graph, I needed to convert a 3.39 GB XML data file to CSV format, but when I tried it in Python, my computer ran out of memory.
Here is the code:
import glob
import pandas as pd
import xml.etree.ElementTree as ET
import os

def xml_to_csv(path):
    xml_list = []
    for xml_file in glob.glob(path + '/Badges.xml'):
        print('**********************************')
        print(xml_file)
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('row'):
            value = (member.attrib.get('Id', ''),
                     member.attrib.get('UserId', ''),
                     member.attrib.get('Name', ''),
                     member.attrib.get('Date', ''),
                     member.attrib.get('Class', ''),
                     member.attrib.get('TagBased', ''),
                     )
            xml_list.append(value)
    column_name = ['Id', 'UserId', 'Name', 'Date', 'Class', 'TagBased']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df
if __name__ == "__main__":
    xml_path = 'D:/【论文】/【数据集】/1-StackOverflow数据集-2008-2021/stackoverflow.com-Badges'
    print(os.path.exists(xml_path))
    xml_df = xml_to_csv(xml_path)
    print('**********************************')
    print(xml_df)
    xml_df.to_csv('D:/【论文】/【数据集】/1-StackOverflow数据集-2008-2021/stackoverflow.com-Badges/Badges.csv', index=False)
    print('Successfully converted xml to csv.')
CodePudding user response:
There are various technologies for streamed processing of XML. One of them is XSLT 3.0, where you would write
<xsl:mode streamable="yes"/>
<xsl:output method="text"/>
<xsl:template match="row">
  <xsl:value-of select="@Id, @UserId, @Name, @Date, @Class, @TagBased"
                separator=","/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>
CodePudding user response:
You can use ET.iterparse() to parse a large XML file in "streaming" mode, without loading all of it into memory first. iterparse lets you react to events as they occur while reading the file: a start event fires when a start tag (e.g. <row Id="..." UserId="...">) has been read, and an end event fires when an end tag (e.g. </row>) has been read. There are more events for more specialized tasks.
For reading out attribute values, listening to start events is good enough. Simple example:
import xml.etree.ElementTree as ET

def parse_rows(xml_path):
    for event, elem in ET.iterparse(xml_path, ['start']):
        if elem.tag == 'row':
            yield (
                elem.attrib.get('Id', ''),
                elem.attrib.get('UserId', ''),
                elem.attrib.get('Name', ''),
                elem.attrib.get('Date', ''),
                elem.attrib.get('Class', ''),
                elem.attrib.get('TagBased', ''),
            )
            elem.clear()
for row in parse_rows(path + '/Badges.xml'):
    print(row)
parse_rows() is a generator. You should be able to fill a dataframe directly from it, without building a temporary xml_list first:
df = pd.DataFrame(parse_rows(path + '/Badges.xml'))