Split a large xml file into multiple based on tag in Python

Time:10-24

I have a very large XML file that I need to split into several files based on a particular tag. The XML file looks something like this:

<xml>
<file id="13">
  <head>
    <url>...</url>
    <pagesize>...</pagesize>
    <dtime>...</dtime>
    <encoding>UTF-8</encoding>
    <content-type>text/html; charset=utf-8</content-type>
    <keywords>...</keywords>
    <speaker>...</speaker>
    <talkid>2458</talkid>
    <videourl>...</videourl>
    <videopath>...</videopath>
    <date>...</date>
    <title>...</title>
    <description>...</description>
    <transcription>
      <seekvideo id="645">So in college,</seekvideo>
      ...
    </transcription>
    <wordnum>...</wordnum>
    <charnum>...</charnum>
  </head>
  <content> *** This is the content I am trying to save *** </content>
</file>
<file>
      ... 
</file>
</xml>

I want to extract the content of each <file> element and save it to a file named after its talkid.

Here is the code I have tried:

import xml.etree.ElementTree as ET

all_talks = 'path\\to\\big\\file'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')

But I get the following error:

AttributeError: 'NoneType' object has no attribute 'text'

CodePudding user response:

If you're already using .iterparse(), it's more generic to rely on the events alone and record the values you need as their elements close:

import xml.etree.ElementTree as ET
from pathlib import Path

all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))

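for event, element in context:
    if event == 'end':
        if element.tag == 'talkid':
            title = element.text      # remember the id of the current talk
        elif element.tag == 'content':
            content = element.text    # remember the text of the current talk
        elif element.tag == 'file' and title and content:
            # </file> reached: write the talk next to the source file
            with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                f.write(content)
    elif element.tag == 'file':
        # <file> opened: reset the values collected for the previous talk
        content = title = None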
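
Since the file is described as very large, it may also be worth releasing each <file> element once it has been written, because iterparse() otherwise keeps the whole tree in memory. The following is only a sketch under a few assumptions: split_talks, out_dir and the two-talk sample document (including the second id and talkid) are made up for illustration and are not part of the question.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def split_talks(source, out_dir):
    """Write each <file>'s <content> to out_dir/<talkid>.txt; return the paths written."""
    out_dir = Path(out_dir)
    written = []
    # An 'end' event fires once an element and all of its children have been read.
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != 'file':
            continue
        talk_id = elem.findtext('head/talkid')   # None if the element is missing
        content = elem.findtext('content')
        if talk_id and content:
            target = out_dir / (talk_id + '.txt')
            target.write_text(content, encoding='utf-8')
            written.append(target)
        elem.clear()   # drop the children and text of this <file> to free memory
    return written


# Hypothetical stand-in for the real file, trimmed to the tags that matter here.
sample = (
    '<xml>'
    '<file id="13"><head><talkid>2458</talkid></head><content>first talk</content></file>'
    '<file id="14"><head><talkid>2459</talkid></head><content>second talk</content></file>'
    '</xml>'
)

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'sample.xml'
    src.write_text(sample, encoding='utf-8')
    paths = split_talks(src, tmp)
    names = sorted(p.name for p in paths)
    first = (Path(tmp) / '2458.txt').read_text(encoding='utf-8')
```

Calling elem.clear() only on the closed <file> element keeps the loop simple; the emptied elements stay attached to the root, which is usually small compared with the content that was dropped.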


CodePudding user response:

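Try it this way. The issue is that talkid is a child of the head tag, not of the file tag. find() with a bare tag name only searches direct children, so elem.find('talkid') returns None and accessing .text on it raises the AttributeError.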

import xml.etree.ElementTree as ET

all_talks = 'file.xml'

context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        head = elem.find('head')
        content = elem.find('content').text
        title = head.find('talkid').text
        filename = title + ".txt"
        with open(filename, 'wb') as f:  # 'wt' or just 'w' if you want to write text instead of bytes
            f.write(content.encode())    # in which case you would remove the .encode() 
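
To see the difference between searching direct children and searching by path, here is a small check on an in-memory element. It is a minimal sketch: the one-line document is a trimmed stand-in for a single <file> from the question, and 'head/nothing' is a deliberately non-existent path.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one <file> element from the question.
file_elem = ET.fromstring(
    '<file id="13"><head><talkid>2458</talkid></head><content>text</content></file>'
)

direct = file_elem.find('talkid')             # only direct children are searched
by_path = file_elem.find('head/talkid')       # explicit path through <head>
anywhere = file_elem.find('.//talkid')        # any depth below <file>
missing = file_elem.findtext('head/nothing')  # findtext returns None instead of raising

print(direct, by_path.text, anywhere.text, missing)  # prints: None 2458 2458 None
```

Using findtext() with a path avoids the AttributeError when an element is absent, at the cost of having to check for None yourself.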

CodePudding user response:

You can also use Beautiful Soup to parse XML, though it builds the whole document in memory, which may matter for a very large file.
