I have a very large XML file that I need to split into several files based on a particular tag. The XML file looks something like this:
<xml>
<file id="13">
<head>
<url>...</url>
<pagesize>...</pagesize>
<dtime>...</dtime>
<encoding>UTF-8</encoding>
<content-type>text/html; charset=utf-8</content-type>
<keywords>...</keywords>
<speaker>...</speaker>
<talkid>2458</talkid>
<videourl>...</videourl>
<videopath>...</videopath>
<date>...</date>
<title>...</title>
<description>...</description>
<transcription>
<seekvideo id="645">So in college,</seekvideo>
...
</transcription>
<wordnum>...</wordnum>
<charnum>...</charnum>
</head>
<content> *** This is the content I am trying to save *** </content>
</file>
<file>
...
</file>
</xml>
I want to extract the content of each file element and save it to a file named after its talkid.
Here is the code I have tried:
import xml.etree.ElementTree as ET
all_talks = 'path\\to\\big\\file'
context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        content = elem.find('content').text
        title = elem.find('talkid').text
        filename = format(title + ".txt")
        with open(filename, 'wb', encoding='utf-8') as f:
            f.write(ET.tostring(content), encoding='utf-8')
But I get the following error:
AttributeError: 'NoneType' object has no attribute 'text'
CodePudding user response:
If you're already using .iterparse(), a more generic approach is to rely on the events alone and pick up each value as its element closes:
import xml.etree.ElementTree as ET
from pathlib import Path
all_talks = Path(r'file.xml')
context = ET.iterparse(all_talks, events=('start', 'end'))
# Initialise so the first <file> is handled even if a tag is missing
content = title = None
for event, element in context:
    if event == 'end':
        if element.tag == 'talkid':
            title = element.text
        elif element.tag == 'content':
            content = element.text
        elif element.tag == 'file':
            # Write only when both values were seen for this <file>
            if title and content:
                with open(all_talks.with_name(title + '.txt'), 'w', encoding='utf-8') as f:
                    f.write(content)
            # Reset so values never leak into the next <file>
            content = title = None
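Since the file is described as very large, it may also be worth releasing each file element once it has been written. The following is only a sketch under assumptions: split_talks and out_dir are hypothetical names, not part of the original code, and the path 'head/talkid' assumes the structure shown in the question.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_talks(source, out_dir):
    """Write each <file>'s <content> text to <talkid>.txt; return the paths written.

    `source` may be a filename or a binary file object (hypothetical helper).
    """
    out_dir = Path(out_dir)
    written = []
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != 'file':
            continue
        # findtext returns None instead of raising when the element is missing
        talkid = elem.findtext('head/talkid')
        content = elem.findtext('content')
        if talkid and content is not None:
            path = out_dir / (talkid.strip() + '.txt')
            path.write_text(content, encoding='utf-8')
            written.append(path)
        # Drop the children and text of the finished <file> so they can be freed
        elem.clear()
    return written
```

It could be called as split_talks('file.xml', '.'). Note that elem.clear() empties the element but leaves the emptied file element attached to the root, so memory use still grows slowly with the number of talks.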
CodePudding user response:
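Try it this way. The issue is that talkid is a child of the head tag, not of the file tag. find() with a plain tag name only searches direct children, so elem.find('talkid') returns None and accessing .text on it raises the AttributeError.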
import xml.etree.ElementTree as ET
all_talks = 'file.xml'
context = ET.iterparse(all_talks, events=('end', ))
for event, elem in context:
    if elem.tag == 'file':
        head = elem.find('head')
        content = elem.find('content').text
        title = head.find('talkid').text
        filename = title + ".txt"
        with open(filename, 'wb') as f:  # 'wt' or just 'w' if you want to write text instead of bytes
            f.write(content.encode())  # in which case you would remove the .encode()
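To see the difference between a direct-child lookup and a path lookup in isolation, here is a small sketch. The sample string is a made-up miniature of the structure in the question, not the real data.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the <file> structure from the question
sample = """<file id="13">
  <head><talkid>2458</talkid></head>
  <content>some text</content>
</file>"""

elem = ET.fromstring(sample)

direct = elem.find('talkid')           # plain tag name: direct children only
nested = elem.find('head/talkid')      # path that goes through <head>
anywhere = elem.find('.//talkid')      # any descendant, at any depth
talkid = elem.findtext('head/talkid')  # text of the match, or None if absent
```

Here direct is None, which is what triggers the AttributeError in the question, while the other three lookups reach the talkid element. When testing the result of find(), compare with "is not None", because an element without children is falsy.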
CodePudding user response:
You can also use Beautiful Soup to parse the XML.