The logic here is: if a page element does not contain "| kasteeltype", remove that page element; otherwise keep it.
# Import ElementTree
import defusedxml.ElementTree as ET

# Set tree & root
tree = ET.parse("nlwiki-20221020-pages-meta-current1.xml-p1p134538")
root = tree.getroot()

# Namespaces
NSPage = "{http://www.mediawiki.org/xml/export-0.10/}page"
NSRevision = "{http://www.mediawiki.org/xml/export-0.10/}revision"
NSText = "{http://www.mediawiki.org/xml/export-0.10/}text"

# Modify XML
for page in root.findall(NSPage):
    for revision in page.findall(NSRevision):
        text = revision.find(NSText)
        kasteeltype = "| kasteeltype"
        if kasteeltype not in text.text:
            root.remove(page)

# Output
tree.write("output.xml")
This code results in the following error message:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [8], line 18
     16         text = revision.find(NSText)
     17         kasteeltype = "| kasteeltype"
---> 18         if kasteeltype not in text.text:
     19             root.remove(page)
     20         else:

TypeError: argument of type 'NoneType' is not iterable
I'm a bit clueless now about how to proceed.
The XML-file can be found here: https://dumps.wikimedia.org/nlwiki/20221020/nlwiki-20221020-pages-meta-current1.xml-p1p134538.bz2
It is quite a large file since it is a wikipedia dump.
The expected result is that every page element whose text element (under the revision element) does not contain the string "| kasteeltype" is removed.
CodePudding user response:
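Verify that the text element was found and that it actually has text. revision.find(NSText) returns None when there is no matching element, and text.text is None when the element is empty; the latter is what raises the TypeError in your traceback. findall() always returns a list (possibly empty), so it needs no None check.

kasteeltype = "| kasteeltype"
for page in root.findall(NSPage):
    for revision in page.findall(NSRevision):
        text = revision.find(NSText)
        # Skip revisions whose text element is missing or empty
        if text is not None and text.text is not None:
            if kasteeltype not in text.text:
                root.remove(page)
                # The page is gone; a second remove() would raise ValueError
                break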
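
Note that the loop above keeps pages whose text element is missing or empty, although the question asks for every page without "| kasteeltype" to be removed. A minimal sketch of that stricter variant follows. It makes some assumptions: the standard-library xml.etree.ElementTree stands in for defusedxml.ElementTree, a small inline SAMPLE document stands in for the real dump, and the helper names page_mentions and keep_matching_pages are hypothetical.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for defusedxml.ElementTree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"
NEEDLE = "| kasteeltype"

# Hypothetical three-page sample standing in for the real Wikipedia dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>{{Infobox kasteel
| kasteeltype = burcht}}</text></revision></page>
  <page><title>B</title><revision><text>no infobox here</text></revision></page>
  <page><title>C</title><revision><text /></revision></page>
</mediawiki>"""


def page_mentions(page, needle):
    """Return True if any revision text of the page contains needle."""
    for revision in page.findall(NS + "revision"):
        text = revision.find(NS + "text")
        # A missing element or an empty element both count as "no text".
        content = text.text if text is not None else None
        if content is not None and needle in content:
            return True
    return False


def keep_matching_pages(root, needle=NEEDLE):
    """Remove every page that does not mention needle; return root."""
    # findall() returns a list, so removing from root while looping is safe.
    for page in root.findall(NS + "page"):
        if not page_mentions(page, needle):
            root.remove(page)
    return root


root = keep_matching_pages(ET.fromstring(SAMPLE))
titles = [p.findtext(NS + "title") for p in root.findall(NS + "page")]
print(titles)  # ['A']
```

Deciding once per page, instead of inside the revision loop, means remove() is called at most once for each page, and a page with an empty text element (page C in the sample) is dropped along with the ones that have text but no match.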