I'm processing XML documents like the following.
<tok lemma="i" xpos="CC">e</tok>
<tok lemma="que" xpos="CS">que</tok>
<tok lemma="aquey" xpos="PD0MP0">aqueys</tok>
<tok lemma="marit" xpos="NCMP000">marits</tok>
<tok lemma="estar" xpos="VMIP3P0">stiguen</tok>
[...]
<tok lemma="habitar" xpos="VMIP3P0">habiten</tok>
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FS0">aqueix</tok>
<tok lemma="terra" xpos="NCMS000">món</tok>
[...]
<tok lemma="viure" xpos="VMIP3P0">viuen</tok>
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FP0">aqueixes</tok>
<tok lemma="casa" xpos="NCFP000">cases</tok>
I'm using the code below to change the attributes of certain elements whenever certain conditions are met. The code works as expected and I'm getting the output I want. However, processing all the files takes far too long. As you can see, I have some print statements that let me monitor the whole process, and sometimes 5 minutes or more pass between two prints.
In fact, I've had to kill the process because it was taking too long. I know it is working fine because the output files that I get are correctly modified and when I did a test with a much smaller number of files the whole process managed to run to its end without a glitch (although taking forever to finish).
I have one of the new Macs with the M1 Max chip, so I thought this would go a lot faster, but it is just as slow as on an Intel machine. Is this normal when using lxml, or am I, as a novice, just writing very inefficient code? Is there any way to make this kind of thing faster?
Thanks in advance for any help you can provide.
#!/usr/bin/env python
# coding: utf-8
import os
import lxml.etree as et
#ROOT = '/Users/josepm.fontana/Downloads/_POTI'
ROOT = '/Users/josepm.fontana/Downloads/CICA_TESTIN'
ext = '.xml'
def xml_change(root_element):
    # select every <tok> whose next <tok> sibling has an xpos starting with "N"
    for el in root_element.xpath('//tok[following-sibling::tok[1][re:match(@xpos, "^N")]]',
                                 namespaces={"re": "http://exslt.org/regular-expressions"}):
        if el.text == 'aquest' or el.text == 'Aquest' or el.text == 'AQUEST' or el.text == 'aqueix' or el.text == 'Aqueix' or el.text == 'AQUEIX':
            print(el.text)
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquest')
        if el.text == 'aquel' or el.text == 'Aquel' or el.text == 'AQUEL' or el.text == 'aquell' or el.text == 'Aquell' or el.text == 'AQUELL':
            print(el.text)
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquell')
        if el.text == 'aquests' or el.text == 'Aquests' or el.text == 'AQUESTS':
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquest')
        if el.text == 'aquells' or el.text == 'Aquells' or el.text == 'AQUELLS' or el.text == 'aqueys' or el.text == 'Aqueys' or el.text == 'AQUEYS' or el.text == 'aqueyls' or el.text == 'Aqueyls' or el.text == 'AQUEYLS' or el.text == 'aqueys' or el.text == 'Aqueys' or el.text == 'AQUEYS':
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquell')
        if el.text == 'aquestas' or el.text == 'Aquestas' or el.text == 'AQUESTAS' or el.text == 'aqueixes' or el.text == 'Aqueixes' or el.text == 'AQUEIXES':
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquest')
        if el.text == 'aqualas' or el.text == 'Aqualas' or el.text == 'AQUALAS' or el.text == 'aquelas' or el.text == 'Aquelas' or el.text == 'AQUELAS' or el.text == 'aqueles' or el.text == 'Aqueles' or el.text == 'AQUELES' or el.text == 'aquellas' or el.text == 'Aquellas' or el.text == 'AQUELLAS' or el.text == 'aquelles' or el.text == 'Aquelles' or el.text == 'AQUELLES':
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquell')
# iterate all dirs
for dirpath, dirs, files in os.walk(ROOT):
    # iterate all files
    for file in files:
        if file.endswith(ext):
            # join the directory currently being walked and the file name
            file_path = os.path.join(dirpath, file)
            # load root element from file
            root = et.parse(file_path).getroot()
            # change the matching elements in the xml tree
            xml_change(root)
            # init tree object from root
            tree = et.ElementTree(root)
            # save cleaned xml tree object to file. Important to specify encoding
            tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8', doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)
Answer:
The code evaluates every condition for each element, even though at most one of them can match. One possible optimization is to chain them with if/elif, so that evaluation stops at the first match:
if el.text == 'aquest' or el.text == 'Aquest' or el.text == 'AQUEST' or el.text == 'aqueix' or el.text == 'Aqueix' or el.text == 'AQUEIX':
    print(el.text)
    print('Current value is:', el.get('lemma'), el.get('xpos'))
    el.set('xpos', 'DD0MS0')
    el.set('lemma', 'aquest')
elif el.text == 'aquel' or el.text == 'Aquel' or el.text == 'AQUEL' or el.text == 'aquell' or el.text == 'Aquell' or el.text == 'AQUELL':
    print(el.text)
    print('Current value is:', el.get('lemma'), el.get('xpos'))
    el.set('xpos', 'DD0MS0')
    el.set('lemma', 'aquell')
# other elif branches here
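Going one step further, the whole chain of comparisons could probably be replaced by a single dictionary lookup. The following is only a sketch: the names FORMS, TAGS and retag are made up for illustration, and it assumes that matching on the lowercased text is acceptable. The original conditions only accept lowercase, capitalized and all-caps spellings, so a mixed-case form such as 'aQuest' would now match as well. It mainly tidies the code; how much time it saves would need to be measured.

```python
# Hypothetical lookup table built from the conditions in the question:
# (xpos, lemma) -> lowercased surface forms that should receive that tag.
FORMS = {
    ('DD0MS0', 'aquest'): ['aquest', 'aqueix'],
    ('DD0MS0', 'aquell'): ['aquel', 'aquell'],
    ('DD0MP0', 'aquest'): ['aquests'],
    ('DD0MP0', 'aquell'): ['aquells', 'aqueys', 'aqueyls'],
    ('DD0FP0', 'aquest'): ['aquestas', 'aqueixes'],
    ('DD0FP0', 'aquell'): ['aqualas', 'aquelas', 'aqueles', 'aquellas', 'aquelles'],
}

# Invert the table once, so that each token costs a single dict lookup.
TAGS = {form: tag for tag, forms in FORMS.items() for form in forms}


def retag(text):
    """Return (xpos, lemma) for a known demonstrative, or None otherwise."""
    if text is None:  # elements without text have el.text == None
        return None
    return TAGS.get(text.lower())
```

Inside the loop, the if blocks would then shrink to one call, tag = retag(el.text), followed by el.set('xpos', tag[0]) and el.set('lemma', tag[1]) whenever tag is not None.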
Also, the plain XPath function starts-with() could be used instead of a regular expression:
//tok[following-sibling::tok[1][starts-with(@xpos, "N")]]
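As a rough sketch of how that expression might be used from lxml, it can be compiled once with etree.XPath and reused for every file. The sample document and the names find_before_noun and before_noun are invented for illustration, and lxml is assumed to be installed. The second function is a different technique, not the XPath above: it walks the tree with iter() and itersiblings() and does the prefix test in Python. Which of the two is faster on the real corpus is something to time rather than assume.

```python
import lxml.etree as et

# Compile once; starts-with() is plain XPath 1.0, so no EXSLT namespace is needed.
find_before_noun = et.XPath('//tok[following-sibling::tok[1][starts-with(@xpos, "N")]]')


def before_noun(root):
    """Alternative without XPath: yield each <tok> whose next <tok> sibling is a noun."""
    for el in root.iter('tok'):
        nxt = next(el.itersiblings('tok'), None)  # first following <tok> sibling, if any
        if nxt is not None and nxt.get('xpos', '').startswith('N'):
            yield el


# Made-up sample modelled on the data in the question.
sample = et.fromstring(
    '<s>'
    '<tok lemma="en" xpos="SPS00">en</tok>'
    '<tok lemma="aquex" xpos="PD0FP0">aqueixes</tok>'
    '<tok lemma="casa" xpos="NCFP000">cases</tok>'
    '</s>'
)

xpath_matches = [el.text for el in find_before_noun(sample)]
python_matches = [el.text for el in before_noun(sample)]
print(xpath_matches, python_matches)
```

In xml_change, the loop header would then read for el in find_before_noun(root_element), with the rest of the body unchanged.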