Python and LXML: extremely slow, more efficient code?

I'm processing XML documents like the following.

<tok lemma="i" xpos="CC">e</tok> 
<tok lemma="que" xpos="CS">que</tok> 
<tok lemma="aquey" xpos="PD0MP0">aqueys</tok> 
<tok lemma="marit" xpos="NCMP000">marits</tok> 
<tok lemma="estar" xpos="VMIP3P0">stiguen</tok>  
[...]
<tok lemma="habitar" xpos="VMIP3P0">habiten</tok> 
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FS0">aqueix</tok> 
<tok lemma="terra" xpos="NCMS000">món</tok>
[...]
<tok lemma="viure" xpos="VMIP3P0">viuen</tok> 
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FP0">aqueixes</tok> 
<tok lemma="casa" xpos="NCFP000">cases</tok>

I'm using the following code to change the attributes of certain elements whenever certain conditions are met. The code works as expected and I'm getting the output I want. However, processing all the files takes far too long. The print statements in the code let me monitor the whole process, and sometimes 5 minutes or more pass between two prints.

In fact, I've had to kill the process because it was taking too long. I know it is working fine because the output files that I get are correctly modified and when I did a test with a much smaller number of files the whole process managed to run to its end without a glitch (although taking forever to finish).

I have one of the new Macs with the M1 Max chip, so I thought this would go a lot faster, but it is just as slow as it was with the Intel chip. Is this normal when using lxml, or am I, as a novice, just writing very inefficient code? Is there any way to make this kind of thing faster?

Thanks in advance for any help you can provide.

#!/usr/bin/env python
# coding: utf-8
import os
import lxml.etree as et

#ROOT = '/Users/josepm.fontana/Downloads/_POTI'
ROOT = '/Users/josepm.fontana/Downloads/CICA_TESTIN'
ext = ('.xml')


def xml_change(root_element):
    # query the element passed in rather than relying on the global `root`
    for el in root_element.xpath('//tok[following-sibling::tok[1][re:match(@xpos, "^N")]]',
        namespaces={"re": "http://exslt.org/regular-expressions"}):
        if el.text == 'aquest' or el.text == 'Aquest' or el.text == 'AQUEST' or el.text == 'aqueix' or el.text == 'Aqueix' or el.text == 'AQUEIX':

            print(el.text)
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquest')


        if el.text == 'aquel' or el.text == 'Aquel' or el.text == 'AQUEL' or el.text == 'aquell' or el.text == 'Aquell' or el.text == 'AQUELL':

            print(el.text)
            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquell')

                     

        if el.text == 'aquests' or el.text == 'Aquests' or el.text == 'AQUESTS':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquest')

        if el.text == 'aquells' or el.text == 'Aquells' or el.text == 'AQUELLS' or el.text == 'aqueys' or el.text == 'Aqueys'  or el.text == 'AQUEYS' or el.text == 'aqueyls'  or el.text == 'Aqueyls'  or el.text == 'AQUEYLS' or el.text == 'aqueys'  or el.text == 'Aqueys'  or el.text == 'AQUEYS':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquell')

        if el.text == 'aquestas' or el.text == 'Aquestas' or el.text == 'AQUESTAS' or el.text == 'aqueixes' or el.text == 'Aqueixes' or el.text == 'AQUEIXES':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquest')
        
        if el.text == 'aqualas' or el.text == 'Aqualas' or el.text == 'AQUALAS' or el.text == 'aquelas' or el.text == 'Aquelas' or el.text == 'AQUELAS' or el.text == 'aqueles' or el.text == 'Aqueles' or el.text == 'AQUELES' or el.text == 'aquellas' or el.text == 'Aquellas' or el.text == 'AQUELLAS' or el.text == 'aquelles' or el.text == 'Aquelles' or el.text == 'AQUELLES':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquell')


# iterate all dirs
for dirpath, dirs, files in os.walk(ROOT):

    # iterate all files
    for file in files:
        if file.endswith(ext):
            # join the directory currently being walked (not ROOT) and the file name
            file_path = os.path.join(dirpath, file)

            # load root element from file
            root = et.parse(file_path).getroot()

            # change the matching elements in place
            xml_change(root)

            # init tree object from root
            tree = et.ElementTree(root)

            # save the modified tree to a new file. Important to specify encoding
            tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8', doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)

User response:

The code evaluates every condition for each element, even though at most one of them can be true at a time. One possible optimization is to chain them with if/elif, so that evaluation stops at the first match:

    if el.text == 'aquest' or el.text == 'Aquest' or el.text == 'AQUEST' or el.text == 'aqueix' or el.text == 'Aqueix' or el.text == 'AQUEIX':

        print(el.text)
        print('Current value is:', el.get('lemma'), el.get('xpos'))
        el.set('xpos', 'DD0MS0')
        el.set('lemma', 'aquest')
    elif el.text == 'aquel' or el.text == 'Aquel' or el.text == 'AQUEL' or el.text == 'aquell' or el.text == 'Aquell' or el.text == 'AQUELL':
        print(el.text)
        print('Current value is:', el.get('lemma'), el.get('xpos'))
        el.set('xpos', 'DD0MS0')
        el.set('lemma', 'aquell')
    # other elif branches here
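
As a further step in the same direction, the whole chain could be collapsed into a single dictionary lookup. This is only a sketch: the names `RULES`, `CORRECTIONS` and `lookup_correction` are made up for illustration, and the forms and tags are simply copied from the conditions in the question.

```python
# Hypothetical rule list: (lowercase surface forms, new lemma, new xpos),
# copied from the six conditions in the question.
RULES = [
    (("aquest", "aqueix"), "aquest", "DD0MS0"),
    (("aquel", "aquell"), "aquell", "DD0MS0"),
    (("aquests",), "aquest", "DD0MP0"),
    (("aquells", "aqueys", "aqueyls"), "aquell", "DD0MP0"),
    (("aquestas", "aqueixes"), "aquest", "DD0FP0"),
    (("aqualas", "aquelas", "aqueles", "aquellas", "aquelles"), "aquell", "DD0FP0"),
]

# Flatten the rules into one dict: lowercase form -> (lemma, xpos).
CORRECTIONS = {form: (lemma, xpos) for forms, lemma, xpos in RULES for form in forms}


def lookup_correction(text):
    """Return (lemma, xpos) for a token text, or None if no rule applies."""
    if text is None:
        return None
    # The question only lists lowercase, Capitalized and UPPERCASE spellings,
    # so any other casing (e.g. "aQUEST") is deliberately left alone.
    if text not in (text.lower(), text.capitalize(), text.upper()):
        return None
    return CORRECTIONS.get(text.lower())


print(lookup_correction("AQUEIX"))
print(lookup_correction("casa"))
```

Inside the loop this would reduce the body to one call: if `lookup_correction(el.text)` returns a pair, set `lemma` and `xpos` from it. Whether this makes a measurable difference depends on where the time actually goes, so it is worth timing before and after.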

Also, a plain XPath function such as starts-with() could be used instead of the regular expression:

//tok[following-sibling::tok[1][starts-with(@xpos, "N")]]
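
If the XPath query itself turns out to be the bottleneck, another option is to walk the tree once and compare each tok with the tok that follows it. The sketch below makes some assumptions: the helper name `toks_before_noun` and the `<s>` wrapper element are invented for the example, and it imports the standard library's `xml.etree.ElementTree` so that it runs without extra packages. The calls it uses (`fromstring`, `iter`, `get`) are also available in `lxml.etree`.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def toks_before_noun(root):
    """Yield each <tok> whose next <tok> sibling has an xpos starting with "N"."""
    for parent in root.iter():
        # Keep only the tok children so that "next" means the next tok sibling,
        # which mirrors following-sibling::tok[1] in the XPath above.
        toks = [child for child in parent if child.tag == "tok"]
        for current, following in zip(toks, toks[1:]):
            if (following.get("xpos") or "").startswith("N"):
                yield current


# Small sample taken from the question; the <s> wrapper is hypothetical.
sample = """<s>
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FP0">aqueixes</tok>
<tok lemma="casa" xpos="NCFP000">cases</tok>
</s>"""

root = ET.fromstring(sample)
matches = [tok.text for tok in toks_before_noun(root)]
print(matches)  # ['aqueixes']
```

This visits every element once and does the prefix test in Python. It may or may not beat the XPath version on real files, so measuring both on a few documents is the safest way to decide.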