A faster way of extracting XML data than nested loops?

Time:06-22

I have an XML file that has the below format:

<JMdict>
...
<entry>
        <ent_seq>2232410</ent_seq>
        <k_ele>
                <keb>筆おろし</keb>
        </k_ele>
        <k_ele>
                <keb>筆下ろし</keb>
        </k_ele>
        <k_ele>
                <keb>筆降ろし</keb>
                <ke_inf>&iK;</ke_inf>
        </k_ele>
        <r_ele>
                <reb>ふでおろし</reb>
        </r_ele>
        <sense>
                <pos>&n;</pos>
                <pos>&vs;</pos>
                <gloss>using a new brush for the first time</gloss>
        </sense>
        <sense>
                <pos>&n;</pos>
                <pos>&vs;</pos>
                <gloss>doing something for the first time</gloss>
        </sense>
        <sense>
                <pos>&n;</pos>
                <pos>&vs;</pos>
                <gloss>man losing his virginity (esp. to an older woman)</gloss>
        </sense>
</entry>
...
</JMdict>

Link to the whole XML file: http://ftp.edrdg.org/pub/Nihongo/JMdict_e.gz

This is basically an electronic Japanese/English dictionary. There are many entry tags. I'm trying to create a search function that will return the ent_seq number based on the text values in any of the keb, reb, and gloss tags.

I have the code below, which does what I need but is somewhat slow (438 ms). The seq number will then be used to look up data in another dataset, and since I plan to use this in a web app, I would like it to be faster. Is there a way?

from xml.etree import ElementTree as ET

tree = ET.parse("../../resources/JMdict_e.xml")
root = tree.getroot()

search_term = '筆おろし'
seq_tags = []

# root.iter() also yields the root itself, so this visits the <JMdict> element
for dictionary in root.iter('JMdict'):
    # child is an <entry>
    for child in dictionary:
        # grandchild is <ent_seq>, <k_ele>, <r_ele> or <sense>
        for grandchild in child:
            if grandchild.tag == 'ent_seq':
                ent_seq = grandchild.text

            # greatgrandchild is <keb>, <reb>, <gloss>, <pos>, ...
            for greatgrandchild in grandchild:
                if greatgrandchild.tag in ['keb', 'reb', 'gloss']:
                    if greatgrandchild.text == search_term:
                        seq_tags.append(ent_seq)

print(seq_tags)

Any help and tips would be most appreciated.

CodePudding user response:

You can use XPath to search on more than one element in the same expression (it could also be an AND/OR expression with more than two conditions):

from lxml import etree
import time

st = time.process_time()
parser = etree.XMLParser(compact=True, huge_tree=True, resolve_entities=False)
with open('/home/luis/tmp/JMdict_e', 'rb') as f:
    # Only the open() call is timed here; the file is actually read during parse
    t_open = time.process_time()
    print('read file:', t_open - st, 'seconds')

    tree = etree.parse(f, parser)
    t_parse = time.process_time()
    print('parse:', t_parse - t_open, 'seconds')

    # Union (|) of two queries: entries matching on keb OR on reb
    slist = tree.xpath('//entry[k_ele/keb = "筆おろし"]/ent_seq | //entry[r_ele/reb = "エヌきょう"]/ent_seq')
    #slist = tree.xpath('//entry[1]/k_ele/keb')
    t_xpath = time.process_time()
    print('xpath:', t_xpath - t_parse, 'seconds')

    #print(slist)
    for d in slist:
        print(d.text)

print('CPU Execution time:', time.process_time() - st, 'seconds')
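
Since the question mentions a web app, another option worth considering is to pay the parsing cost once at startup and answer each search from an in-memory dictionary. The following is only a minimal sketch using the standard library: `build_index`, `SEARCH_TAGS` and `SAMPLE` are names made up for this illustration, and the inline sample is a trimmed copy of the entry from the question (without the `&n;`-style entities, which the real file defines in its DTD). It assumes exact-match lookups, like the original loop.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Tags whose text should be searchable (taken from the question)
SEARCH_TAGS = ("keb", "reb", "gloss")

# Trimmed stand-in for the real file; in practice use ET.parse(path).getroot()
SAMPLE = """<JMdict>
<entry>
    <ent_seq>2232410</ent_seq>
    <k_ele><keb>筆おろし</keb></k_ele>
    <k_ele><keb>筆下ろし</keb></k_ele>
    <r_ele><reb>ふでおろし</reb></r_ele>
    <sense><gloss>using a new brush for the first time</gloss></sense>
    <sense><gloss>doing something for the first time</gloss></sense>
</entry>
</JMdict>"""


def build_index(root):
    """Map each keb/reb/gloss text to the ent_seq values of entries containing it."""
    index = defaultdict(list)
    for entry in root.iter("entry"):
        ent_seq = entry.findtext("ent_seq")
        for tag in SEARCH_TAGS:
            for element in entry.iter(tag):
                text = element.text
                # Skip empty elements and avoid listing the same entry twice
                if text and ent_seq not in index[text]:
                    index[text].append(ent_seq)
    return index


root = ET.fromstring(SAMPLE)
index = build_index(root)  # done once, e.g. when the web app starts

# Each search is then a single dictionary lookup
print(index.get("筆おろし", []))
print(index.get("no such word", []))
```

Building the index still walks the whole tree once, so it does not make the first pass faster; the gain is that every later search avoids re-scanning the document. Using `index.get(term, [])` for lookups keeps missing terms from being added to the `defaultdict`.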