I have an XML file that has the below format:
<JMdict>
...
<entry>
<ent_seq>2232410</ent_seq>
<k_ele>
<keb>筆おろし</keb>
</k_ele>
<k_ele>
<keb>筆下ろし</keb>
</k_ele>
<k_ele>
<keb>筆降ろし</keb>
<ke_inf>&iK;</ke_inf>
</k_ele>
<r_ele>
<reb>ふでおろし</reb>
</r_ele>
<sense>
<pos>&n;</pos>
<pos>&vs;</pos>
<gloss>using a new brush for the first time</gloss>
</sense>
<sense>
<pos>&n;</pos>
<pos>&vs;</pos>
<gloss>doing something for the first time</gloss>
</sense>
<sense>
<pos>&n;</pos>
<pos>&vs;</pos>
<gloss>man losing his virginity (esp. to an older woman)</gloss>
</sense>
</entry>
...
</JMdict>
Link to the whole XML file: http://ftp.edrdg.org/pub/Nihongo/JMdict_e.gz
This is basically an electronic Japanese/English dictionary. There are many entry tags. I'm trying to create a search function that will return the ent_seq number based on the text values in any of the keb, reb, and gloss tags.
I have the code below, which does what I need but is somewhat slow (438 ms). The seq number will then be used to look up data in another dataset, and since I plan to use this in a web app, I would like it to be faster. Is there a way?
from xml.etree import ElementTree as ET

tree = ET.parse("../../resources/JMdict_e.xml")
root = tree.getroot()
search_term = '筆おろし'
seq_tags = []
for dictionary in root.iter('JMdict'):
    for child in dictionary:                  # each <entry>
        for grandchild in child:              # ent_seq, k_ele, r_ele, sense
            if grandchild.tag == 'ent_seq':
                ent_seq = grandchild.text
            for greatgrandchild in grandchild:
                if greatgrandchild.tag in ['keb', 'reb', 'gloss']:
                    if greatgrandchild.text == search_term:
                        seq_tags.append(ent_seq)
print(seq_tags)
Any help and tips would be most appreciated.
CodePudding user response:
You can use XPath (here via lxml) to search on more than one element in the same expression. The expression below is a union of two queries, but it could just as well be an AND/OR predicate with more than two conditions.
from lxml import etree
import time

st = time.process_time()
parser = etree.XMLParser(compact=True, huge_tree=True, resolve_entities=False)
with open('/home/luis/tmp/JMdict_e', 'rb') as f:
    t_open = time.process_time()
    print('read file:', t_open - st, 'seconds')
    tree = etree.parse(f, parser)
t_parse = time.process_time()
print('parse:', t_parse - t_open, 'seconds')

# Union (|) of two queries: entries whose keb matches OR whose reb matches
slist = tree.xpath('//entry[k_ele/keb = "筆おろし"]/ent_seq | //entry[r_ele/reb = "エヌきょう"]/ent_seq')
#slist = tree.xpath('//entry[1]/k_ele/keb')
t_xpath = time.process_time()
print('xpath:', t_xpath - t_parse, 'seconds')

#print(slist)
for d in slist:
    print(d.text)
et = time.process_time()
print('CPU Execution time:', et - st, 'seconds')
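
Since the question is about repeated lookups in a web app, another option (a different technique from the XPath query above, and not benchmarked here) is to walk the tree once at startup and build a dictionary from every keb/reb/gloss text to its ent_seq values. Each search then becomes a plain dictionary lookup. The sketch below uses only the standard library. The `build_index` and `search` names, the inline `SAMPLE` data, and the second entry's ent_seq number are made up for illustration, and the sample leaves out the entity references so it parses without the DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny inline sample in the JMdict shape. Illustrative only: the second
# entry's ent_seq is invented, and entity references are omitted.
SAMPLE = """<JMdict>
<entry>
<ent_seq>2232410</ent_seq>
<k_ele><keb>筆おろし</keb></k_ele>
<k_ele><keb>筆下ろし</keb></k_ele>
<r_ele><reb>ふでおろし</reb></r_ele>
<sense><gloss>using a new brush for the first time</gloss></sense>
<sense><gloss>doing something for the first time</gloss></sense>
</entry>
<entry>
<ent_seq>9999999</ent_seq>
<k_ele><keb>筆</keb></k_ele>
<r_ele><reb>ふで</reb></r_ele>
<sense><gloss>writing brush</gloss></sense>
</entry>
</JMdict>"""


def build_index(root):
    """Map each keb/reb/gloss text to the ent_seq values of entries containing it."""
    index = defaultdict(list)
    for entry in root.iter("entry"):
        seq = entry.findtext("ent_seq")
        for path in ("k_ele/keb", "r_ele/reb", "sense/gloss"):
            for node in entry.iterfind(path):
                # Skip empty elements and avoid listing the same entry twice
                if node.text and seq not in index[node.text]:
                    index[node.text].append(seq)
    return index


# Parse and index once (for example at application startup)
root = ET.fromstring(SAMPLE)
index = build_index(root)


def search(term):
    """Exact-match lookup, same semantics as the loop in the question."""
    return index.get(term, [])


print(search("筆おろし"))
print(search("writing brush"))
```

For the real file, `ET.fromstring(SAMPLE)` would be replaced by `ET.parse(path).getroot()`. Like the code in the question, this matches whole strings only; substring or fuzzy search would need a different structure.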