How to convert a txt.knowtator.xml file to .ann?

I have an annotated dataset in txt.knowtator.xml format:

<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <annotator id="01">Unknown</annotator>
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
        <creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
    <annotation>
        <mention id="EHOST_Instance_94" />
        <annotator id="01">Unknown</annotator>
        <span start="600" end="612" />
        <spannedText>Tegretol</spannedText>
        <creationDate>Wed Mar 11 09:55:11 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_94">
        <mentionClass id="Treatment">Tegretol</mentionClass>
    </classMention>
</annotations>

I need to get it into standoff BRAT format (.ann), such as:

T1    Treatment 127 137    Omeprazole
T2    Treatment 600 612    Tegretol

Is there an existing tool for converting or parsing this?

CodePudding user response:

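You can parse the file with xml.etree.ElementTree from the Python standard library. The snippet below converts a single annotation: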

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <annotator id="01">Unknown</annotator>
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
        <creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
</annotations>'''

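root = ET.fromstring(xml)

# Look up the span and the covered text of the first annotation.
span = root.find(".//span")
text = root.find(".//spannedText").text
# Read the label from <mentionClass> instead of hardcoding it.
label = root.find(".//mentionClass").attrib["id"]
print(f'T1    {label} {span.attrib["start"]} {span.attrib["end"]} {text}')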

Output:

T1    Treatment 127 237 Omeprazole
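
The snippet above only handles the first annotation. As a rough sketch, the same idea extends to the whole file: collect the class of each mention from the classMention elements, then emit one line per annotation. The names knowtator_to_brat and sample are made up for illustration. The sketch assumes every annotation has a single contiguous span and that each mention id has a matching classMention. Fields are separated by tabs, which is what the BRAT standoff format expects.

```python
import xml.etree.ElementTree as ET


def knowtator_to_brat(xml_text):
    """Convert Knowtator/eHOST XML text to BRAT standoff lines (a sketch)."""
    root = ET.fromstring(xml_text)

    # Map each mention id to its class label,
    # e.g. "EHOST_Instance_93" -> "Treatment".
    labels = {}
    for class_mention in root.iter("classMention"):
        mention_class = class_mention.find("mentionClass")
        if mention_class is not None:
            labels[class_mention.get("id")] = mention_class.get("id")

    lines = []
    for annotation in root.iter("annotation"):
        mention = annotation.find("mention")
        span = annotation.find("span")
        if mention is None or span is None:
            continue  # skip annotations that lack the fields we need
        label = labels.get(mention.get("id"))
        if label is None:
            continue  # no matching classMention, so no label to emit
        text = annotation.findtext("spannedText", default="")
        index = len(lines) + 1
        # BRAT text-bound annotation: ID <TAB> Type start end <TAB> text
        lines.append(
            f"T{index}\t{label} {span.get('start')} {span.get('end')}\t{text}"
        )
    return "\n".join(lines)


# Sample data taken from the question (offsets copied as given).
sample = """<annotations textSource="file.txt">
    <annotation>
        <mention id="EHOST_Instance_93" />
        <span start="127" end="237" />
        <spannedText>Omeprazole</spannedText>
    </annotation>
    <classMention id="EHOST_Instance_93">
        <mentionClass id="Treatment">Omeprazole</mentionClass>
    </classMention>
    <annotation>
        <mention id="EHOST_Instance_94" />
        <span start="600" end="612" />
        <spannedText>Tegretol</spannedText>
    </annotation>
    <classMention id="EHOST_Instance_94">
        <mentionClass id="Treatment">Tegretol</mentionClass>
    </classMention>
</annotations>"""

print(knowtator_to_brat(sample))
```

To convert a real file, you could read its contents, pass them to the function, and write the returned string to a file with the .ann extension next to the source text. The offsets are copied straight from the XML, so if a span in the source is wrong (for example end="237" for a 10-character word starting at 127), the .ann line will carry the same error.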