I have an annotated dataset in txt.knowtator.xml format:
<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
<annotation>
<mention id="EHOST_Instance_93" />
<annotator id="01">Unknown</annotator>
<span start="127" end="237" />
<spannedText>Omeprazole</spannedText>
<creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
</annotation>
<classMention id="EHOST_Instance_93">
<mentionClass id="Treatment">Omeprazole</mentionClass>
</classMention>
<annotation>
<mention id="EHOST_Instance_94" />
<annotator id="01">Unknown</annotator>
<span start="600" end="612" />
<spannedText>Tegretol</spannedText>
<creationDate>Wed Mar 11 09:55:11 GMT 2010</creationDate>
</annotation>
<classMention id="EHOST_Instance_94">
<mentionClass id="Treatment">Tegretol</mentionClass>
</classMention>
</annotations>
I need to convert it to BRAT standoff format (.ann), such as:
T1 Treatment 127 137 Omeprazole
T2 Treatment 600 612 Tegretol
Is there any available tool for converting/parsing?
CodePudding user response:
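I don't know of a ready-made converter, but the file is plain XML, so you can parse it with Python's standard-library xml.etree.ElementTree: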
import xml.etree.ElementTree as ET
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<annotations textSource="file.txt">
<annotation>
<mention id="EHOST_Instance_93" />
<annotator id="01">Unknown</annotator>
<span start="127" end="237" />
<spannedText>Omeprazole</spannedText>
<creationDate>Wed Mar 11 09:52:01 GMT 2010</creationDate>
</annotation>
<classMention id="EHOST_Instance_93">
<mentionClass id="Treatment">Omeprazole</mentionClass>
</classMention>
</annotations>'''
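root = ET.fromstring(xml)
# Map each mention id to its class label (e.g. "Treatment")
classes = {cm.attrib["id"]: cm.find("mentionClass").attrib["id"]
           for cm in root.iter("classMention")}
# Print one BRAT-style line per <annotation>, numbered T1, T2, ...
for n, ann in enumerate(root.iter("annotation"), start=1):
    span = ann.find("span").attrib
    label = classes[ann.find("mention").attrib["id"]]
    print(f'T{n} {label} {span["start"]} {span["end"]} {ann.find("spannedText").text}')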
output
T1 Treatment 127 237 Omeprazole
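If you need to convert whole files, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the function name knowtator_to_brat is made up for illustration, each annotation is assumed to have a single span element, and a mention without a matching classMention gets the placeholder label "Unknown". BRAT's standoff format separates the ID, the "type start end" group, and the covered text with tab characters, so the sketch emits tabs rather than spaces.

```python
import xml.etree.ElementTree as ET


def knowtator_to_brat(xml_text):
    """Return BRAT standoff lines for the annotations in a Knowtator XML string.

    Hypothetical helper: assumes one <span> per <annotation>.
    """
    root = ET.fromstring(xml_text)

    # mention id -> class label, read from <mentionClass id="...">
    labels = {}
    for class_mention in root.iter("classMention"):
        mention_class = class_mention.find("mentionClass")
        if mention_class is not None:
            labels[class_mention.get("id")] = mention_class.get("id")

    lines = []
    for ann in root.iter("annotation"):
        mention = ann.find("mention")
        span = ann.find("span")
        spanned = ann.find("spannedText")
        if mention is None or span is None:
            continue  # skip incomplete annotations
        # "Unknown" is an assumed fallback label, not part of either format
        label = labels.get(mention.get("id"), "Unknown")
        text = spanned.text if spanned is not None and spanned.text else ""
        # BRAT text-bound annotation: ID <TAB> type start end <TAB> text
        lines.append(
            f"T{len(lines) + 1}\t{label} {span.get('start')} {span.get('end')}\t{text}"
        )
    return lines
```

To produce an .ann file, read the XML file into a string, pass it to the function, and write the returned lines joined with newlines to a file with the same base name as the text file.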