Hello, this is my first post, so if something is not clear, please say so! I have an XML file from which I have to extract all the names found between the square brackets of each transc tag (the one inside newsFrom) and then put each name in a new tag called person directly after it. If there are two names, I need two separate person tags with their respective names, as is done for newsTopic.
This is the output I need:
<newsFrom>
<from date="15/01/1649" dateUnsure="y">London</from>
<transc>Questo Parlamento generale Farfax [Thomas Fairfax, 3rd Lord Fairfax of Cameron] et suo consiglio dio et ordinato di pocessare il re [Charles I, King of England]</transc>
<person>Thomas Fairfax, 3rd Lord Fairfax of Cameron</person>
<person>Charles I, King of England</person>
<newsTopic>Military</newsTopic>
<wordCount>103</wordCount>
<position>1</position>
</newsFrom>
This is the XML file:
<news xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="news.xsd">
<xmlCorpusDate>2022-10-16</xmlCorpusDate>
<xmlCorpusTime>15:17:52</xmlCorpusTime>
<newsDocument>
<docid>50992</docid>
<repository>Archivio di Stato di Firenze</repository>
<collection>Mediceo del Principato</collection>
<volume>4202</volume>
<newsHeader>
<hub>London</hub>
<date>15/01/1649</date>
<transc>Di Londra 15 gennaio 1648 ab Incarnatione</transc>
<newsFrom>
<from date="15/01/1649" dateUnsure="y">London</from>
<transc>Questo Parlamento generale Farfax [Thomas Fairfax, 3rd Lord Fairfax of Cameron] et suo consiglio dio et ordinato di pocessare il re [Charles I, King of England]</transc>
<newsTopic>Military</newsTopic>
<wordCount>103</wordCount>
<position>1</position>
</newsFrom>
<newsFrom>
<from date="15/01/1649" dateUnsure="y">Manchester</from>
<transc>Ieri è giunto Rossini [Cardinal Rossini] et suo figlio [Gianmarco Rossini]</transc>
<newsTopic>Politics</newsTopic>
<wordCount>53</wordCount>
<position>2</position>
</newsFrom>
<writtenPagesNo>5</writtenPagesNo>
</newsHeader>
</newsDocument>
<newsDocument>
<docid>50492</docid>
<repository>Archivio di Stato di Firenze</repository>
<collection>Mediceo del Principato</collection>
<volume>4202</volume>
<newsHeader>
<hub>London</hub>
<date>21/01/1649</date>
<transc>Di Londra 21 gennaio 1648 ab Incarnatione</transc>
<newsFrom>
<from date="21/01/1649" dateUnsure="y">London</from>
<transc>Il consiglio di guerra con la Camera [English Parliament]</transc>
<newsTopic>Government</newsTopic>
<newsTopic>Politics</newsTopic>
<wordCount>78</wordCount>
<position>1</position>
</newsFrom>
<newsFrom>
<from date="21/01/1649" dateUnsure="y">Manchester</from>
<transc>Si è data notizia [Marco Cioni] di cose di poco conto</transc>
<newsTopic>Politics</newsTopic>
<wordCount>144</wordCount>
<position>2</position>
</newsFrom>
<writtenPagesNo>5</writtenPagesNo>
</newsHeader>
</newsDocument>
</news>
Extracting the names was relatively easy. I wrote the following code in Python:
import xml.etree.ElementTree as ET
import re

tree = ET.parse('1649.xml')
root = tree.getroot()
for document in root.findall("newsDocument"):
    # note: find() returns only the first matching transc in each document
    names = document.find("./newsHeader/newsFrom/transc").text
    # capture everything between "[" and "]"
    people = re.findall(r"\[(.*?)\]", names)
The problem now is creating the new tags, assigning them the names extracted from the text, and making sure each name ends up next to the transc it came from. I've tried different approaches and looked at the library documentation, but the best I can manage is a messy list at the top of the file. Thanks to anyone who can help me!
CodePudding user response:
In this case, it's easier to use lxml rather than ElementTree, because of lxml's better XPath support and its addnext() method for inserting sibling elements.
So try this:
from lxml import etree
import re

tree = etree.parse('1649.xml')
root = tree.getroot()

# find all <transc> elements
trs = root.xpath(".//transc")
for t in trs:
    # use a regex to find the text between "[" and "]"
    persons = re.findall(r'\[([^\]]+)\]', t.text or '')
    # drop duplicate names while keeping their original order
    unique = list(dict.fromkeys(persons))
    # addnext() inserts directly after <transc>, so go through the names
    # in reverse to end up with them in document order
    for person in reversed(unique):
        # build the element and set its text (safe for names containing & or <)
        np = etree.Element("person")
        np.text = person
        # add the new element right after <transc>
        t.addnext(np)

# pretty print
etree.indent(root, space='  ')
print(etree.tostring(root, encoding='unicode'))
The output should match your expected output, with one person element after each transc for every bracketed name.
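If installing lxml is not an option, the same idea can probably be sketched with the standard library's ElementTree as well. ElementTree elements have no link to their parent and no addnext(), so this sketch walks over each parent and calls insert() at the position right after transc. The SAMPLE string and the add_person_tags helper are made up for illustration (they are not part of the question or of any library), and ET.indent() assumes Python 3.9 or later.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical trimmed-down sample; the real file would be read with ET.parse().
SAMPLE = """<news>
  <newsFrom>
    <transc>Farfax [Thomas Fairfax, 3rd Lord Fairfax of Cameron] et il re [Charles I, King of England]</transc>
    <newsTopic>Military</newsTopic>
  </newsFrom>
</news>"""


def add_person_tags(root):
    """Insert one <person> per bracketed name right after each <transc>."""
    # Take a snapshot of all elements first, because the tree is modified
    # while looping. Each element is treated as a potential parent.
    for parent in list(root.iter()):
        for transc in parent.findall("transc"):
            names = re.findall(r"\[([^\]]+)\]", transc.text or "")
            # Index of the slot immediately after this <transc>
            pos = list(parent).index(transc) + 1
            for offset, name in enumerate(names):
                person = ET.Element("person")
                person.text = name
                parent.insert(pos + offset, person)
    return root


root = add_person_tags(ET.fromstring(SAMPLE))
ET.indent(root, space="  ")  # requires Python 3.9+
print(ET.tostring(root, encoding="unicode"))
```

Because insert() takes an explicit index, the names can be added in their original order here, without the reversal that addnext() needs.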