Extracting data from an XLIFF file and creating a data frame

Time:08-28

I have an XLIFF file with the following structure.

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.2" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 http://docs.oasis-open.org/xliff/v1.2/os/xliff-core-1.2-strict.xsd">
    <file original="" datatype="plaintext" xml:space="preserve" source-language="en" target-language="es-419">
        <header>
            <tool tool-id="tool" tool-name="tool" />
        </header>
        <body>
            <trans-unit id="tool-123456789-1" resname="123456::title">
                <source>Name 1 </source>
                <target state="final">Name 1 target language </target>
            </trans-unit>
            <trans-unit id="tool-123456780-1" resname="123456::summary">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
            </trans-unit>
            <trans-unit id="tool-123456790-1" resname="123456::relevant">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
            </trans-unit>
            <trans-unit id="tool-123456791-1" resname="123456::description">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
            </trans-unit>
            <trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
                <source>Lorem Ipsum </source>
                <target state="final">Lorem Ipsum local</target>
            </trans-unit>
            <trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
            </trans-unit>
            <trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
                <source>Lorem Ipsum </source>
                <target state="final">Lorem Ipsum local</target>
            </trans-unit>
            <trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
                <source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
                <target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
            </trans-unit>
                        
        </body>
    </file>
</xliff>


I want to extract the content of the trans-unit, source, and target tags to build a data frame with the following structure:

TAG             SOURCE       TARGET
Title           Source text  Target text
Description     Source text  Target text
Summary         Source text  Target text
Relevant        Source text  Target text
From area code  Source text  Target text

I tried building a data frame with all tags and text using the following code, so that I could then filter the rows that contain the data I need.

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse('583197.xliff')
root = tree.getroot()

all_items = []

# Walk every element in the document and record its attributes and text.
for elem in tree.iter():
    # attrib and text are plain attributes of an Element, not methods
    attri = elem.attrib
    text = elem.text
    all_items.append([attri, text])

xmlToDf = pd.DataFrame(all_items, columns=['Attri', 'Text'])

print(xmlToDf.to_string(index=False))

How can I extract specific tags, attributes, and text from an XLIFF file so I can build a data frame?

CodePudding user response:

Try:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("your_file.xml")
root = tree.getroot()

data = []
for tu in root.findall(".//{urn:oasis:names:tc:xliff:document:1.2}trans-unit"):
    source = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}source")
    target = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}target")
    data.append(
        {
            "TAG": tu.attrib["resname"].split("::")[-1],
            "SOURCE": source.text,
            "TARGET": target.text,
        }
    )

df = pd.DataFrame(data)
print(df)

Prints:

              TAG                                                                      SOURCE                                                                                     TARGET
0           title                                                                     Name 1                                                                     Name 1 target language 
1         summary  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
2        relevant  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
3     description  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
4  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
5         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
6  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
7         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
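
A variant of the same approach, as a minimal sketch: it passes a namespace map instead of repeating the full URI, tolerates a trans-unit with no target, strips the stray trailing spaces seen in the sample, and formats the TAG column like the table in the question ("From area code" rather than from_area_code). The names NS and xliff_to_frame are made up for this example, and dropping repeated rows is an assumption based on the desired table listing each tag once; remove that call if every trans-unit should be kept.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Namespace map so the XLIFF 1.2 URI is written only once (hypothetical name).
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def xliff_to_frame(path_or_file):
    """Build a TAG/SOURCE/TARGET frame from an XLIFF 1.2 file (hypothetical helper)."""
    root = ET.parse(path_or_file).getroot()
    rows = []
    for tu in root.iterfind(".//x:trans-unit", NS):
        # Last "::"-separated part of resname, e.g. "from_area_code".
        key = tu.get("resname", "").split("::")[-1]
        # findtext returns the default when the child element is missing.
        source = tu.findtext("x:source", default="", namespaces=NS) or ""
        target = tu.findtext("x:target", default="", namespaces=NS) or ""
        rows.append(
            {
                "TAG": key.replace("_", " ").capitalize(),
                "SOURCE": source.strip(),
                "TARGET": target.strip(),
            }
        )
    frame = pd.DataFrame(rows, columns=["TAG", "SOURCE", "TARGET"])
    # Assumption: identical rows from repeated trans-units are not wanted.
    return frame.drop_duplicates(ignore_index=True)


# Small in-memory sample shaped like the file in the question.
sample = b"""<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file original="" datatype="plaintext" source-language="en" target-language="es-419">
    <body>
      <trans-unit id="tool-1" resname="123456::title">
        <source>Name 1 </source>
        <target state="final">Name 1 target language </target>
      </trans-unit>
      <trans-unit id="tool-2" resname="123456::654321::from_area_code">
        <source>Lorem Ipsum </source>
        <target state="final">Lorem Ipsum local</target>
      </trans-unit>
      <trans-unit id="tool-2" resname="123456::654321::from_area_code">
        <source>Lorem Ipsum </source>
        <target state="final">Lorem Ipsum local</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

df = xliff_to_frame(io.BytesIO(sample))
print(df.to_string(index=False))
```

ET.parse accepts either a file name or a file object, so xliff_to_frame("your_file.xml") should work the same way on a file on disk.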