Concatenate a line that starts with `t with the previous line in PowerShell

Time:12-17

How do I go about removing the end of line when going through a TXT file line by line? What I want to do is produce a simple TSV but I want lines starting with "ES-ES " to be a column after the normal text.

I'm loading the XML like this to get clean text from the elements:

[xml]$xml = Get-Content -Path PATH/TO/XML.xml
$xml.tmx.body.tu.tuv > C:\Users\XXX\test.txt

Then I make some changes to it (skip the 3 header lines, trim surrounding whitespace, and replace the "ES-ES " prefix with a tab) like this:

$outfile = "C:\Users\XXX\test.txt"
$lines=(Get-Content $outfile) |select -Skip 3 | foreach{ $_.Trim() -replace "ES-ES ","`t"}
$lines > $outfile

But I can't seem to remove the newline at the end of the previous line (maybe because I'm going through the file line by line?). How would I concatenate the lines that now start with "`t" with the previous line?

Thanks.

EDIT:

Example of the XMLs that I have (it's a simple translation memory XML - TMX).

<?xml version="1.0" encoding="UTF-16LE"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
    <header o-tmf="TW4Win 2.0 Format" creationtool="XXX" creationtoolversion="3.XX" segtype="sentence" datatype="PlainText" adminlang="EN-US" srclang="EN-GB">
    </header>
    <body>
        <tu creationdate="20XX" creationid="XXXXX">
            <prop type="Year">20XX</prop>
            <prop type="Database">General</prop>
            <tuv xml:lang="EN-GB">
                <seg>Segment with original text in source language (English).</seg>
            </tuv>
            <tuv xml:lang="ES-ES" changedate="20XXXXXXXXXXX">
                <seg>Segment with translated text in target language (Spanish).</seg>
            </tuv>
        </tu>
        <tu creationdate="20XX" creationid="XXXXX">
            <prop type="Year">20XX</prop>
            <prop type="Database">General</prop>
            <tuv xml:lang="EN-GB">
                <seg>Another segment with original text in source language (English).</seg>
            </tuv>
            <tuv xml:lang="ES-ES" changedate="20XXXXXXXXXXX">
                <seg>Another segment with translated text in target language (Spanish).</seg>
            </tuv>
        </tu>
        <tu creationdate="20XX" creationid="XXXXX">
            <prop type="Year">20XX</prop>
            <prop type="Database">General</prop>
            <tuv xml:lang="EN-GB">
                <seg>Yet another one in English</seg>
            </tuv>
            <tuv xml:lang="ES-ES" changedate="20XXXXXXXXXXX">
                <seg>Yet another one in Spanish</seg>
            </tuv>
        </tu>

And the output I want is:

Segment with original text in source language (English).    Segment with translated text in target language (Spanish).
Another segment with original text in source language (English).    Another segment with translated text in target language (Spanish).
Yet another one in English  Yet another one in Spanish

I'm sure this could be done better in python or something, but I don't have permissions or access to any coding tools except for powershell. :\

CodePudding user response:

PowerShell natively parses XML, so there's no need to use regex at all here:

# read and parse xml document
$tmxDocument = [xml](Get-Content path\to\file.tmx)

# locate all the <tu> elements
$translationUnits = $tmxDocument.SelectNodes('//tu')

# create a new object for each unit, with the English and Spanish
# sentence strings as properties, then export to a TSV file
$translationUnits |ForEach-Object {
    [pscustomobject]@{
        English = $_.tuv |Where-Object lang -eq EN-GB |ForEach-Object seg
        Spanish = $_.tuv |Where-Object lang -eq ES-ES |ForEach-Object seg
    }
} |Export-Csv path\to\output.tsv -Delimiter "`t" -NoTypeInformation
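
To try the idea without touching any files, here is a minimal sketch that runs the same pipeline against an inline sample. The `$sample`, `$doc`, `$rows` and `$tsv` names are made up for illustration, and `ConvertTo-Csv` stands in for `Export-Csv` so the result stays in memory:

```powershell
# Hypothetical inline sample standing in for the real TMX file.
$sample = @'
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Yet another one in English</seg></tuv>
      <tuv xml:lang="ES-ES"><seg>Yet another one in Spanish</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Hello</seg></tuv>
      <tuv xml:lang="ES-ES"><seg>Hola</seg></tuv>
    </tu>
  </body>
</tmx>
'@

# Parse the string as XML and build one object per <tu> element.
$doc = [xml]$sample
$rows = $doc.SelectNodes('//tu') | ForEach-Object {
    [pscustomobject]@{
        English = $_.tuv | Where-Object lang -eq 'EN-GB' | ForEach-Object seg
        Spanish = $_.tuv | Where-Object lang -eq 'ES-ES' | ForEach-Object seg
    }
}

# Convert to tab-separated text in memory instead of writing a file.
$tsv = $rows | ConvertTo-Csv -Delimiter "`t" -NoTypeInformation
$tsv
```

Note that the result includes a header row (English, Spanish) and, by default, wraps every field in double quotes.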

CodePudding user response:

There is no need for two processing steps:

Given that you're already using PowerShell's adaptation of the XML DOM (that is, using dot notation such as $xml.tmx.body.tu.tuv to drill down into the document), you can refine this approach to access the .lang (the xml:lang attribute value) and .seg properties (the value of the <seg> element):

# Parse the XML file.
# Note: A more robust version is:
#   ($xml = [xml]::new()).Load((Convert-Path XML.xml))
[xml] $xml = Get-Content -Raw XML.xml

# An ordered helper hashtable for collecting results before outputting them.
$langs = [ordered] @{}

$xml.tmx.body.tu.tuv | 
  ForEach-Object {
    # Save the <seg> text, keyed by its language
    $langs[$_.lang] = $_.seg
    # If the element at hand is Spanish:
    if ($_.lang -eq 'ES-ES') {
      # Output the collected texts separated with a tab char.
      $langs.Values -join "`t"
      $langs.Clear()
    }
  } | 
  Set-Content -Encoding utf8 test.txt
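
As an aside, the literal question (joining a line that starts with a tab onto the previous line) can also be answered for text that has already been written out, as long as it is treated as one string rather than line by line. A minimal sketch, assuming a made-up `$text` value in place of what `Get-Content -Raw` would return for the intermediate file:

```powershell
# Hypothetical stand-in for: $text = Get-Content -Raw C:\Users\XXX\test.txt
$text = "English one`r`n`tSpanish one`r`nEnglish two`r`n`tSpanish two"

# Remove any newline (CRLF or LF) that is immediately followed by a tab,
# so each tab-prefixed line is appended to the line before it.
$joined = $text -replace '\r?\n(?=\t)', ''
$joined
```

This only works on the whole text at once; with the default line-by-line `Get-Content`, the newlines are already gone by the time each line reaches the pipeline.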

Note: If you want a header row in your file...

  • and don't mind having the field values enclosed in "..." (only PowerShell (Core) 7 lets you avoid that), you can replace $langs.Values -join "`t" with [pscustomobject] $langs, and pipe to Export-Csv -Delimiter "`t" instead of Set-Content.

  • otherwise, you can simply output the header line before processing the data, by adding something like the following parameter to the ForEach-Object call:

    • -Begin { "EN-GB`tES-ES" }
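
Put together, the second variant (header emitted via -Begin) might look like the following sketch. It uses a made-up inline `$sample` and collects the lines in a hypothetical `$lines` variable rather than writing a file, so it can be run as-is:

```powershell
# Hypothetical inline sample standing in for the real TMX file.
$sample = @'
<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Yet another one in English</seg></tuv>
      <tuv xml:lang="ES-ES"><seg>Yet another one in Spanish</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Hello</seg></tuv>
      <tuv xml:lang="ES-ES"><seg>Hola</seg></tuv>
    </tu>
  </body>
</tmx>
'@

[xml] $xml = $sample
$langs = [ordered] @{}

$lines = $xml.tmx.body.tu.tuv |
  ForEach-Object -Begin { "EN-GB`tES-ES" } -Process {
    # Collect the <seg> text, keyed by language.
    $langs[$_.lang] = $_.seg
    # The Spanish element completes a pair: emit one tab-separated line.
    if ($_.lang -eq 'ES-ES') {
      $langs.Values -join "`t"
      $langs.Clear()
    }
  }

$lines
```

To write the result to a file, pipe to Set-Content -Encoding utf8 as in the original command instead of assigning to `$lines`.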