Linux/Bash: Loop through 10k lines in a file and replace strings using sed

I have two files. The first file contains two columns of data and is roughly 10,000 lines long. The other is a huge .xml file (GB+) where certain information needs to be replaced.

Currently, I am using a while loop to read in both columns of data and then calling sed once per row to replace the value in the XML file. Unfortunately, this takes 5 hours to run, and I need a more efficient process.

File 1:

123456789    987654321
234567891    876543219
345678912    765432198
456789123    654321987
...

File 2:

...
<Tag>123456789</Tag>
...

Current Script:

#! /bin/bash

count=$(wc -l < "$1")

while read -r old new; do
        ((count-=1))
        echo -e "Old value: ${old}  -  New value: ${new}  -  Values Left: ${count}"

        sed -i "s/${old}/${new}/" "$2"
done < "$1"

./script.sh file1.txt file2.xml

Answer:

As usual, sed or anything else involving regular expressions is the wrong approach to take with XML documents. Here's a Perl script that uses the XML::Twig module (install it through your OS's package manager or favorite CPAN client) to handle huge documents efficiently:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Takes two arguments: a file whose lines pair each tag value to change
# with its replacement, and the input XML document. Outputs to standard
# output; redirect to a file if needed.

my ($tag_file, $xml_file) = @ARGV;
open my $tags, "<", $tag_file or die "Unable to open $tag_file: $!\n";
my %replacements;
while (<$tags>) {
    chomp;
    my ($from, $to) = split;
    $replacements{$from} = $to;
}
close $tags;

my $xml = XML::Twig->new(
    twig_handlers => {
        Tag => sub {
            my $id = $_->text;
            if (exists $replacements{$id}) {
                $_->set_text($replacements{$id});
            }
            $_->flush;
        }
    },
    pretty_print => 'indented' # Use 'none' instead to save space in big files
                               # that don't need to be human-readable.
    );
$xml->parsefile($xml_file);
$xml->flush;

Example usage:

$ cat input.xml
<?xml version="1.0"?>
<Document>
  <Tag>123456789</Tag>
  <Tag>234567891</Tag>
  <Tag>345678912</Tag>
  <OtherTag>456789123</OtherTag>
</Document>
$ perl demo.pl input.txt input.xml
<?xml version="1.0"?>
<Document>
  <Tag>987654321</Tag>
  <Tag>876543219</Tag>
  <Tag>765432198</Tag>
  <OtherTag>456789123</OtherTag>
</Document>

Answer:

With GNU sed, a regex, bash, and process substitution:

sed -f <(sed -E 's#([^ \t]+)[ \t]+([^ \t]+)#s|<Tag>\1</Tag>|<Tag>\2</Tag>|g#' file1.txt) file2.xml

I switched from the common s/// to s### and s||| so that the slash in </Tag> does not need escaping.
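
To make the one-liner easier to follow, the generated sed script can be written to a file and inspected before it is applied. The following is only a sketch: `workdir`, `generated.sed`, and the sample rows are illustrative names and data, and it assumes the bracket expressions carry `+` quantifiers. Using `-f` with a regular file also avoids the need for process substitution.

```shell
#!/usr/bin/env bash
# Sketch: build the sed script once, inspect it, then apply it in a single pass.
# workdir and the sample rows are illustrative assumptions.
workdir=$(mktemp -d)
printf '%s\n' '123456789    987654321' '234567891    876543219' > "$workdir/file1.txt"
printf '%s\n' '<Tag>123456789</Tag>' '<Tag>999999999</Tag>' > "$workdir/file2.xml"

# Turn each "old new" row into one s|<Tag>old</Tag>|<Tag>new</Tag>|g command.
sed -E 's#([^ \t]+)[ \t]+([^ \t]+)#s|<Tag>\1</Tag>|<Tag>\2</Tag>|g#' \
    "$workdir/file1.txt" > "$workdir/generated.sed"
cat "$workdir/generated.sed"

# Apply all generated commands while reading the XML file only once.
result=$(sed -f "$workdir/generated.sed" "$workdir/file2.xml")
printf '%s\n' "$result"
```

The XML file is read once instead of once per replacement row, although each line is still tested against every generated command.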

Answer:

Assumptions:

replacement strings are to be processed as whole 'words' (i.e., they are not to be replaced if they show up as a substring of a larger string of text)
replacements are only to be processed between <Tag> and </Tag> delimiters
<Tag> and the matching </Tag> are always on the same line (i.e., no line breaks between the two delimiters)

Sample data:

$ cat tags.xml
...
<Tag>123456789</Tag>
<Don't change me>123456789</Tag>
<Tag>XX_123456789_YY</Tag>
...

$ cat replacements
123456789    987654321
234567891    876543219
345678912    765432198
456789123    654321987

One awk idea:

awk '
FNR==NR { rep[$1]=$2; next}
        { for (old in rep)
              gsub("Tag>" old "<","Tag>" rep[old] "<")
        }
1
' replacements tags.xml

This generates:

...
<Tag>987654321</Tag>
<Don't change me>123456789</Tag>
<Tag>XX_123456789_YY</Tag>
...
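
With about 10,000 replacements, the loop above runs roughly 10,000 gsub calls for every line of the XML file. Under the same assumptions, plus the additional assumption (not stated in the answer) that a line holds at most one <Tag>...</Tag> pair, a direct array lookup can replace the inner loop. This is a sketch that has not been benchmarked; `workdir` and the sample data are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: one array lookup per line instead of a loop over all replacements.
# Extra assumption: at most one <Tag>...</Tag> pair per line.
workdir=$(mktemp -d)
printf '%s\n' '123456789    987654321' '234567891    876543219' > "$workdir/replacements"
printf '%s\n' '<Tag>123456789</Tag>' '<Other>123456789</Other>' \
    '<Tag>XX_123456789_YY</Tag>' > "$workdir/tags.xml"

result=$(awk '
FNR==NR { rep[$1]=$2; next }                 # first file: build the old -> new map
match($0, /<Tag>[^<]*<\/Tag>/) {             # second file: locate the Tag element
    old = substr($0, RSTART+5, RLENGTH-11)   # text between <Tag> and </Tag>
    if (old in rep)                          # replace only on an exact match
        $0 = substr($0, 1, RSTART+4) rep[old] substr($0, RSTART+RLENGTH-6)
}
1                                            # print every line, changed or not
' "$workdir/replacements" "$workdir/tags.xml")
printf '%s\n' "$result"
```

Because the lookup compares the whole tag content, a value such as XX_123456789_YY is left alone, matching the whole-word assumption.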