I have two files. The first contains two columns of data and is roughly 10,000 lines long. The other is a huge .xml file (gigabytes in size) where certain information needs to be replaced.
Currently, I am using a while loop to read in both columns of data, then using sed to make each replacement in the XML file. Unfortunately, this takes 5 hours to run, and I need a more efficient process.
File 1:
123456789 987654321
234567891 876543219
345678912 765432198
456789123 654321987
...
File 2:
...
<Tag>123456789</Tag>
...
Current Script:
#!/bin/bash
count=$(wc -l < "$1")
while read -r old new; do
    ((count-=1))
    echo -e "Old value: ${old} - New value: ${new} - Values Left: ${count}"
    # Each iteration rereads and rewrites the entire XML file.
    sed -i "s/$old/$new/" "$2"
done < "$1"
./script.sh file1.txt file2.xml
CodePudding user response:
As usual, sed or anything else involving regular expressions is the wrong approach to take with XML documents. Here's a perl script that uses the XML::Twig module (install it through your OS's package manager or favorite CPAN client) to efficiently handle huge documents:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
# Takes two arguments: the file listing the tag values to change and what
# to change them to, and the input XML document. Outputs to standard
# output; redirect to a file if needed.
my ($tag_file, $xml_file) = @ARGV;
open my $tags, "<", $tag_file or die "Unable to open $tag_file: $!\n";
my %replacements;
while (<$tags>) {
    chomp;
    my ($from, $to) = split;
    $replacements{$from} = $to;
}
close $tags;
my $xml = XML::Twig->new(
    twig_handlers => {
        Tag => sub {
            my $id = $_->text;
            if (exists $replacements{$id}) {
                $_->set_text($replacements{$id});
            }
            $_->flush;
        }
    },
    pretty_print => 'indented' # Use 'none' instead to save space in big files
                               # that don't need to be human-readable.
);
$xml->parsefile($xml_file);
$xml->flush;
Example usage:
$ cat input.xml
<?xml version="1.0"?>
<Document>
<Tag>123456789</Tag>
<Tag>234567891</Tag>
<Tag>345678912</Tag>
<OtherTag>456789123</OtherTag>
</Document>
$ perl demo.pl input.txt input.xml
<?xml version="1.0"?>
<Document>
<Tag>987654321</Tag>
<Tag>876543219</Tag>
<Tag>765432198</Tag>
<OtherTag>456789123</OtherTag>
</Document>
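Since the script writes to standard output, replacing the original file takes one extra step. A minimal sketch of one way to do that, assuming a POSIX shell with `mktemp` available; `rewrite_in_place` is a hypothetical helper name, not part of the answer above:

```shell
# Hypothetical helper: run a filter command that takes the file path as its
# last argument and writes the result to standard output. The file is
# replaced only if the filter exits successfully.
rewrite_in_place() {
    target=$1
    shift
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" "$target" > "$tmp"; then
        mv "$tmp" "$target"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Assumed usage, with the file names from the example above:
#   rewrite_in_place input.xml perl demo.pl input.txt
```

The temporary file is created next to the target so that the final `mv` stays on the same filesystem. Note that the replaced file takes on the temporary file's permissions, so this may need adjusting if the original mode matters.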
CodePudding user response:
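With GNU sed, bash and process substitution. The inner sed turns each pair in file1.txt into an s command, and the outer sed applies all of those commands to file2.xml in a single pass:
sed -f <(sed -E 's#([^ \t]+)[ \t]+([^ \t]+)#s|<Tag>\1</Tag>|<Tag>\2</Tag>|g#' file1.txt) file2.xml
I switched from the common s/// to s### and s||| so that the slashes in </Tag> do not need to be escaped.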
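Because the inner sed only builds a script, it can help to look at that script before running it over a multi-gigabyte file. A minimal sketch of the same idea with the generated script written to a file instead of a process substitution, assuming GNU sed; `make_sed_script`, `replace.sed` and `file2.new.xml` are hypothetical names:

```shell
# Hypothetical helper: print one s-command per "old new" pair in the file "$1".
make_sed_script() {
    sed -E 's#([^ \t]+)[ \t]+([^ \t]+)#s|<Tag>\1</Tag>|<Tag>\2</Tag>|g#' "$1"
}

# Assumed usage, with the file names from the question:
#   make_sed_script file1.txt > replace.sed
#   head -n 3 replace.sed                        # inspect the generated commands
#   sed -f replace.sed file2.xml > file2.new.xml
```

For the first line of File 1, the generated command is `s|<Tag>123456789</Tag>|<Tag>987654321</Tag>|g`. The XML file is read only once, but sed still tries every generated command against every line.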
CodePudding user response:
Assumptions:
- replacement strings are to be processed as whole 'words' (ie, they are not to be replaced if they show up as a substring of a larger string of text)
- replacements are only to be processed between <Tag> and </Tag> delimiters
- <Tag> and the matching </Tag> are always on the same line (ie, no linebreaks between the two delimiters)
Sample data:
$ cat tags.xml
...
<Tag>123456789</Tag>
<Don't change me>123456789</Tag>
<Tag>XX_123456789_YY</Tag>
...
$ cat replacements
123456789 987654321
234567891 876543219
345678912 765432198
456789123 654321987
One awk idea:
awk '
FNR==NR { rep[$1]=$2; next }
        { for (old in rep)
              gsub("Tag>" old "<","Tag>" rep[old] "<")
        }
1
' replacements tags.xml
This generates:
...
<Tag>987654321</Tag>
<Don't change me>123456789</Tag>
<Tag>XX_123456789_YY</Tag>
...
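The loop above tries every replacement on every line, which for the question's data is roughly 10,000 gsub calls per line. A possible variation extracts the tag's value and does a single array lookup instead. This is only a sketch: it assumes at most one <Tag>...</Tag> pair per line (an extra assumption beyond those listed above), and `replace_tags` is a hypothetical wrapper name:

```shell
# Hypothetical wrapper: "$1" is the replacements file, "$2" is the XML file.
# Assumes at most one <Tag>...</Tag> pair per line.
replace_tags() {
    awk '
    # First file: build the old -> new lookup table.
    FNR == NR { rep[$1] = $2; next }
    # Second file: find the tag, look its whole value up once.
    match($0, /<Tag>[^<]*<\/Tag>/) {
        # "<Tag>" is 5 characters and "</Tag>" is 6, so 11 in total.
        old = substr($0, RSTART + 5, RLENGTH - 11)
        if (old in rep)
            $0 = substr($0, 1, RSTART + 4) rep[old] substr($0, RSTART + RLENGTH - 6)
    }
    1
    ' "$1" "$2"
}

# Assumed usage, with the sample files from above:
#   replace_tags replacements tags.xml > tags.new.xml
```

Because the whole value between the delimiters is used as the lookup key, XX_123456789_YY is left alone without any extra word-boundary handling, and requiring the leading `<` means a tag such as <OtherTag> is not matched.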