Removing a string from a file using Perl

I have a file, and I need to remove a string whenever it appears in the file. The file contains the following text (example):

<RECORD><JOSE><?xml version="1.0" encoding="UTF-8" standalone="no" ?><JUAN><ISMAEL><?xml version="1.0" encoding="UTF-8" standalone="no" ?></ISMAEL><NEWFILE><?xml version="1.0" encoding="UTF-8" standalone="no" ?></NEWFILE></RECORD>

I need to remove this string every time it appears in the file.
String to be removed: <?xml version="1.0" encoding="UTF-8" standalone="no" ?>

I started using Perl this week, and I still have a lot to learn. This is the code I have so far, but it is not working.

use strict;
use warnings;

my $dir = path('D:\Programs\PERL\perl_tests'); # /dir

my $file = $dir->child("tobeclean.txt"); # /file.txt

open(REMFILE,"<",$file) || die "couldn't open $file: $!\n";

while (<REMFILE>) {
     s{<?xml version="1.0" encoding="UTF-8" standalone="no" ?>}{};
    print;
}

close(REMFILE);

CodePudding user response：

The ? is a meta-character. What you have should work if you just escape it:

s{<\?xml version="1.0" encoding="UTF-8" standalone="no" \?>}{};

Unescaped, the ? means that the preceding atom is optional, so ab?c matches abc or ac. Note that . is also a metacharacter and should probably be escaped as well, but now you're heading down the rabbit hole. It would probably be best to do:

my $k=quotemeta(q/<?xml version="1.0" encoding="UTF-8" standalone="no" ?>/);
s{$k}{};

or similar to ensure that you get exactly what you want. Or search for a fixed string using something like:

s{\Q<?xml version="1.0" encoding="UTF-8" standalone="no" ?>\E}{};
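
To see the difference side by side, here is a minimal sketch that runs the same text once as a raw pattern and once quoted with \Q...\E. The sample input and the variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample input, shortened from the question's example.
my $prolog = q{<?xml version="1.0" encoding="UTF-8" standalone="no" ?>};
my $input  = "<RECORD><JOSE>$prolog</JOSE></RECORD>";

# Interpolated as-is: the ? and . in $prolog act as regex metacharacters.
(my $unescaped = $input) =~ s{$prolog}{}g;

# Quoted with \Q...\E: every character in $prolog is matched literally.
(my $escaped = $input) =~ s{\Q$prolog\E}{}g;

print "unescaped: $unescaped\n";
print "escaped:   $escaped\n";
```

The first substitution leaves the input untouched because the raw pattern never matches the literal <?xml ... ?> text, while the second removes the prolog.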

CodePudding user response：

Perl makes it easy to iterate over a file and modify its contents. You can do it from the command line.

perl -i -pe 's/\Q<?xml version="1.0" encoding="UTF-8" standalone="no" ?>//g' file.xml

This assumes that this string is always on a single line and doesn't span lines.
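
For reference, roughly the same in-place edit can be written as a script. This is only a sketch: it relies on Perl's $^I variable (the in-place-edit extension that the -i switch sets), and the strip_in_place name and the .bak backup suffix are arbitrary choices made for this example.

```perl
use strict;
use warnings;

# Sketch of a script equivalent to: perl -i.bak -pe 's/...//g' FILE...
# strip_in_place and the '.bak' suffix are hypothetical choices for this example.
my $prolog = q{<?xml version="1.0" encoding="UTF-8" standalone="no" ?>};

sub strip_in_place {
    my @files = @_;
    return unless @files;

    # Setting $^I turns on in-place editing for the files listed in @ARGV;
    # each original is kept as FILE.bak.
    local ($^I, @ARGV) = ('.bak', @files);

    while (my $line = <>) {
        $line =~ s/\Q$prolog\E//g;
        print $line;    # while $^I is set, print writes into the edited file
    }
    return;
}

# Edit every file named on the command line, if any.
strip_in_place(@ARGV);
```

Like the one-liner, this works line by line, so it shares the assumption that the string never spans lines.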

CodePudding user response：

The regex you tried fails because ? is a metacharacter: it has a special meaning in a regex pattern (it makes the preceding item optional), as already explained. Escape it and all is well.

But, what if that phrase comes spread over multiple lines in a file? Then reading the file line-by-line can never find the whole thing in one regex. Is it absolutely certain that the phrase is always entirely on one line?

To be safe, I'd suggest reading the whole file into a string ("slurping" it). The phrase may then contain linefeeds, if it does span multiple lines, so use a more general pattern.

use warnings;
use strict;
use feature 'say';

use Path::Tiny;

my $file = shift // die "Usage: $0 file\n";

my $text = path($file)->slurp;

$text =~ s{<\?xml [^>]* >}{}xg;

say $text;

The pattern [^>]* matches any run of characters other than >, so everything up to the first >. But this tag-like construct (the "prolog", <?xml ...) must be closed with ?>, so a more careful pattern could be

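$text =~ s{<\?xml .*?(?= \?\s*>) \?\s*>}{}xsg;

This uses a lookahead assertion, (?= ...), and the /s modifier so that . also matches a newline in case the prolog is spread over several lines. I use Path::Tiny since the question clearly does (even though the use statement itself is left out!).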
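
As a quick check of the multi-line case, here is a minimal sketch with a made-up input in which the prolog is broken across two lines. It uses a simplified variant of the pattern above (no lookahead, and assuming the prolog ends in ?> with no whitespace in between) and compares line-by-line processing with working on the whole string:

```perl
use strict;
use warnings;

# Made-up input: the prolog is split across two lines.
my $text = qq{<RECORD><JOSE><?xml version="1.0"\n encoding="UTF-8" standalone="no" ?></JOSE></RECORD>\n};

# Line by line: no single line holds the whole prolog, so nothing is removed.
my $by_line = '';
for my $line (split /^/, $text) {
    (my $copy = $line) =~ s{<\?xml\b .*? \?>}{}xsg;
    $by_line .= $copy;
}

# Whole string at once: with /s the . also matches the embedded newline.
(my $whole = $text) =~ s{<\?xml\b .*? \?>}{}xsg;

print "line by line: $by_line";
print "slurped:      $whole";
```

The line-by-line result is identical to the input, while the slurped version has the prolog removed.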

But really, is this file in a proper format of some kind? Is it XML? (You are removing an XML document "prolog". The string shown is invalid as XML in many ways, but perhaps the actual data is different.) If so, by all means read it with a library instead.

Some good libraries to mention: XML::LibXML, XML::Twig, Mojo::DOM.