Perl, use regex to find a match and replace just the last character of the match (in this case a lin

Time:10-16

I have to clean several CSV files before I put them in a database. Some of the files have an unexpected line break in the middle of a line. Since every line should end with a number, I managed to fix the files with this one-liner:

perl -pe 's/[^0-9]\r?\n//g'

While it did work, it also removes the last character before the line break:

foob
ar

turns into

fooar

Is there a Perl one-liner that follows the same rule without removing the last character before the line break?

CodePudding user response:

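Just capture the last character and put it back (in the replacement part use $1; \1 is meant for backreferences inside the pattern and draws a warning under -w):

perl -pe 's/([^0-9])\r?\n/$1/g'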

CodePudding user response:

One way is to use \K, which works much like a lookbehind:

perl -pe 's/[^0-9]\K\r?\n//g'

\K discards everything matched before it from the match, so only what follows it (here the line break) is subject to the replacement.


However, I'd rather recommend processing your CSV with a library, even if it takes a little more code. There has already been one problem, that linefeed inside a field; what else may be there? A good library can handle a variety of irregularities.

A simple example with Text::CSV:

use warnings;
use strict;
use feature 'say';

use Text::CSV;

my $file = shift or die "Usage: $0 file.csv\n";

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 }); 

open my $fh, '<', $file  or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) { 
    s/\n+//g for @$row;   # remove linefeeds embedded in any field
    $csv->say(\*STDOUT, $row);
}

Consider other constructor options (also available via accessors) that help with all kinds of unexpected problems, for example allow_whitespace.

This can be done as a command-line program ("one-liner") as well, if there is a reason for that. The library's functional interface, via csv, is then very convenient:

perl -MText::CSV=csv -we'
   csv in => *ARGV, on_in => sub { s/\n+//g for @{$_[1]} }' filename

With *ARGV, the input is taken either from a file named on the command line or from STDIN.
