RegEx: How to find all newline symbols not preceded by a double quote?-CodePudding

I want to remove all illegal newline symbols LF in a CSV file between double quotes.

"name", "created"
"David A","2022-04-04"
"Mark
B", "2022-04-09"
"Peter C", "2022-05-01"

The file is called name.csv for example

Now I can do it with

cat name.csv |
| perl -p -e 's/\n/!LF_SYMBOL!/g'        \
| perl -p -e 's/"!LF_SYMBOL!"/"!EOL!"/g' \
| perl -p -e 's/!LF_SYMBOL!//g'          \
| perl -p -e 's/!EOL!/\n/g'              \
> name_new.csv

But it's ugly. I want to remove all newlines not following a double quote

I have tried

perl -p -e 's/[^"]\n//' name.csv

and

perl -p -e 's/^[^"]\n//' name.csv

Both just removing all "\n for some reason.

Any ideas?

CodePudding user response：

It's usually best to use an existing parser rather than writing your own. This is no exception.

Text::CSV_XS has no problem handling line feeds in double-quoted fields.^[1]

It can even tolerate the whitespace after the comma if allow_whitespace => 1 is provided.

So I would use this module rather than writing my own parser.

All you need is this:

perl -MText::CSV_XS=csv -e'
   csv
      in => *ARGV,
      allow_whitespace => 1,
      on_in => sub { s/\n//g for @{ $_[1] }; };
' name.csv >name_new.csv

Output:

name,created
"David A",2022-04-04
MarkB,2022-04-09
"Peter C",2022-05-01

If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.

Handling line feeds in fields require passing binary => 1, which is the default when using the csv function.

CodePudding user response：

Since you have a two-column "CSV"-like file and all you want is to remove line breaks inside the two fields, you can use a solution like

perl -0777 -i -pe 's/^("[^"]*(?:""[^"]*)*")\s*,\s*(?1)$/$&=~s!\R !!gr/gme' file.csv

See the main regex demo. When the match is found, the second regex, s!\R !!gr removes all line breaks inside the match.

Details:

^ - start of a line (due to m flag)
("[^"]*(?:""[^"]*)*") - Group 1: ", zero or more chars other than ", then zero or more repetitions of "" and then zero or more chars other than ", and then a " (matches a CSV field where literal double quotes are escaped with ")
\s*,\s* - a comma enclosed with zero or more whitespaces
(?1) - a subroutine that repeats Group 1 pattern
$ - end of a line (due to m flag).

CodePudding user response：

If this is a valid CSV file read it using a library, like Text::CSV. They don't have a problem with newlines embedded in fields and then it's simple to remove those

use warnings;
use strict;
use feature 'say';

use Text::CSV;

my $file = shift or die "Usage: $0 file.csv\n";

my $csv = Text::CSV->new(
    { binary => 1, auto_diag => 1, allow_whitespace => 1 }); 

open my $fh, '<', $file  or die "Can't open $file: $!";

while (my $row = $csv->getline($fh)) { 
    s/\n //g for @$row; 
    $csv->say(\*STDOUT, $row);
}

This can be done in a one-liner as well, if there is some benefit to that.

The posted text, as it stands, isn't valid CSV and cannot be directly parsed by a library since commas are missing between fields. I presume that those are typos since the question says "CSV" at multiple places. Please clarify.

The embedded newlines are accepted with the binary attribute while the extras spaces after commas (strictly invalid) are accepted with the allow_whitespace.