I want to remove all illegal newline symbols LF in a CSV file between double quotes.
"name", "created"
"David A","2022-04-04"
"Mark
B", "2022-04-09"
"Peter C", "2022-05-01"
The file is called name.csv
for example
Now I can do it with
cat name.csv |
| perl -p -e 's/\n/!LF_SYMBOL!/g' \
| perl -p -e 's/"!LF_SYMBOL!"/"!EOL!"/g' \
| perl -p -e 's/!LF_SYMBOL!//g' \
| perl -p -e 's/!EOL!/\n/g' \
> name_new.csv
But it's ugly. I want to remove all newlines not following a double quote
I have tried
perl -p -e 's/[^"]\n//' name.csv
and
perl -p -e 's/^[^"]\n//' name.csv
Both just removing all "\n
for some reason.
Any ideas?
CodePudding user response:
It's usually best to use an existing parser rather than writing your own. This is no exception.
Text::CSV_XS has no problem handling line feeds in double-quoted fields.[1]
It can even tolerate the whitespace after the comma if allow_whitespace => 1
is provided.
So I would use this module rather than writing my own parser.
All you need is this:
perl -MText::CSV_XS=csv -e'
csv
in => *ARGV,
allow_whitespace => 1,
on_in => sub { s/\n//g for @{ $_[1] }; };
' name.csv >name_new.csv
Output:
name,created
"David A",2022-04-04
MarkB,2022-04-09
"Peter C",2022-05-01
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.
- Handling line feeds in fields require passing
binary => 1
, which is the default when using thecsv
function.
CodePudding user response:
Since you have a two-column "CSV"-like file and all you want is to remove line breaks inside the two fields, you can use a solution like
perl -0777 -i -pe 's/^("[^"]*(?:""[^"]*)*")\s*,\s*(?1)$/$&=~s!\R !!gr/gme' file.csv
See the main regex demo. When the match is found, the second regex, s!\R !!gr
removes all line breaks inside the match.
Details:
^
- start of a line (due tom
flag)("[^"]*(?:""[^"]*)*")
- Group 1:"
, zero or more chars other than"
, then zero or more repetitions of""
and then zero or more chars other than"
, and then a"
(matches a CSV field where literal double quotes are escaped with"
)\s*,\s*
- a comma enclosed with zero or more whitespaces(?1)
- a subroutine that repeats Group 1 pattern$
- end of a line (due tom
flag).
CodePudding user response:
If this is a valid CSV file read it using a library, like Text::CSV. They don't have a problem with newlines embedded in fields and then it's simple to remove those
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my $file = shift or die "Usage: $0 file.csv\n";
my $csv = Text::CSV->new(
{ binary => 1, auto_diag => 1, allow_whitespace => 1 });
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\n //g for @$row;
$csv->say(\*STDOUT, $row);
}
This can be done in a one-liner as well, if there is some benefit to that.
The posted text, as it stands, isn't valid CSV and cannot be directly parsed by a library since commas are missing between fields. I presume that those are typos since the question says "CSV" at multiple places. Please clarify.
The embedded newlines are accepted with the binary
attribute while the extras spaces after commas (strictly invalid) are accepted with the allow_whitespace
.