Given the following CSV with multiline fields:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"
... which displays as:
| id | text |
|---|---|
| 1 | line 1 line 2 |
| 2 | line 1 line 2 |
| 1 | line 1 line 2 |
If I use the following awk command to remove duplicate rows from this CSV based on the id (column 1):
awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'
I end up with:
"id","text"
"1","line 1
line 2"
"2","line 1
...which displays as:
| id | text |
|---|---|
| 1 | line 1 line 2 |
| 2 | line 1 |
This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something, though.
Additional info: I'm writing these CSVs according to the RFC 4180 standard. Most notably, fields containing line breaks, double quotes, or commas are enclosed in double quotes, and double quotes appearing inside a field are escaped with a preceding double quote.
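For example, a made-up row whose text field contains all three (an embedded quote, a comma, and a line break) would be written like this:
"6","She said ""hi"", then
continued on a second line"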
Also, I'm writing the CSVs in Node/JS, but I found awk to be a really simple/fast way to dedupe very large files in the past (none had multiline fields, though).
I'm by no means bound to awk. I'm open to any/all suggestions; I just wanted to be clear about what I've tried. Thanks!
CodePudding user response:
Awk is not CSV-aware, so it's not really the right tool for the job: by default it splits records on newlines, so a quoted field that contains a line break is broken across two records, and the dedupe then operates on those fragments. There are a few CSV implementations floating around the internet; maybe you could take a look at them.
You did mention the file being large, but if it fits in your memory, this is a variation of something I needed a few weeks back. It's GNU awk using FPAT, so it's not really fast:
$ gawk '
BEGIN {
    RS="^$"                               # read in the whole file as one record
    FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")"  # a field is either unquoted text or a quoted string with "" escapes
    OFS=","
}
{
    for(i=1;i<NF;i+=2)       # iterate fields two at a time (id, text)
        if(!a[$i]++)         # if the id field has not been seen before
            print $i,$(i+1)  # output the id and text fields
}' file
Test data:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"
Output:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
I don't know how many ways it can fail you, though. At minimum it reads the whole file into memory and assumes every row has exactly two fields.
CodePudding user response:
I’d check out GoCSV and its unique subcommand.
For your input, the command to keep only the first row amongst duplicate rows based on the text column would be:
gocsv unique -c text input.csv
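The question dedupes on the id column rather than text; assuming -c accepts any column name the same way, the equivalent command would presumably be:
gocsv unique -c id input.csv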