Given the following CSV with multiline fields:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"
... which displays as:
| id | text |
|---|---|
| 1 | line 1 line 2 |
| 2 | line 1 line 2 |
| 1 | line 1 line 2 |
If I use the following awk command to remove duplicate rows from this CSV based on the id (column 1):
awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'
I end up with:
"id","text"
"1","line 1
line 2"
"2","line 1
...which displays as:
| id | text |
|---|---|
| 1 | line 1 line 2 |
| 2 | line 1 |
This is an oversimplified example, but it seems awk doesn't play well with multiline fields. Perhaps I'm missing something, though.
Additional info: I'm writing these CSVs according to the RFC 4180 standard. Most notably, fields containing line breaks, double quotes, or commas are enclosed in double quotes, and double quotes appearing inside a field are escaped with a preceding double quote.
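For example, a made-up row whose text field contains all three (an embedded quote, a comma, and a line break) would be written like this:
"6","She said ""hi"", then
continued on a second line"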
Also, I'm writing the CSVs in Node/JS, but I found awk to be a really simple/fast way to dedupe very large files in the past (none had multiline fields, though).
I'm by no means bound to awk. I'm open to any/all suggestions; I just wanted to be clear about what I've tried. Thanks!
CodePudding user response:
Awk is not CSV-aware, so it's not really the right tool for the job: by default it splits records on newlines, so a quoted field that contains a line break is broken across two records, and the dedupe then operates on those fragments. There are a few CSV implementations floating around the internet; maybe you could take a look at them.
You did mention the file being large, but if it fits in your memory, this is a variation of something I needed a few weeks back. It's GNU awk using FPAT, so it's not really fast:
$ gawk '
BEGIN {
    RS="^$"                               # read in the whole file as one record
    FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")"  # a field is either unquoted text or a quoted string with "" escapes
    OFS=","
}
{
    for(i=1;i<NF;i+=2)       # iterate fields two at a time (id, text)
        if(!a[$i]++)         # if the id field has not been seen before
            print $i,$(i+1)  # output the id and text fields
}' file
Test data:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"
Output:
"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
I don't know how many ways it can fail you, though. At minimum it reads the whole file into memory and assumes every row has exactly two fields.
CodePudding user response:
I’d check out GoCSV and its unique subcommand.
For your input, the command to keep only the first row amongst duplicate rows based on the text column would be:
gocsv unique -c text input.csv
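The question dedupes on the id column rather than text; assuming -c accepts any column name the same way, the equivalent command would presumably be:
gocsv unique -c id input.csv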