Remove duplicate rows from a large csv with multiline fields

Given the following csv, with multiline fields:

"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"1","line 1
line 2"

... which displays as:

id text
1 line 1
line 2
2 line 1
line 2
1 line 1
line 2

If I use the following awk command to remove duplicate rows from this csv based on the id (column 1):

awk -F, '!x[$1]++' 'file-01.csv' > 'file-01-deduped.csv'

I end up with:

"id","text"
"1","line 1
line 2"
"2","line 1

...which displays as:

id text
1 line 1
line 2
2 line 1

This is an oversimplified example, but it seems awk doesn't play well with multiline fields, presumably because it treats every newline as a record separator, even inside quoted fields. Perhaps I'm missing something though.

Additional info: I'm writing these csvs according to the RFC 4180 standard. Most notably, fields containing line breaks, double quotes, or commas are enclosed in double quotes, and a double quote appearing inside a field is escaped with a preceding double quote.
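
For example, a made-up record whose text field contains a comma, an embedded double quote, and a line break would be written like this:

"9","He said ""wait,"" and then
walked out"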

Also, I'm writing the csv in Node/JS, but I found awk to be a really simple/fast way to dedupe very large files in the past—none had multiline fields though.

I'm by no means bound to awk—I'm open to any/all suggestions—just wanted to be clear about what I've tried. Thanks!

CodePudding user response:

Awk is not csv aware, so it's not really the right tool for the job. There are a few csv implementations floating around the internets; maybe you could take a look at them.
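
For instance, here's a minimal streaming sketch in Node (not from this answer; it assumes the csv-parse and csv-stringify npm packages and dedupes on the id column):

const fs = require('fs');
const { Transform } = require('stream');
const { parse } = require('csv-parse');
const { stringify } = require('csv-stringify');

const seen = new Set();

fs.createReadStream('file-01.csv')
  .pipe(parse({ columns: true }))            // RFC 4180-aware: handles quoted multiline fields
  .pipe(new Transform({
    objectMode: true,
    transform(row, _enc, done) {
      if (!seen.has(row.id)) {               // keep only the first row for each id
        seen.add(row.id);
        this.push(row);
      }
      done();
    },
  }))
  .pipe(stringify({ header: true, quoted: true }))
  .pipe(fs.createWriteStream('file-01-deduped.csv'));

Because it streams row by row, only the set of seen ids has to fit in memory, not the whole file.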

You did mention the file being large, but if it fits in your memory, this is a variation of something I needed a few weeks back. It's GNU awk using FPAT, so it's not really fast:

$ gawk '
BEGIN {
    RS="^$"                                 # read in the whole file as one record
    FPAT="([^,\n]*)|(\"(\"\"|[^\"])+\")"    # a field is either unquoted (no comma/newline) or quoted (with "" escapes)
    OFS=","
}
{
    for(i=1;i<NF;i+=2)                      # iterate fields 2 at a time (id, text)
        if(!a[$i]++)                        # if the id field has not been seen before
            print $i,$(i+1)                 # output the id and text fields
}' file

Test data:

"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"
"1","line 1
line 2"

Output:

"id","text"
"1","line 1
line 2"
"2","line 1
line 2"
"3"," ""line 1""
line 2"
"4",""
"5","line 1,
line 2"

I don't know how many ways it can fail you, tho.

CodePudding user response:

I’d check out GoCSV and its unique subcommand.

For your input, the command to keep only the first row amongst duplicate rows based on the text column would be:

gocsv unique -c text input.csv
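
If you'd rather dedupe on the id column, as the question describes, the same flag with that column name should work:

gocsv unique -c id input.csv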