How can I use awk to remove duplicate entries in the same field with data separated with commas?

I am trying to call awk from a bash script to remove duplicate data entries of a field in a file.

Data Example in file1

data1 a,b,c,d,d,d,c,e

data2 a,b,b,c

Desired Output:

data1 a,b,c,d,e

data2 a,b,c

First I removed the first column to only have the second remaining.

cut --complement -d$'\t' -f1 file1 &> file2

This worked fine, and now I just have the following in file2:

a,b,c,d,d,d,c,e

a,b,b,c

So then I tried this code that I found but do not understand well:

awk '{
    for(i=1; i<=NF; i++)
            printf "%s", (!seen[$1]++ ? (i==1?"":FS) $i: "" )
    delete seen; print ""
}' file2

The problem is that this code was written for space-delimited input, whereas my data is comma-delimited with a variable number of values on each row. The code just prints the file unchanged. I also tried setting FS to a comma like this, to no avail:

printf "%s", (!seen[$1]++ ? (i==1?"":FS=",") $i: ""

CodePudding user response：

This is similar to the code you found.

awk -F'[ ,]' '
    {
        s = $1 " " $2
        seen[$2]++

        for (i=3; i<=NF; i++)
            if (!seen[$i]++) s = s "," $i

        print s
        delete seen
    }
' data-file

-F'[ ,]' - split input lines on spaces and commas
s = ... - we could use printf like the code you found, but building a string is less typing
!seen[x]++ is a common idiom - it is true only the first time x is seen
to avoid special-casing when to print a comma (as your sample code does with spaces), we simply put $2 into the output string up front and mark it with seen[$2]++
then for the remaining columns (3 .. NF), we append a comma and the column if it hasn't been seen before
delete seen - clear the array for the next line
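
The !seen[x]++ idiom described above can also be tried in isolation. The following is only a minimal sketch, assuming each line is a label, a space, and one comma-separated list. The function name dedupe_field2 is hypothetical. Unlike the answer above, the sketch uses split() on $2 rather than a combined field separator.

```shell
# dedupe_field2 is a hypothetical helper name, not part of the answer above.
# Assumes input lines of the form: <label> <comma-separated list>
dedupe_field2() {
    awk '
    {
        n = split($2, parts, ",")      # break the comma list into parts[1..n]
        out = ""
        for (i = 1; i <= n; i++)
            if (!seen[parts[i]]++)     # true only on the first occurrence
                out = out (out == "" ? "" : ",") parts[i]
        print $1, out
        split("", seen)                # empty the array for the next line
    }'
}

printf 'data1 a,b,c,d,d,d,c,e\ndata2 a,b,b,c\n' | dedupe_field2
# prints:
# data1 a,b,c,d,e
# data2 a,b,c
```

Here split("", seen) plays the same role as delete seen. It is used because it empties the array even in awk implementations that do not accept delete on a whole array.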

CodePudding user response：

That code is nearly right; you need to specify the delimiter and change $1 to $i.

$ awk -F ',' '{
    for(i=1; i<=NF; i++)
            printf "%s", (!seen[$i]++ ? (i==1?"":FS) $i: "" )
    delete seen; print ""
}' /tmp/file1
data1 a,b,c,d,e
data2 a,b,c
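
One thing that may be worth checking with this approach: because the only separator is the comma, the first field is the label and the first value together (for example "data1 a"). A later repeat of that first value is therefore not recognized as a duplicate. The sketch below wraps the same command in a hypothetical function, dedupe_line, and feeds it a made-up line to show this.

```shell
# dedupe_line is a hypothetical wrapper around the command from the answer above.
dedupe_line() {
    awk -F ',' '{
        for (i = 1; i <= NF; i++)
            printf "%s", (!seen[$i]++ ? (i == 1 ? "" : FS) $i : "")
        delete seen; print ""
    }'
}

# Hypothetical input where the first value repeats later in the list.
printf 'data1 a,b,a\n' | dedupe_line
# prints: data1 a,b,a   (field 1 is "data1 a", so the final "a" looks new)
```

If the first value can repeat in your data, splitting on both spaces and commas, as in the previous answer, avoids this.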

CodePudding user response：

Using GNU sed, if applicable:

$ sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta' input_file
data1 a,b,c,d,e
data2 a,b,c