I am trying to call awk from a bash script to remove duplicate data entries of a field in a file.
Example data in file1:
data1 a,b,c,d,d,d,c,e
data2 a,b,b,c
Desired Output:
data1 a,b,c,d,e
data2 a,b,c
First I removed the first column so that only the second one remains.
cut --complement -d$'\t' -f1 file1 &> file2
This worked fine, and now I just have the following in file2:
a,b,c,d,d,d,c,e
a,b,b,c
So then I tried this code that I found but do not understand well:
awk '{
    for(i=1; i<=NF; i++)
        printf "%s", (!seen[$1]++ ? (i==1?"":FS) $i: "" )
    delete seen; print ""
}' file2
The problem is that this code was written for space-delimited input, whereas my data is comma-delimited with a variable number of values on each row. The code just prints the file unchanged. I also tried to set FS to a comma like this, to no avail:
printf "%s", (!seen[$1]++ ? (i==1?"":FS=",") $i: ""
CodePudding user response:
This is similar to the code you found.
awk -F'[ ,]' '
{
    s = $1 " " $2
    seen[$2]++
    for (i=3; i<=NF; i++)
        if (!seen[$i]++) s = s "," $i
    print s
    delete seen
}
' data-file
- -F'[ ,]' - split input lines on spaces and commas
- s = ... - we could use printf like the code you found, but building a string is less typing
- !seen[x]++ is a common idiom - it returns true only the first time x is seen
- to avoid special-casing when to print a comma (as your sample code does with spaces), we simply add $2 to the print string and set seen[$2]
- then for the remaining columns (3 .. NF), we add a comma and the column if it hasn't been seen before
- delete seen - clear the array for the next line
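To see the !seen[x]++ idiom on its own, here is a minimal sketch that works on comma-only lines such as the ones in file2 from the question. The function name dedup_csv is hypothetical and exists only for this illustration; it assumes an awk that accepts delete on a whole array, as the code above already does.

```shell
# dedup_csv is a hypothetical helper name used only for this illustration.
# It reads comma-separated lines on stdin and prints each value once per line,
# in first-seen order.
dedup_csv() {
    awk -F',' '{
        out = ""
        for (i = 1; i <= NF; i++)
            # seen[$i] is empty (false) the first time a value appears, so the
            # negation is true; the ++ then increments the count, which makes
            # the test false for every later occurrence on the same line.
            if (!seen[$i]++)
                out = (out == "" ? $i : out "," $i)
        print out
        delete seen   # forget the values before reading the next line
    }'
}

printf '%s\n' 'a,b,c,d,d,d,c,e' 'a,b,b,c' | dedup_csv
```

For the two sample lines this prints a,b,c,d,e and a,b,c.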
CodePudding user response:
That code is nearly right; you just need to specify the delimiter and change $1 to $i.
$ awk -F ',' '{
    for(i=1; i<=NF; i++)
        printf "%s", (!seen[$i]++ ? (i==1?"":FS) $i: "" )
    delete seen; print ""
}' /tmp/file1
data1 a,b,c,d,e
data2 a,b,c
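One thing to watch with a plain comma delimiter: the label and the first value form a single field ("data1 a"), so a later repeat of that first value would not be recognised as a duplicate. A possible variant, sketched here under the assumption that the label and the list are separated by spaces or tabs, leaves FS at its default and splits only the second field with split(). The function name dedup_field2 is hypothetical.

```shell
# dedup_field2 is a hypothetical helper name used only for this illustration.
# Assumes each line looks like: <label><whitespace><comma-separated values>
dedup_field2() {
    awk '{
        n = split($2, vals, ",")          # break only the value list on commas
        out = ""
        for (i = 1; i <= n; i++)
            if (!seen[vals[i]]++)         # true only for the first occurrence
                out = (out == "" ? vals[i] : out "," vals[i])
        print $1, out                     # label, a space (default OFS), deduplicated list
        delete seen                       # reset before the next line
    }' "$@"
}

printf '%s\n' 'data1 a,b,c,d,d,d,c,e' 'data2 a,b,b,c' | dedup_field2
```

This prints the desired output from the question and also reduces a line such as "data1 a,b,a" to "data1 a,b". Note that the label and the list are always joined with a single space on output, even if the input used a tab.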
CodePudding user response:
Using GNU sed, if applicable:
$ sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta' input_file
data1 a,b,c,d,e
data2 a,b,c