Remove duplicates ignoring specific columns


I want to remove all duplicate lines from a file while ignoring the first two columns, i.e. without comparing those columns.

This is my example input:

111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas

I want this output:

111  06:22  apples, bananas and pears
112  06:28  bananas
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears

I've tried the command below, but it only compares the third column and ignores the rest of the line:

awk '!seen[$3]++' sample.txt

CodePudding user response:

Store $0 in a temporary variable, set $1 and $2 to the empty string, then use the newly rebuilt $0 as the array key:

awk '{ t = $0; $1 = $2 = "" } !seen[$0]   { print t }' sample.txt

CodePudding user response:

With GNU sort and uniq:

sort -k 3 file | uniq -f 2

Output:

114  07:23  apples and bananas
111  06:22  apples, bananas and pears
112  06:28  bananas
115  08:01  bananas and pears
116  08:23  pears

See: man sort and man uniq
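Note that sort reorders the lines, so the output above is alphabetical by the third column rather than in file order. If the original order matters, one possible sketch (assuming GNU coreutils) is to number the lines first, dedupe, then restore the order and strip the numbering:

```shell
# Recreate the sample input from the question.
cat > file <<'EOF'
111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas
EOF

# nl prepends a line number, shifting every field right by one, so
# sort/uniq skip one extra field; sort -n then restores file order
# and cut removes the line number again (nl separates it with a tab).
nl -ba file | sort -k 4 | uniq -f 3 | sort -n | cut -f 2-
```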

CodePudding user response:

With GNU awk:

awk 'BEGIN{FIELDWIDTHS="3 2 5 2 *"} !seen[$5]++' file

Output:

111  06:22  apples, bananas and pears
112  06:28  bananas
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears

From man awk:

FIELDWIDTHS: A whitespace-separated list of field widths. When set, gawk parses the input into fields of fixed width, instead of using the value of the FS variable as the field separator.
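FIELDWIDTHS is a gawk extension, so on a strictly POSIX awk a substr-based sketch can do the same fixed-width split, assuming the layout really is fixed-width with the text starting at column 13 (3 + 2 + 5 + 2 + 1):

```shell
# Recreate the sample input from the question.
cat > file <<'EOF'
111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas
EOF

# Columns 1-12 hold the ID, spacing and timestamp; the dedupe key is
# everything from column 13 onward.
awk '!seen[substr($0, 13)]++' file
```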
