I want to remove all duplicate lines from a file while ignoring the first 2 columns, i.e. without comparing those columns.
This is my example input:
111 06:22 apples, bananas and pears
112 06:28 bananas
113 07:07 apples, bananas and pears
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
117 09:22 apples, bananas and pears
118 12:23 apples and bananas
I want this output:
111 06:22 apples, bananas and pears
112 06:28 bananas
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
I've tried this below, but it only compares the third column and ignores the rest of the line:
awk '!seen[$3]++' sample.txt
CodePudding user response:
Store $0 in a temporary variable, set $1 and $2 to empty, then use the newly composed $0 as the key:
awk '{ t = $0; $1 = $2 = "" } !seen[$0]++ { print t }' sample.txt
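As a side note (my own variation, not part of the answer): you can build the key from a copy of the line instead of blanking fields, which leaves $0 untouched, assuming the first two columns are whitespace-separated:

awk '{ key = $0; sub(/^[[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+/, "", key) } !seen[key]++' sample.txt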
CodePudding user response:
With GNU sort and uniq:
sort -k 3 file | uniq -f 2
Output:
114 07:23 apples and bananas
111 06:22 apples, bananas and pears
112 06:28 bananas
115 08:01 bananas and pears
116 08:23 pears
See: man sort and man uniq
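Note that this pipeline reorders the lines by the third column onward. If the original order matters, one workaround (a sketch of mine, not from the answer) is to number the lines, deduplicate, then restore and strip the numbering:

nl -ba file | sort -s -k4 | uniq -f 3 | sort -k1,1n | cut -f2-

The -s keeps the sort stable, so within each group of duplicates the earliest line comes first and is the one uniq keeps.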
CodePudding user response:
With GNU awk:
awk 'BEGIN{FIELDWIDTHS="3 1 5 1 *"} !seen[$5]++' file
Output:
111 06:22 apples, bananas and pears
112 06:28 bananas
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
From man awk:
FIELDWIDTHS: A whitespace-separated list of field widths. When set, gawk parses the input into fields of fixed width, instead of using the value of the FS variable as the field separator.
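To see how FIELDWIDTHS carves up a record, a quick way is to print every field with its index (a throwaway sketch of mine; the * width requires gawk 4.2 or newer):

echo '111 06:22 apples, bananas and pears' | gawk 'BEGIN{FIELDWIDTHS="3 1 5 1 *"} { for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'

For the sample line this prints $1=[111], $2=[ ], $3=[06:22], $4=[ ] and $5=[apples, bananas and pears], confirming that $5 holds everything after the first two columns.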