I know how to use awk to remove duplicate lines in a file:
awk '!x[$0]++' myfile.txt
But how can I remove all occurrences of a line when it appears more than twice?
For example:
apple
apple
banana
apple
pear
banana
cherry
would become:
banana
pear
banana
cherry
Thanks in advance!
CodePudding user response:
I would harness GNU AWK for this task in the following way. Let file.txt content be
apple
apple
banana
apple
pear
banana
cherry
then
awk 'FNR==NR{cnt[$0]+=1;next}cnt[$0]<=2' file.txt file.txt
gives output
banana
pear
banana
cherry
Explanation: This is a 2-pass approach. FNR==NR (record number in the current file equals the overall record number) holds true only for the 1st file; there I simply count the number of occurrences in file.txt by increasing (+=) the value in array cnt, under the whole line ($0) as key, by 1, then I instruct GNU AWK to go to the next line as I do not want to do anything else with it. In the second pass only lines whose number of occurrences is less than or equal to two are output. Note: file.txt file.txt is intentional, as the same file has to be read twice.
(tested in gawk 4.2.1)
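If the data arrives on a pipe and cannot be read twice, a single-pass variant that buffers all lines in memory would achieve the same result. This is a minimal sketch under that assumption, not part of the original answer:
awk '
    { cnt[$0]++; lines[NR] = $0 }    # count each value and remember every line
    END {                            # replay in original order, skipping values seen 3+ times
        for (i = 1; i <= NR; i++)
            if (cnt[lines[i]] <= 2) print lines[i]
    }
' file.txt
It preserves the original order, but unlike the 2-pass version it keeps the whole file in memory.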
CodePudding user response:
If you don't care about output order, this would do what you want without reading the whole file into memory in awk:
$ sort file | awk '
$0!=prev { if (cnt<3) printf "%s", vals; prev=$0; cnt=vals="" }
{ vals=vals $0 ORS; cnt++ }
END { if (cnt<3) printf "%s", vals }
'
banana
banana
cherry
pear
The output of sort has all the values grouped together, so you only need to look at the count when the value changes to know how many of the previous value there were. sort still has to consider the whole input, but it is designed to handle massive files by using demand paging, etc., and so is far more likely to be able to handle huge files than an approach that reads them all into memory in awk.
If you do care about output order you could use a DSU (Decorate/Sort/Undecorate) approach; see "How to sort data based on the value of a column for part (multiple lines) of a file?". A rough sketch of such a pipeline follows.
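For illustration only (not from the original answer), here is a minimal DSU sketch, assuming GNU sort/cut are available and that the data lines themselves contain no tabs: decorate each line with its original line number, sort by value so the counting logic above applies, drop values seen three or more times, then restore the original order and strip the decoration.
awk 'BEGIN{OFS="\t"} { print NR, $0 }' file |    # decorate: prepend the original line number
sort -t "$(printf '\t')" -k2 |                   # group identical values together
awk -F'\t' '
    $2 != prev { if (cnt < 3) printf "%s", vals; prev = $2; cnt = 0; vals = "" }
    { vals = vals $0 ORS; cnt++ }
    END { if (cnt < 3) printf "%s", vals }
' |                                              # keep only values that occur at most twice
sort -n |                                        # undecorate: restore the original order...
cut -f2-                                         # ...and remove the line-number prefix
For the sample input this prints banana, pear, banana, cherry in their original order.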