I know how to use awk to remove duplicate lines in a file:
awk '!x[$0]++' myfile.txt
But how can I remove all occurrences of a line when it appears more than twice?
For example:
apple
apple
banana
apple
pear
banana
cherry
would become:
banana
pear
banana
cherry
Thanks in advance!
CodePudding user response:
I would harness GNU AWK for this task in the following way. Let file.txt content be
apple
apple
banana
apple
pear
banana
cherry
then
awk 'FNR==NR{cnt[$0]+=1;next}cnt[$0]<=2' file.txt file.txt
gives output
banana
pear
banana
cherry
Explanation: This is a 2-pass approach. FNR==NR (record number in the current file equals the overall record number) holds true only for the 1st file; there I simply count the number of occurrences in file.txt by increasing (+=) the value in array cnt, under the whole line ($0) as key, by 1, then I instruct GNU AWK to go to the next line as I do not want to do anything else with it. In the second pass only lines whose number of occurrences is less than or equal to two are output. Note: file.txt file.txt is intentional, as the same file has to be read twice.
(tested in gawk 4.2.1)
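If the data arrives on a pipe and cannot be read twice, a single-pass variant that buffers all lines in memory would achieve the same result. This is a minimal sketch under that assumption, not part of the original answer:
awk '
    { cnt[$0]++; lines[NR] = $0 }    # count each value and remember every line
    END {                            # replay in original order, skipping values seen 3+ times
        for (i = 1; i <= NR; i++)
            if (cnt[lines[i]] <= 2) print lines[i]
    }
' file.txt
It preserves the original order, but unlike the 2-pass version it keeps the whole file in memory.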
CodePudding user response:
If you don't care about output order, this would do what you want without reading the whole file into memory in awk:
$ sort file | awk '
$0!=prev { if (cnt<3) printf "%s", vals; prev=$0; cnt=vals="" }
{ vals=vals $0 ORS; cnt++ }
END { if (cnt<3) printf "%s", vals }
'
banana
banana
cherry
pear
The output of sort has all the values grouped together, so you only need to look at the count when the value changes to know how many of the previous value there were. sort still has to consider the whole input, but it is designed to handle massive files by using demand paging, etc., and so is far more likely to be able to handle huge files than an approach that reads them all into memory in awk.
If you do care about output order you could use a DSU (Decorate/Sort/Undecorate) approach; see "How to sort data based on the value of a column for part (multiple lines) of a file?". A rough sketch of such a pipeline follows.
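For illustration only (not from the original answer), here is a minimal DSU sketch, assuming GNU sort/cut are available and that the data lines themselves contain no tabs: decorate each line with its original line number, sort by value so the counting logic above applies, drop values seen three or more times, then restore the original order and strip the decoration.
awk 'BEGIN{OFS="\t"} { print NR, $0 }' file |    # decorate: prepend the original line number
sort -t "$(printf '\t')" -k2 |                   # group identical values together
awk -F'\t' '
    $2 != prev { if (cnt < 3) printf "%s", vals; prev = $2; cnt = 0; vals = "" }
    { vals = vals $0 ORS; cnt++ }
    END { if (cnt < 3) printf "%s", vals }
' |                                              # keep only values that occur at most twice
sort -n |                                        # undecorate: restore the original order...
cut -f2-                                         # ...and remove the line-number prefix
For the sample input this prints banana, pear, banana, cherry in their original order.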