There is a file:
Mary
Mary
Mary
Mary
John
John
John
Lucy
Lucy
Mark
I need to get
Mary
Mary
Mary
John
John
Lucy
I cannot get the lines ordered by how many times each line is repeated in the file, i.e. the most frequently occurring lines must be listed first.
CodePudding user response:
If your file is already sorted (most-frequent words at top, repeated words only in consecutive lines) – your question makes it look like that's the case – you could reformulate your problem to: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:
awk 'prev==$0{print}{prev=$0}'
or if you prefer an approach that looks more familiar if coming from other programming languages:
awk '{if(prev==$0)print;prev=$0}'
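Applied to the sample input from the question, the one-liner does exactly what was asked (a quick check, assuming a POSIX awk):

```shell
# Compare each line to the previous one (prev): the first line of every
# run of identical lines is skipped, the remaining copies are printed.
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | awk 'prev==$0{print}{prev=$0}'
# prints: Mary Mary Mary John John Lucy (one per line)
```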
Partially working solutions below. I'll keep them for reference, maybe they are helpful to somebody else.
If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1.
awk '
{ lines[$0]++ }
END {
    for (line in lines) {
        for (i = 1; i < lines[line]; i++) {
            print line
        }
    }
}
'
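For reference, here is that counting approach run against the sample data. Note that `for (line in lines)` iterates in an unspecified order, so the output is piped through `sort` below purely to make the check deterministic (the extra `sort` is not part of the answer above):

```shell
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | awk '
      { lines[$0]++ }                        # count occurrences of each line
      END {
        for (line in lines)                  # unspecified iteration order
          for (i = 1; i < lines[line]; i++)  # print count-1 copies
            print line
      }' \
  | sort
# prints (alphabetically): John John Lucy Mary Mary Mary
```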
Since you mentioned that the most frequent line must come first, you have to sort first:
sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;i++){$1="";print}}' | cut -c2-
Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See Is there a way to completely delete fields in awk, so that extra delimiters do not print?
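A quick run of this pipeline on the sample input: `uniq -c` prefixes each distinct line with its count, `sort -nr` puts the highest count first, and the awk stage prints each line count−1 times after blanking the count field (which is why the leading space has to be trimmed with `cut`):

```shell
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | sort | uniq -c | sort -nr \
  | awk '{count=$1; for(i=1;i<count;i++){$1=""; print}}' \
  | cut -c2-
# prints: Mary Mary Mary John John Lucy (most frequent first)
```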
CodePudding user response:
don't sort for no reason:

nawk '_[$-__]--'
gawk '__[$_]++'
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
For 1 GB files, you can speed things up a bit by preventing FS from splitting out unnecessary fields:

mawk2 '__[$_]++' FS='\n'
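The idiom relies on post-increment: the pattern `__[$0]++` evaluates to 0 (false) the first time a line is seen and to a non-zero value (true) on every later occurrence, so all copies after the first are printed. (In these one-liners `$_` with `_` uninitialized is just `$0`.) A check with a plain awk, using the explicit `$0` form:

```shell
# __[$0] is 0 on the first occurrence of a line (false, suppressed),
# then 1, 2, ... on later occurrences (true, printed).
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | awk '__[$0]++'
# prints: Mary Mary Mary John John Lucy
```

Unlike the `prev==$0` approach, this one does not need the duplicates to be on consecutive lines, at the cost of keeping every distinct line in memory.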