There is a file:
Mary
Mary
Mary
Mary
John
John
John
Lucy
Lucy
Mark
I need to get
Mary
Mary
Mary
John
John
Lucy
I cannot get the lines ordered by how many times each line is repeated in the file, i.e. the most frequently occurring lines must be listed first.
CodePudding user response:
If your file is already sorted (most-frequent words at top, repeated words only in consecutive lines) – your question makes it look like that's the case – you could reformulate your problem to: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:
awk 'prev==$0{print}{prev=$0}'
or if you prefer an approach that looks more familiar if coming from other programming languages:
awk '{if(prev==$0)print;prev=$0}'
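Applied to the sample input from the question, the one-liner does exactly what was asked (a quick check, assuming a POSIX awk):

```shell
# Compare each line to the previous one (prev): the first line of every
# run of identical lines is skipped, the remaining copies are printed.
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | awk 'prev==$0{print}{prev=$0}'
# prints: Mary Mary Mary John John Lucy (one per line)
```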
Partially working solutions below. I'll keep them for reference, maybe they are helpful to somebody else.
If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1.
awk '
{ lines[$0]++ }
END {
    for (line in lines) {
        for (i = 1; i < lines[line]; i++) {
            print line
        }
    }
}
'
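For reference, here is that counting approach run against the sample data. Note that `for (line in lines)` iterates in an unspecified order, so the output is piped through `sort` below purely to make the check deterministic (the extra `sort` is not part of the answer above):

```shell
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | awk '
      { lines[$0]++ }                        # count occurrences of each line
      END {
        for (line in lines)                  # unspecified iteration order
          for (i = 1; i < lines[line]; i++)  # print count-1 copies
            print line
      }' \
  | sort
# prints (alphabetically): John John Lucy Mary Mary Mary
```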
Since you mentioned that the most frequent line must come first, you have to sort first:
sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;i++){$1="";print}}' | cut -c2-
Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See Is there a way to completely delete fields in awk, so that extra delimiters do not print?
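A quick run of this pipeline on the sample input: `uniq -c` prefixes each distinct line with its count, `sort -nr` puts the highest count first, and the awk stage prints each line count−1 times after blanking the count field (which is why the leading space has to be trimmed with `cut`):

```shell
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | sort | uniq -c | sort -nr \
  | awk '{count=$1; for(i=1;i<count;i++){$1=""; print}}' \
  | cut -c2-
# prints: Mary Mary Mary John John Lucy (most frequent first)
```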
CodePudding user response:
don't sort for no reason:

nawk '_[$-__]--'
gawk '__[$_]++'
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
For 1 GB files, you can speed things up a bit by preventing FS from splitting out unnecessary fields:

mawk2 '__[$_]++' FS='\n'
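The idiom relies on post-increment: the pattern `__[$0]++` evaluates to 0 (false) the first time a line is seen and to a non-zero value (true) on every later occurrence, so all copies after the first are printed. (In these one-liners `$_` with `_` uninitialized is just `$0`.) A check with a plain awk, using the explicit `$0` form:

```shell
# __[$0] is 0 on the first occurrence of a line (false, suppressed),
# then 1, 2, ... on later occurrences (true, printed).
printf 'Mary\nMary\nMary\nMary\nJohn\nJohn\nJohn\nLucy\nLucy\nMark\n' \
  | awk '__[$0]++'
# prints: Mary Mary Mary John John Lucy
```

Unlike the `prev==$0` approach, this one does not need the duplicates to be on consecutive lines, at the cost of keeping every distinct line in memory.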