How to sort sets of lines like paragraphs in bash by the content of their first line?

I would like to sort various paragraphs in a file by alphabetical order according to the first line:

Hampton  
this is good  
(mind the mail)

Burlington  
I'm fine

Greater Yukonshire Fields  
(empty)

Those blocks of text might consist of one or more lines, but are separated by one or more blank lines.

Desired result:

Burlington 
I'm fine

Greater Yukonshire Fields 
(empty)


Hampton 
this is good 
(mind the mail)

CodePudding user response：

One GNU awk idea:

awk 'BEGIN { RS="" } 
           { a[FNR]=$0 }
     END   { PROCINFO["sorted_in"]="@val_str_asc"
             for (i in a)
                 print a[i] ORS
           }
' paragraphs

NOTE: requires GNU awk for PROCINFO["sorted_in"]

This generates:

Burlington
I'm fine

Greater Yukonshire Fields
(empty)

Hampton
this is good
(mind the mail)
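
If GNU awk is not available, the same collect-then-sort idea can be sketched with a hand-written insertion sort, which should run in any POSIX awk. This is only a sketch: the function name sort_paragraphs_posix is a hypothetical helper, and it sorts on the whole paragraph text (first line first), as the gawk version above does.

```shell
# Hypothetical helper: sort blank-line-separated paragraphs without gawk extensions.
sort_paragraphs_posix() {
  awk 'BEGIN { RS = "" }                  # paragraph mode: blank lines separate records
       { a[NR] = $0 }                     # collect every paragraph
       END {
         # Insertion sort; concatenating "" forces a string comparison.
         for (i = 2; i <= NR; i++) {
           v = a[i]
           for (j = i - 1; j >= 1 && (a[j] "") > (v ""); j--)
             a[j + 1] = a[j]
           a[j + 1] = v
         }
         for (i = 1; i <= NR; i++)
           printf "%s\n\n", a[i]          # paragraph followed by one blank line
       }' "$@"
}
```

Usage would be sort_paragraphs_posix paragraphs, or piping the text into it. Insertion sort is quadratic, so this is only reasonable for files with a modest number of paragraphs.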

CodePudding user response：

Would you please try msort, which will be available for most Linux distributions:

msort -bwq file

Output:

Burlington  
I'm fine

Greater Yukonshire Fields  
(empty)

Hampton  
this is good  
(mind the mail)

Options:

-b A record is terminated by two or more newlines
-w Sort on the entire text of the record
-q Be quiet - do not chat while working

CodePudding user response：

Using perl:

$ perl -00 -lne '
  push @paras, [ substr($_, 0, index($_, "\n")), $_ ];
  END {
    for my $para (sort { $a->[0] cmp $b->[0] } @paras) {
      print $para->[1]
    }
  }' input.txt
Burlington
I'm fine

Greater Yukonshire Fields
(empty)

Hampton
this is good
(mind the mail)

The -00 option reads in "paragraph mode" instead of lines, where multiple newlines separate a paragraph. For each paragraph, it extracts the first line and saves it and the paragraph in a list, and then after reading the entire file, sorts based on the first line and prints the paragraphs.

CodePudding user response：

Using GNU awk (asort is a gawk extension):

One way reading linewise:

awk '
  {if (NF) a[p]=(a[p] $0 ORS); else p++}           # Collect
  END {n=asort(a); for (i=1; i<=n; i++) print a[i]}   # Sort and Output (in index order)
' input.txt

Another way reading paragraphwise:

awk -v RS='\n{2,}' '
  {a[FNR]=$0}                                      # Collect
  END {n=asort(a); for (i=1; i<=n; i++) print a[i] ORS}   # Sort and Output (in index order)
' input.txt

Output

Burlington  
I'm fine

Greater Yukonshire Fields  
(empty)

Hampton  
this is good  
(mind the mail)

Both variants collect each paragraph as one array element. The array is then sorted and printed.

CodePudding user response：

An approach using ruby.

First initialize a counter i and a 2-dimensional array arr, then append the lines $_
If it finds an empty line increment the counter
Append a newline to the last paragraph (last line didn't have one)
Finally print the sorted array

% ruby -ne 'i ||= 0; arr ||= []; arr[i] ||= []; arr[i] << $_
            i += 1 if $_.length == 1
            END{ arr[i] << "" 
                 puts arr.sort }' file      
Burlington  
I'm fine

Greater Yukonshire Fields  
(empty)

Hampton  
this is good  
(mind the mail)

CodePudding user response：

Using any awk and sort, and assuming you don't have any \r characters in your data:

$ awk -v RS= -F'\n' -v OFS='\r' '{$1=$1}1' file |
    sort |
    awk -v ORS='\n\n' -F'\r' -v OFS='\n' '{$1=$1}1'
Burlington
I'm fine

Greater Yukonshire Fields
(empty)

Hampton
this is good
(mind the mail)

We're just joining lines of each paragraph together with the first awk, then sorting it, then breaking the lines apart again:

$ awk -v RS= -F'\n' -v OFS='\r' '{$1=$1}1' file | cat -Ev
Hampton  ^Mthis is good  ^M(mind the mail)$
Burlington  ^MI'm fine$
Greater Yukonshire Fields  ^M(empty)$

$ awk -v RS= -F'\n' -v OFS='\r' '{$1=$1}1' file | sort | cat -Ev
Burlington  ^MI'm fine$
Greater Yukonshire Fields  ^M(empty)$
Hampton  ^Mthis is good  ^M(mind the mail)$

The pipe to cat -Ev is just so you can see the otherwise invisible CR aka \r aka ^Ms.
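
The pipeline above sorts on the whole joined record. If only the first line should act as the key, one possible variation is to tell sort to treat the CR as a field delimiter and compare just the first field. This is a sketch under assumptions: the function name sort_by_first_line is hypothetical, the sort in use must accept a CR as the -t delimiter and support -s (stable sort; GNU and BSD sort do, but -s is not in POSIX), and LC_ALL=C is set to get plain byte-order comparison.

```shell
# Hypothetical helper: sort paragraphs by their first line only.
sort_by_first_line() {
  cr=$(printf '\r')                                    # a literal CR to use as delimiter
  awk -v RS= -F'\n' -v OFS='\r' '{$1=$1}1' "$@" |      # join each paragraph into one line
    LC_ALL=C sort -s -t "$cr" -k1,1 |                  # key = first field, ties keep input order
    awk -v ORS='\n\n' -F'\r' -v OFS='\n' '{$1=$1}1'    # split the lines apart again
}
```

With -s, paragraphs whose first lines are identical stay in their original relative order instead of being ordered by their remaining lines.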