How to sort and uniq groups of lines separated by a specific character using Bash

I found this a bit challenging. I want to treat each pair of lines as a unit and apply the equivalent of sort and uniq to those pairs, to see which distinct pairs occur.

My data looks like below:

seq300_fw
seq300_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq12_fw
seq12_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq140_fw_rc
seq140_rv

What I want to know is which distinct line pairs occur between the -- separators, and how many times each one occurs.

The desired output should be:

1
seq300_fw
seq300_rv_rc
--
2
seq140_fw
seq140_rv_rc
--
1
seq12_fw
seq12_rv_rc
--
1
seq140_fw_rc
seq140_rv

So far I have not been able to come up with a command line for this. Any help is appreciated.

CodePudding user response：

Using any awk in any shell on every Unix box:

$ cat tst.awk
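BEGIN { inSep="--" }
$0 == inSep {
    cnt[rec]++          # count the record that just ended
    rec = ""
    next
}
{ rec = rec ORS $0 }    # append the line; rec starts with a newline
END {
    cnt[rec]++          # count the final record (no trailing separator)
    outSep = ""
    for (rec in cnt) {  # note: "in" traversal order is unspecified
        print outSep cnt[rec] rec
        outSep = inSep ORS
    }
}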

$ awk -f tst.awk file
1
seq12_fw
seq12_rv_rc
--
1
seq300_fw
seq300_rv_rc
--
2
seq140_fw
seq140_rv_rc
--
1
seq140_fw_rc
seq140_rv
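The output above comes out in whatever order the `for (rec in cnt)` loop yields, while the desired output in the question lists each pair in order of first appearance. A minimal sketch of one way to keep that order, assuming a second array that remembers first-seen records; the function name `count_groups`, the helper `save`, and the temporary sample file are hypothetical names used only for illustration:

```shell
# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
seq300_fw
seq300_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq12_fw
seq12_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq140_fw_rc
seq140_rv
EOF

# count_groups FILE: count identical groups, keeping first-appearance order.
count_groups() {
    awk -v sep="--" '
        # Record the finished group; remember it the first time it is seen.
        function save() {
            if (rec == "") return            # ignore empty groups
            if (!(rec in cnt)) order[++n] = rec
            cnt[rec]++
            rec = ""
        }
        $0 == sep { save(); next }
        { rec = rec ORS $0 }                 # each group starts with a newline
        END {
            save()                           # the last group has no separator
            outSep = ""
            for (i = 1; i <= n; i++) {
                print outSep cnt[order[i]] order[i]
                outSep = sep ORS
            }
        }
    ' "$1"
}

count_groups "$sample"
```

On the sample data this should list seq300 first, then seq140 with a count of 2, then seq12, then the seq140_fw_rc pair, matching the order in the question.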

CodePudding user response：

Here is a working prototype (it needs GNU awk for asort); some formatting tweaks are needed...

$ awk -v RS='\n--\n' -v ORS='\n--\n' '!c[$0]++ {a[NR]=$0}
         END{n=asort(a); for(i=1;i<=n;i++) print c[a[i]] "\n" a[i]}' file


1
seq12_fw
seq12_rv_rc
--
2
seq140_fw
seq140_rv_rc
--
1
seq140_fw_rc
seq140_rv

--
1
seq300_fw
seq300_rv_rc
--

The blank line after seq140_rv is due to the missing last record separator in the input. The easiest fix is to append one to the input: instead of file, pass <(cat file; echo '--')
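If every group is exactly two lines with no whitespace inside them, as in the question's data, the same sorted counts can also be sketched with plain sort and uniq -c, which avoids the record separator issue altogether. This is only a sketch under that assumption; the function name `count_pairs_sorted` and the temporary sample file are hypothetical:

```shell
# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
seq300_fw
seq300_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq12_fw
seq12_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq140_fw_rc
seq140_rv
EOF

# count_pairs_sorted FILE: assumes exactly two lines per group.
count_pairs_sorted() {
    grep -v '^--$' "$1" |        # drop the separator lines
        paste - - |              # join each pair of lines with a tab
        LC_ALL=C sort |          # C locale keeps the order reproducible
        uniq -c |                # prefix each distinct pair with its count
        awk 'NR > 1 { print "--" }        # separator between groups
             { print $1; print $2; print $3 }'
}

count_pairs_sorted "$sample"
```

The groups come out sorted rather than in input order, and a group with more or fewer than two lines would shift the pairing, so this only fits data shaped like the example.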