I found this a bit challenging. I want to treat each pair of lines as one unit and apply something like sort and uniq to those pairs, to see which distinct pairs occur and how often.
My data looks like this:
seq300_fw
seq300_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq12_fw
seq12_rv_rc
--
seq140_fw
seq140_rv_rc
--
seq140_fw_rc
seq140_rv
What I want is to find the distinct pairs of lines, with -- as the separator between pairs, and to count how many times each pair occurs.
The desired output should be:
1
seq300_fw
seq300_rv_rc
--
2
seq140_fw
seq140_rv_rc
--
1
seq12_fw
seq12_rv_rc
--
1
seq140_fw_rc
seq140_rv
So far I have not been able to come up with a command line for this. Any help is appreciated.
CodePudding user response:
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { inSep = "--" }
$0 == inSep {
    cnt[rec]++      # count the block that just ended
    rec = ""
    next
}
{ rec = rec ORS $0 }
END {
    cnt[rec]++      # count the final block, which has no trailing separator
    outSep = ""
    for (rec in cnt) {
        print outSep cnt[rec] rec
        outSep = inSep ORS
    }
}
$ awk -f tst.awk file
1
seq12_fw
seq12_rv_rc
--
1
seq300_fw
seq300_rv_rc
--
2
seq140_fw
seq140_rv_rc
--
1
seq140_fw_rc
seq140_rv
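Note that for (rec in cnt) visits the array in an unspecified order, so the blocks may not come out in the order they first appear in the input. If that order matters, a minimal sketch along the following lines may help. It assumes only POSIX awk; count_pairs and store are hypothetical names chosen for illustration and are not part of the answer above.

```shell
# Sketch: count each block between "--" lines, keeping first-seen order.
# count_pairs is a hypothetical helper name; pass the input file as its argument.
count_pairs() {
    awk '
        # store() is a hypothetical helper: record the block collected so far.
        function store() {
            if (rec == "") return                  # skip empty blocks (e.g. trailing "--")
            if (!(rec in cnt)) order[++n] = rec    # remember order of first appearance
            cnt[rec]++
            rec = ""
        }
        $0 == "--" { store(); next }
        { rec = rec ORS $0 }                       # each block starts with a newline
        END {
            store()                                # the last block has no trailing "--"
            for (i = 1; i <= n; i++)
                printf "%s%d%s\n", (i > 1 ? "--" ORS : ""), cnt[order[i]], order[i]
        }
    ' "$@"
}
```

It would be called as count_pairs file. The extra order array costs one more copy of each distinct block in memory, which should be negligible for input like this.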
CodePudding user response:
Here is a working prototype (it needs GNU awk, since asort is a gawk extension); some formatting tweaks are still needed...
$ awk -v RS='\n--\n' -v ORS='\n--\n' '!c[$0]++ {a[NR]=$0}
END{n=asort(a); for(i=1;i<=n;i++) print c[a[i]] "\n" a[i]}' file
1
seq12_fw
seq12_rv_rc
--
2
seq140_fw
seq140_rv_rc
--
1
seq140_fw_rc
seq140_rv
--
1
seq300_fw
seq300_rv_rc
--
The formatting glitch is due to the missing record separator after the last block. The easiest fix is to append one to the input, so instead of file
use <(cat file; echo '--') (process substitution, so this needs bash or a similar shell).
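Putting the pieces together, the command could be sketched roughly as below. This assumes GNU awk is installed as gawk (asort and a multi-character RS are gawk extensions). count_sorted is a hypothetical wrapper name, a plain pipe stands in for the process substitution so it also runs under sh, and printf replaces the ORS setting so that no separator is printed after the last block.

```shell
# Sketch assuming GNU awk (gawk). count_sorted is a hypothetical helper name;
# pass the input file as its first argument.
count_sorted() {
    # Append a final "--" so the last block ends with the same separator as the others.
    { cat "$1"; echo '--'; } |
    gawk -v RS='\n--\n' '
        !c[$0]++ { a[NR] = $0 }      # count every block; store each distinct block once
        END {
            n = asort(a)             # sort the distinct blocks by value
            for (i = 1; i <= n; i++)
                printf "%s%s\n%s\n", (i > 1 ? "--\n" : ""), c[a[i]], a[i]
        }
    '
}
```

It would be called as count_sorted file. Unlike the first answer, this prints the blocks in sorted order rather than in hash order.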