Hope you're well. I have found questions that come close to what I'm asking but none exactly. I've been struggling on this for the last 2 weeks and have finally managed to make a bit of progress enough that I feel I can justify asking on here!
I have downloaded a HitTable from a sequence I BLASTED
accno percent seqstart seqend
AC020656|33 84.713 116580 116735
AC020656|33 90.303 118279 118443
AC020656|33 87.654 120390 121470
AC020656|33 82.609 121323 121390
AC123694|11 77.622 158333 158474
AC123694|11 84.559 158238 160142
What I am trying to do is find out is which of these hits for each entry (accession number) are overlapping and which are not so I can carry on with the rest of my pipeline. I'm aiming to put the IDs (or whole row, I'm not fussed at this point!) into separate files and then use them to extract the corresponding FASTA files so I can apply the more appropriate programme depending on if it's overlapping or not.
I have felt the best way to go about it is using awk
to:
- CHECK that the IDs ($1) are the same (no use checking together AC020656|33 line 4 and AC123694|11 line 4!)
- COMPARE if the seq end ($4) is smaller than the seq start ($3) of the next row. If so, print it in a file called "nonover.txt", else print to "overlap.txt"
I began an attempt at this answer by modifying code found HERE:
awk '($1==c1 && $3==c3 && $4==c4){print line RS $0}{line=$0;c1=$1;c4<$3}' mydata.txt
But, no surprise, it doesn't work as I am clearly missing something. The OP in the answer I linked very kindly explained it in such a way that I felt confident tweaking it but well, thats my incompetence!
I've also tried to apply my problem using code found HERE and HERE which I've felt have come close to what I'm trying to do. I've also checked out the manual for awk and while it did help a bit (I tried using the getline
function but just kept getting smacked with errors), I just honestly don't think I am well versed enough to tackle this issue at the minute.
And as pointed out by markp-fuso, my intended output for the above would ideally be two files with the following data inside:
noverlap.txt (as each rows seqend is smaller than the next rows seqstart, therefore it's not overlapped)
accno percent seqstart seqend
AC020656|33 84.713 116580 116735
AC020656|33 90.303 118279 118443
AC020656|33 87.654 120390 121470
overlap.txt (as each rows seqend is larger than the next rows seqstart, and it is overlapping)
accno percent seqstart seqend
AC020656|33 87.654 120390 121470
AC020656|33 82.609 121323 121390
AC123694|11 77.622 158333 158474
AC123694|11 84.559 158238 160142
As pointed out by Ed Morton, how should acc.no be treated if some entries overlap but others don't - it's fine for them to be split up with some entries in both noverlap.txt and overlap.txt. I will be checking for any identical acc.no between the two folders, will treat the overlap first to then add to the nonoverlapping entries and will carry on from there. Duplicates are fine here (see AC020656|33 87.654 120390 121470
in both txt files), I know how I'm treating these, it's just so I can confirm the method to use on my real data.
TL;DR: Using grouping based on id (Acc. no), can I compare data from a column with data in a different column and the row below? Happy with either loop, script or one/two line answers suitable for a OS user
Thank you kindly in advance, any advice is very welcome and I appreciate you taking the time out to read/answer my question.
UPDATE: Thanks to the wonderful Ed Morton for his solution that has done the trick perfectly. I am just adding what I am doing to remove the single, duplicated entries found in the non-overlapping txt file (but found where they should be in overlap) which is modifying the code found at this answer HERE
CodePudding user response:
This will produce the provided expected output from the provided sample input:
$ cat tst.awk
{ sub(/\r$/,"") }
NR == 1 { hdr = $0 }
NR > 2 { prt() }
{ prev = $0 }
function prt( over, noover, p, out) {
over = "overlap.txt"
noover = "noover.txt"
if ( !doneHdr ) {
print hdr > over
print hdr > noover
}
split(prev,p)
if ( ($1 == p[1]) && ($3 <= p[4]) ) {
print prev > over
print $0 > over
print "" > over
}
else {
print prev > noover
}
}
$ awk -f tst.awk file
$ head *over*
==> noover.txt <==
accno percent seqstart seqend
AC020656|33 84.713 116580 116735
AC020656|33 90.303 118279 118443
AC020656|33 82.609 121323 121390
==> overlap.txt <==
accno percent seqstart seqend
AC020656|33 87.654 120390 121470
AC020656|33 82.609 121323 121390
AC123694|11 77.622 158333 158474
AC123694|11 84.559 158238 160142
If that's not all you need then edit your question to provide more truly representative sample input/output that includes cases that the above doesn't work for.
Note that the above requires at least 2 data lines in addition to the header to be present in the input. If there was only 1 data line it wouldn't be printed. If that's an issue add some logic to print the hdr
and prev
in an END section if NR
is less than 3 or similar.