I have a peculiar problem. A text file contains the following three lines,
>chain A
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain B
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain C
MGRDQRAVAGPALRRWLLLGTVTVGFLAQSVLAGVKKFDVPCGGRDCSGGCQCYPEKGGRGQPGPVGPQGYNGPPGLQGFPGLQGRKGDKGERGAPGVTGPKGDVGARGVSGFPGADGIPGHPGQGGPRGRPGYDGCNGTQGDSGPQGPPGSEGFTGPPGPQGPKGQKGEP-YALPKEERDRYRGEPGEPGLVGFQGPPGRPGHVGQMGPVGAPGRPGPPGPPGPKGQQGNRGLGFYGVKGEKGDVGQPGPNGIPSDTLHPIIAPTGVTFHPDQYKGEKGSEGEPGIRGISLKGEEGIMGFPGLRGYPGLSGEKGSPGQKGSRGLDGYQGPDGPRGPKGEAGDPGPPGLP--AYSPHPSLAKGARGDPGFPGAQGEPGSQGEPGDPGLPGPPGLSIGDGDQRRGLPGEMGPKGFIGDPGIPALYGGPPGPDGKRGPPGPPGLPGPPGPDGFL-FGLKGAKGRAGFPGLPGSPGARGPKGWKGDAGECRCTEGDEAIKGLPGLPGPKGFAGINGEPGRKGDRGDPGQHGLPGFPGLKGVPGNIGAPGPKGAKGDS-RTITTKGERGQPGVPGVPGMKGDDGSPGRDGLDGFPGLPGPPGD-GIKGPPGDPGYPGIPGTKGTPGEMGPPGLGLPGLKGQRGFPGDAGLPGPPGFLGPPGPAGTPGQIDCDTDVKRAVGGDRQEAIQPGCIGGPKGLPGLPGPPGPTGAKGLRGIPGFAGADGGPGPRGLPGDAGREGFPGPPGFIGPRGSKGAVGLPGPDGSPGPIGLPGPDGPPGERGLPGEVLGAQPGPRGDAGVPGQPGLKGLPGDRGPPGFRGSQGMPGMPGLKGQPGLPGPSGQPGLYGPPGLHGFPGAPGQEGPLGLPGIPGREGLPGDRGDPGDTGAPGPVGMKGLSGDRGDAGFTGEQGHPGSPGFKGIDGMPGTPGLKGDRGSPGMDGFQGMPGLKGRPGFPGSKGEAGFFGIPGLKGLAGEPGFKGSRGDPGPPGPP-PVILPGMKDIKGEKGDEGPMGLKGYLGAKGIQGMPGIPGLSGIPGLPGRPGHIKGVKGDIGVPGIPGLPGFPGVAGPPGITGFPGFIGSRGDKGAPGRAGLYGEIGATGDFGDIGDT-INLPGRPGLKGERGTTGIPGLKGFFGEKGTEGDIGFPGITGVTGVQGPPGLKGQTGFPGLTGPPGSQGELGRIGLPGGKGDDGWPGAPGLPGFPGLRGIRGLHGLPGTKGFPGSPGSDIHGDPGFPGPPGERGDPGEANTLPGPVGVPGQKGDQGAPGERGPPGSPGLQGFPGITPPSNISGAPGDKGAPGIFGLKGYRGPPGPPGSAALPGSKGDTGNPGAPGTPGTKGWAGDSGPQGRPGVFGLPGEKGPRGEQGFMGNTGPTGAVGDRGPKGPKGDPGFPGAPGTVGAPGIAGIPQKIAVQPGTVGPQGRRGPPGAPGEMGPQGPPGEPGFRGAPGKAGPQGRGGVSAVPGFRGDEGPIGHQGPIGQEGAPGRPGSPGLPGMPGR-SVSIGYLLVKHSQTDQEPMCPVGMNKLWSGYSLLYFEGQEKAHNQDLGLAGSCLARFSTMPFLYCNPGDVCYYASRNDKSYWLSTTAPLP--MMPVAEDEIKPYISRCSVCEAPAIAIAVHSQDVSIPHCPAGWRSLWIGYSFLMHTAAGDEGGGQSLVSPGSCLEDFRATPFIECNGGRGTCHYYANKYSFWLTTIPEQSFQGSPSADTLKAGLIRTHISRCQVCMKNL
Now, I want to find out the character position of all R and D/E in the three chains that satisfy the following relationship
Ri (chain A) - Di 2 (chain B)
Ri (chain B) - Di 2 (chain C)
Ri (chain C) - Di 5 (chain A)
Explanation: Iterate over every ith R in chain A and check if the i 2 position of chain B contains D or E. If yes, output the character positions of every such R and D/E pair. Do the same with chains B C and chains C A.
I tried to the following:
IFS=$'\n' read -d '' -r -a lines <file.txt
echo "${lines[1]}" | awk '{for(i=1;i<=length($0);i ) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[3]}" | awk '{for(i=1;i<=length($0);i ) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[5]}" | awk '{for(i=1;i<=length($0);i ) {if (substr($0,i,1)=="R") {print i}}}'
but this will give positions of R or E in the lines but not constrained by the relationship.
CodePudding user response:
this can be optimized but I think works... Prints the chains compared, the position of the first match and the matched chars. Assumes chains are the same length and doesn't check for bounds. Iterates each sequence once and compares with the other two for offset match.
Note that A and B sequences are the same, so for C-A and C-B comparisons you'll get the same results.
$ awk 'function charAt(_d, _i) {return substr(_d,_i,1)}
NR%2 {chain[int(NR/2) 1]=$2; next}
{d[NR/2]=$0}
END {nc=NR/2;
for(i=1;i<=nc;i )
for(j=1;j<=length(d[i]);j ) {
os=j (chain[i]=="C"?5:2);
if( (c1=charAt(d[i],j))=="R") {
if( (c2=charAt(d[k=i%nc 1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
if( (c2=charAt(d[k=(i 1)%nc 1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
}}}' file
A-C 187 R E
A-C 365 R D
A-B 374 R E
A-C 374 R E
A-B 409 R E
A-C 415 R D
A-C 521 R D
A-C 606 R D
A-B 618 R D
A-B 829 R E
A-B 967 R D
A-C 967 R E
A-B 1018 R D
A-B 1114 R E
A-C 1114 R E
A-C 1224 R D
A-B 1569 R D
A-C 1569 R D
A-B 1692 R E
B-C 187 R E
B-C 365 R D
B-C 374 R E
B-A 374 R E
B-A 409 R E
B-C 415 R D
B-C 521 R D
B-C 606 R D
B-A 618 R D
B-A 829 R E
B-C 967 R E
B-A 967 R D
B-A 1018 R D
B-C 1114 R E
B-A 1114 R E
B-C 1224 R D
B-C 1569 R D
B-A 1569 R D
B-A 1692 R E
C-A 335 R E
C-B 335 R E
C-A 403 R E
C-B 403 R E
C-A 475 R E
C-B 475 R E
C-A 746 R D
C-B 746 R D
C-A 1236 R E
C-B 1236 R E
C-A 1600 R E
C-B 1600 R E