find characters present in two different lines if they satisfy a positional relationship

Time:10-08

I have a peculiar problem. A text file contains the following three lines,

>chain A
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain B
---------MGPRLSVWLLLLPAALLLHEEHSRAAA--KGGCAGSGC-GKCDCHGVKGQKGERGLPGLQGVIGFPGMQGPEGPQGPPGQKGDTGEPGLPGTKGTRGPPGASGYPGNPGLPGIPGQDGPPGPPGIPGCNGTKGERGPLGPPGLPGFAGNPGPPGLPGMKGDPGEILGHVPGMLLKGERGFPGIPGTPGPPGLPGLQGPVGPPGFTGPPGPPGPPGPPGEKGQMGLSFQGPKGDKGDQGVSGPPGVPGQA-------QVQEKGDFATKGEKGQKGEPGFQGMPGVGEKGEPGKPGPRGKPGKDGDKGEKGSPGFPGEPGYPGLIGRQGPQGEKGEAGPPGPPGIVIGTGPLGEKGERGYPGTPGPRGEPGPKGFPGLPGQPGPPGLPVPGQAGAPGFPGERGEKGDRGFPGTS-LP-GPSGRDGLPGPPGSPGPPGQPGYTNGIVECQPGPPGDQGPPGIPGQPGFIGEIGEKGQKGESCLICDIDGYRGPPGPQGPPGEIGFPGQPGAKGDRGLPGRDGVAGVPGPQGTPGLIGQPGAKGEPGEFYFDLRLKGDKGDPGFPGQPGMPGRAGSPGRDGHPGLPGPKGSPGSVGLKGERGPPGGVGFPGSRGDTGPPGPPGY---GPAGPIGDKGQAGFPGGPGSPGLPGPKGEPGKIVP--------------------LPGPPGAEGLPGSPGFPGPQGDRGFPGTPGRPGLPGEKGAVGQPGI-GFPGPPGPKGVDGLPGDMGPPGTPGRPGFNGLPGNPGVQGQKGEP---GVGLPGLKGLPGLPGIPGTPGEKGSIGVPGVPGEHGAIGPPGLQGIRGEPGPPGLPGSVGSPGVPGI-GPPGARGPPGGQGPPGLSGPPGIKGEKGFPGFPGLD-MPGPKGDKGAQGLPGITGQSGLPGLPGQQGAPGIPGFPGSKGEMGVMGTPGQPGSPGPVGAPGLPGEKGDHGFPGSSGPRGDPGLKGDKGDVGLPGKPGSMDKVDMGSMKGQKGDQGEKGQIGPIGEKGSRGDPGTPGVPGKDGQAGQPGQP-GPKGDPGISGTPGAPGLPGPKGSVGGMGLPGTPGEKGVPGIPGPQGSPGLPGDKGAKGEKGQAGPPGIGIPGLRGEKGDQGIAGFPGSPGEKGEKGSIGIPGMPGSPGLKGSPGSVGYPGSPGLPGEKGDKGLPGLDGIPGVKGEAGLPGTPGPTGPAGQKGEPGSDGIPGSAGEKGEPGLPGRGFPGFPGAKGDKGSKGEVGFP-GLAGSPGIPGSKGEQGFMGPPGPQGQPGLPGSPGHA-TEGPKGDRGPQGQPGLPGLPGPMGPPGLPGIDGVKGDKGNPGWPGAPGVPGPKGDPGFQGMPGIGGSPGITGSKGDMGPPGVPGFQGPKGLPGLQGIKGDQGDQGVPGAKGLPGPPGPPGPYDIIKGEPGLPGPEGPPGLKGLQGLPGPKGQQGVTGLVGIPGPPGIPGFDGAPGQKGEMGPAGPTGPRGFPGPPGPDGLPGSMGPPGTPSVDHGFLVTRHSQTIDDPQCPSGTKILYHGYSLLYVQGNERAHGQDLGTAGSCLRKFSTMPFLFCNINNVCNFASRNDYSYWLSTPEPMPMSMAPITGENIRPFISRCAVCEAPAMVMAVHSQTIQIPPCPSGWSSLWIGYSFVMHTSAGAEGSGQALASPGSCLEEFRSAPFIECHG-RGTCNYYANAYSFWLATIERSEMFKKPTPSTLKAGELRTHVSRCQVCMRRT
>chain C
MGRDQRAVAGPALRRWLLLGTVTVGFLAQSVLAGVKKFDVPCGGRDCSGGCQCYPEKGGRGQPGPVGPQGYNGPPGLQGFPGLQGRKGDKGERGAPGVTGPKGDVGARGVSGFPGADGIPGHPGQGGPRGRPGYDGCNGTQGDSGPQGPPGSEGFTGPPGPQGPKGQKGEP-YALPKEERDRYRGEPGEPGLVGFQGPPGRPGHVGQMGPVGAPGRPGPPGPPGPKGQQGNRGLGFYGVKGEKGDVGQPGPNGIPSDTLHPIIAPTGVTFHPDQYKGEKGSEGEPGIRGISLKGEEGIMGFPGLRGYPGLSGEKGSPGQKGSRGLDGYQGPDGPRGPKGEAGDPGPPGLP--AYSPHPSLAKGARGDPGFPGAQGEPGSQGEPGDPGLPGPPGLSIGDGDQRRGLPGEMGPKGFIGDPGIPALYGGPPGPDGKRGPPGPPGLPGPPGPDGFL-FGLKGAKGRAGFPGLPGSPGARGPKGWKGDAGECRCTEGDEAIKGLPGLPGPKGFAGINGEPGRKGDRGDPGQHGLPGFPGLKGVPGNIGAPGPKGAKGDS-RTITTKGERGQPGVPGVPGMKGDDGSPGRDGLDGFPGLPGPPGD-GIKGPPGDPGYPGIPGTKGTPGEMGPPGLGLPGLKGQRGFPGDAGLPGPPGFLGPPGPAGTPGQIDCDTDVKRAVGGDRQEAIQPGCIGGPKGLPGLPGPPGPTGAKGLRGIPGFAGADGGPGPRGLPGDAGREGFPGPPGFIGPRGSKGAVGLPGPDGSPGPIGLPGPDGPPGERGLPGEVLGAQPGPRGDAGVPGQPGLKGLPGDRGPPGFRGSQGMPGMPGLKGQPGLPGPSGQPGLYGPPGLHGFPGAPGQEGPLGLPGIPGREGLPGDRGDPGDTGAPGPVGMKGLSGDRGDAGFTGEQGHPGSPGFKGIDGMPGTPGLKGDRGSPGMDGFQGMPGLKGRPGFPGSKGEAGFFGIPGLKGLAGEPGFKGSRGDPGPPGPP-PVILPGMKDIKGEKGDEGPMGLKGYLGAKGIQGMPGIPGLSGIPGLPGRPGHIKGVKGDIGVPGIPGLPGFPGVAGPPGITGFPGFIGSRGDKGAPGRAGLYGEIGATGDFGDIGDT-INLPGRPGLKGERGTTGIPGLKGFFGEKGTEGDIGFPGITGVTGVQGPPGLKGQTGFPGLTGPPGSQGELGRIGLPGGKGDDGWPGAPGLPGFPGLRGIRGLHGLPGTKGFPGSPGSDIHGDPGFPGPPGERGDPGEANTLPGPVGVPGQKGDQGAPGERGPPGSPGLQGFPGITPPSNISGAPGDKGAPGIFGLKGYRGPPGPPGSAALPGSKGDTGNPGAPGTPGTKGWAGDSGPQGRPGVFGLPGEKGPRGEQGFMGNTGPTGAVGDRGPKGPKGDPGFPGAPGTVGAPGIAGIPQKIAVQPGTVGPQGRRGPPGAPGEMGPQGPPGEPGFRGAPGKAGPQGRGGVSAVPGFRGDEGPIGHQGPIGQEGAPGRPGSPGLPGMPGR-SVSIGYLLVKHSQTDQEPMCPVGMNKLWSGYSLLYFEGQEKAHNQDLGLAGSCLARFSTMPFLYCNPGDVCYYASRNDKSYWLSTTAPLP--MMPVAEDEIKPYISRCSVCEAPAIAIAVHSQDVSIPHCPAGWRSLWIGYSFLMHTAAGDEGGGQSLVSPGSCLEDFRATPFIECNGGRGTCHYYANKYSFWLTTIPEQSFQGSPSADTLKAGLIRTHISRCQVCMKNL

Now, I want to find out the character position of all R and D/E in the three chains that satisfy the following relationship

Ri (chain A) - Di+2 (chain B)
Ri (chain B) - Di+2 (chain C)
Ri (chain C) - Di+5 (chain A)

Explanation: Iterate over every R in chain A, at position i, and check whether position i+2 of chain B contains D or E. If yes, output the character positions of every such R and D/E pair. Do the same for chains B and C (offset 2) and for chains C and A (offset 5).

I tried the following:

IFS=$'\n' read -d '' -r -a lines <file.txt

echo "${lines[1]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[3]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'
echo "${lines[5]}" | awk '{for(i=1;i<=length($0);i++) {if (substr($0,i,1)=="R") {print i}}}'

But this only gives the positions of R (or, with the letter changed, of E) in each line; it does not apply the positional relationship between chains.

Answer:

This can be optimized, but I think it works. It prints the two chains compared, the position of the matching R, and the two matched characters. It assumes the chains are the same length and does not check bounds. It iterates over each sequence once and compares it with the other two for an offset match.

Note that A and B sequences are the same, so for C-A and C-B comparisons you'll get the same results.

$ awk 'function charAt(_d, _i) {return substr(_d,_i,1)}

     # odd lines are ">chain X" headers, even lines are the sequences
     NR%2 {chain[int(NR/2)+1]=$2; next}
          {d[NR/2]=$0}

     END  {nc=NR/2;
           for(i=1;i<=nc;i++)
             for(j=1;j<=length(d[i]);j++) {
               # offset is 5 when the R is in chain C, otherwise 2
               os=j+(chain[i]=="C"?5:2);
               if( (c1=charAt(d[i],j))=="R") {
                   # compare against the next chain, then the one after it
                   if( (c2=charAt(d[k=i%nc+1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
                   if( (c2=charAt(d[k=(i+1)%nc+1],os))=="D" || c2=="E") print chain[i]"-"chain[k],j,c1,c2;
            }}}' file
A-C 187 R E
A-C 365 R D
A-B 374 R E
A-C 374 R E
A-B 409 R E
A-C 415 R D
A-C 521 R D
A-C 606 R D
A-B 618 R D
A-B 829 R E
A-B 967 R D
A-C 967 R E
A-B 1018 R D
A-B 1114 R E
A-C 1114 R E
A-C 1224 R D
A-B 1569 R D
A-C 1569 R D
A-B 1692 R E
B-C 187 R E
B-C 365 R D
B-C 374 R E
B-A 374 R E
B-A 409 R E
B-C 415 R D
B-C 521 R D
B-C 606 R D
B-A 618 R D
B-A 829 R E
B-C 967 R E
B-A 967 R D
B-A 1018 R D
B-C 1114 R E
B-A 1114 R E
B-C 1224 R D
B-C 1569 R D
B-A 1569 R D
B-A 1692 R E
C-A 335 R E
C-B 335 R E
C-A 403 R E
C-B 403 R E
C-A 475 R E
C-B 475 R E
C-A 746 R D
C-B 746 R D
C-A 1236 R E
C-B 1236 R E
C-A 1600 R E
C-B 1600 R E
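
The output above includes every chain pairing, while the question asks only for A to B (+2), B to C (+2) and C to A (+5). A minimal sketch of that narrower variant follows; it is not the answer's original code. The function name `find_pairs` and the helper `scan` are hypothetical, the input is assumed to alternate `>chain X` header lines and sequence lines as in the question, and positions are assumed to count gap characters (`-`) like any other character. Unlike the command above, it stops before the offset runs past the end of the target chain, and it also prints the position of the D/E.

```shell
# Sketch only: find_pairs and scan are illustrative names, not from the original answer.
find_pairs() {
  awk '
    # Print every R at position i in chain "from" whose partner at
    # position i+off in chain "to" is D or E. Stops at the shorter bound.
    function scan(from, to, off,    i, c, lf, lt) {
      lf = length(seq[from]); lt = length(seq[to])
      for (i = 1; i <= lf && i + off <= lt; i++) {
        if (substr(seq[from], i, 1) != "R") continue
        c = substr(seq[to], i + off, 1)
        if (c == "D" || c == "E") print from "-" to, i, "R", i + off, c
      }
    }
    /^>/ { name = $2; next }      # header line such as ">chain A"
    NF   { seq[name] = $0 }       # sequence line for the current chain
    END  { scan("A", "B", 2); scan("B", "C", 2); scan("C", "A", 5) }
  ' "$1"
}
```

Called as `find_pairs file`, each output line has the form `A-B <R position> R <D/E position> <D or E>`.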