I am using grep to search file1 for the patterns listed in file2. file2 contains duplicated lines, but each matching line from file1 is printed only once. How can I keep the duplicated lines using grep?
file1.txt
2L FlyBase mRNA 11009821 11011086 . - . ID=transcript:FBtr0080162;Parent=gene:FBgn0032329;Name=Art8-RA;biotype=protein_coding;transcript_id=FBtr0080162
2L FlyBase ncRNA 11011162 11012135 . - . ID=transcript:FBtr0346761;Parent=gene:FBgn0267425;Name=asRNA:CR45778-RA;biotype=ncRNA;transcript_id=FBtr0346761
2L FlyBase mRNA 11011312 11012135 . . ID=transcript:FBtr0080156;Parent=gene:FBgn0250837;Name=dUTPase-RB;biotype=protein_coding;transcript_id=FBtr0080156
2L FlyBase mRNA 11011312 11012135 . . ID=transcript:FBtr0331195;Parent=gene:FBgn0250837;Name=dUTPase-RC;biotype=protein_coding;transcript_id=FBtr0331195
2L FlyBase mRNA 11011312 11012135 . . ID=transcript:FBtr0080157;Parent=gene:FBgn0250837;Name=dUTPase-RA;biotype=protein_coding;transcript_id=FBtr0080157
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100
file2.txt
FBtr0306536
FBtr0078100
FBtr0306536
FBtr0078100
My code:
grep 'ID=transcript:' file1.txt | grep -w -f file2.txt
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100
However, I would like to get this result:
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100
CodePudding user response:
Assuming this is a bash shell, if another tool such as awk is an option, it may provide an easier solution:
$ awk -F'[:;]' 'NR==FNR {array[$2]=$0; next} $0 in array {print array[$0]}' file1.txt file2.txt
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100
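By using two delimiters, : and ;, you can isolate the transcript ID into field 2 of each file1 line, store the line in an array keyed by that ID, and then print the stored line for each ID read from the second file. Because the lookup runs once per line of the second file, repeated IDs give repeated output.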
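For reference, here is a self-contained sketch of the same idea that you can paste into a terminal. The sample files are shortened, made-up stand-ins for file1.txt and file2.txt, the temporary directory and the array name line_by_id are my own choices, and I am assuming each transcript ID appears on only one line of file1.txt.

```shell
#!/usr/bin/env bash
# Shortened sample data in a temporary directory (hypothetical stand-ins
# for the real file1.txt and file2.txt).
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1.txt" <<'EOF'
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779
2L FlyBase mRNA 11009821 11011086 . - . ID=transcript:FBtr0080162;Parent=gene:FBgn0032329
EOF
# The ID list contains duplicates and one blank line.
printf '%s\n' FBtr0306536 FBtr0078100 '' FBtr0306536 FBtr0078100 > "$tmpdir/file2.txt"

result=$(awk -F'[:;]' '
    # First file: remember each line under its transcript ID (field 2).
    NR == FNR { line_by_id[$2] = $0; next }
    # Second file: print the stored line once per listed ID;
    # blank lines and IDs missing from the first file are skipped.
    $0 in line_by_id { print line_by_id[$0] }
' "$tmpdir/file1.txt" "$tmpdir/file2.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Each file is read only once, so this should scale better than running grep once per ID.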
CodePudding user response:
How about using a loop?
# Run grep once per line of file2.txt, so repeated IDs produce repeated output
while IFS= read -r line
do
    grep -w -- "$line" file1.txt
done < file2.txt
This should work as long as file2.txt has no empty lines between the FBtr IDs (the file in your question appears to have some), since an empty line would be passed to grep as an empty pattern.
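If the ID list may contain blank lines, a variant along these lines could skip them. This is only a sketch: the sample data is shortened and made up, the temporary directory and the variable name id are my own, and I have added -F so that the IDs are treated as fixed strings rather than regular expressions.

```shell
#!/usr/bin/env bash
# Shortened sample data in a temporary directory (hypothetical stand-ins
# for the real file1.txt and file2.txt).
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1.txt" <<'EOF'
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779
2L FlyBase mRNA 11009821 11011086 . - . ID=transcript:FBtr0080162;Parent=gene:FBgn0032329
EOF
# The ID list contains duplicates and one blank line.
printf '%s\n' FBtr0306536 FBtr0078100 '' FBtr0306536 FBtr0078100 > "$tmpdir/file2.txt"

result=$(
    while IFS= read -r id; do
        # Skip blank lines so that no empty pattern reaches grep.
        [ -n "$id" ] || continue
        # -w: whole-word match, -F: fixed string, --: end of options.
        grep -wF -- "$id" "$tmpdir/file1.txt"
    done < "$tmpdir/file2.txt"
)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Keep in mind that this starts one grep process per ID, so it can become slow when file2.txt is long.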