Keep duplicated lines using grep

Time:11-13

I am using grep to search file1 for patterns listed in file2. file2 contains some duplicated lines, but each matching line from file1 is printed only once, however many times its pattern appears in file2. How can I keep the duplicated lines using grep?

file1.txt

2L  FlyBase mRNA    11009821    11011086    .   -   .   ID=transcript:FBtr0080162;Parent=gene:FBgn0032329;Name=Art8-RA;biotype=protein_coding;transcript_id=FBtr0080162

2L  FlyBase ncRNA   11011162    11012135    .   -   .   ID=transcript:FBtr0346761;Parent=gene:FBgn0267425;Name=asRNA:CR45778-RA;biotype=ncRNA;transcript_id=FBtr0346761

2L  FlyBase mRNA    11011312    11012135    .       .   ID=transcript:FBtr0080156;Parent=gene:FBgn0250837;Name=dUTPase-RB;biotype=protein_coding;transcript_id=FBtr0080156

2L  FlyBase mRNA    11011312    11012135    .       .   ID=transcript:FBtr0331195;Parent=gene:FBgn0250837;Name=dUTPase-RC;biotype=protein_coding;transcript_id=FBtr0331195

2L  FlyBase mRNA    11011312    11012135    .       .   ID=transcript:FBtr0080157;Parent=gene:FBgn0250837;Name=dUTPase-RA;biotype=protein_coding;transcript_id=FBtr0080157

2L  FlyBase mRNA    67043   71081   .       .   ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536

2L  FlyBase mRNA    67043   71390   .       .   ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100

file2.txt

FBtr0306536

FBtr0078100

FBtr0306536

FBtr0078100

My code: grep 'ID=transcript:' file1.txt | grep -w -f file2.txt

2L  FlyBase mRNA    67043   71081   .       .   ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536

2L  FlyBase mRNA    67043   71390   .       .   ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100

However, this is the result I would like to get:

2L  FlyBase mRNA    67043   71081   .       .   ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536

2L  FlyBase mRNA    67043   71390   .       .   ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100

2L  FlyBase mRNA    67043   71081   .       .   ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536

2L  FlyBase mRNA    67043   71390   .       .   ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100

CodePudding user response:

Assuming this is a bash shell, and if another tool such as awk is an option, it may provide an easier solution:

$ awk -F"[:;]" 'NR==FNR{array[$2]=$0; next} {print array[$0]}' file1 file2
2L FlyBase mRNA 67043 71081 .   . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536

2L FlyBase mRNA 67043 71390 .   . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100

2L FlyBase mRNA 67043 71081 .   . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779;Name=dbr-RC;biotype=protein_coding;transcript_id=FBtr0306536

2L FlyBase mRNA 67043 71390 .   . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779;Name=dbr-RB;biotype=protein_coding;transcript_id=FBtr0078100

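By using the two delimiters : and ;, the transcript ID is isolated in column 2 of each file1 line. The first pass stores every file1 line in an array keyed by that ID; the second pass prints the stored line once for each ID read from file2, so duplicates in file2 produce duplicated output.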
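
The one-liner above prints an empty line for every blank line in file2 and for every ID that is absent from file1. A minimal sketch of a variant that skips both cases, assuming the same ID=transcript: layout; the scratch directory workdir, the array name row, and the shortened sample records are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (attributes shortened).
workdir=$(mktemp -d)

cat > "$workdir/file1.txt" <<'EOF'
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779
EOF

# file2 deliberately contains a blank line and an ID that is not in file1.
cat > "$workdir/file2.txt" <<'EOF'
FBtr0306536

FBtr0078100
FBtr9999999
FBtr0306536
FBtr0078100
EOF

result=$(awk -F'[:;]' '
    # First file: remember each record under its transcript ID (field 2).
    NR == FNR { if ($2 != "") row[$2] = $0; next }
    # Second file: print the stored record once per occurrence of a known ID.
    $0 in row { print row[$0] }
' "$workdir/file1.txt" "$workdir/file2.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because the lookup is an exact match on the whole file2 line, trailing spaces or Windows line endings in file2 would stop an ID from matching.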

CodePudding user response:

How about doing it with a loop?

# Read file2.txt line by line; IFS= and -r keep each line exactly as written
while IFS= read -r line
do
 grep -- "$line" file1.txt
done < file2.txt

This should work as long as file2.txt has no empty lines between the FBtr IDs (in your question, there are empty lines). An empty pattern matches every line, so each blank line would print all of file1.txt.
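
If the empty lines really are in file2.txt, the loop can skip them itself. A minimal sketch under that assumption, also anchoring each ID to the ID=transcript: field and treating it as a fixed string; the scratch directory workdir, the variable name id, and the shortened sample records are made up for illustration:

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (attributes shortened).
workdir=$(mktemp -d)

cat > "$workdir/file1.txt" <<'EOF'
2L FlyBase mRNA 11009821 11011086 . - . ID=transcript:FBtr0080162;Parent=gene:FBgn0032329
2L FlyBase mRNA 67043 71081 . . ID=transcript:FBtr0306536;Parent=gene:FBgn0067779
2L FlyBase mRNA 67043 71390 . . ID=transcript:FBtr0078100;Parent=gene:FBgn0067779
EOF

# file2 with blank lines between the IDs, as in the question.
cat > "$workdir/file2.txt" <<'EOF'
FBtr0306536

FBtr0078100

FBtr0306536

FBtr0078100
EOF

result=$(
    while IFS= read -r id
    do
        # Skip blank lines: an empty pattern would match every line.
        [ -n "$id" ] || continue
        # -F: fixed string, -w: whole-word match, --: end of options.
        grep -wF -- "ID=transcript:$id" "$workdir/file1.txt"
    done < "$workdir/file2.txt"
)

printf '%s\n' "$result"
rm -rf "$workdir"
```

The loop starts one grep per ID and rereads file1.txt each time, so for long ID lists a single-pass tool such as awk is likely to be faster.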
