I am taking a fourth-year bioinformatics course. In the current assignment, the prof has given us a gff file with all the miRNA genes in the human genome annotated as gene-MIR. We are supposed to use grep, along with a regular expression and other command-line tools, to generate a list of unique miRNA names in the human genome. It seems fairly straightforward and I understand how to do most of it, but I am having trouble sorting the file and removing the repeated lines. We are supposed to do this in a single command line, and that is the part I cannot get to work.
This is the grep command I used to generate a list of gene-MIR names:
grep -Eo "(\gene-MIR)\d*\w*" file.gff
But this only generates a huge list with multiple repeats. So I tried:
grep -Eo "(\gene-MIR)\d*\w*" file.gff > file2 | sort < file2 | uniq -c > file3
But this did not work either. I have tried many variations of the above, but I am unsure of what to do next.
Can anyone offer any help/advice?
CodePudding user response:
You can use
grep -o 'gene-MIR[[:alnum:]_]*' file.gff | sort -u > file3
Details:
-o - outputs the matched text only
gene-MIR[[:alnum:]_]* - regex matching gene-MIR and then zero or more "word" chars: letters, digits or underscores (as \w is not supported universally)
sort -u - sorts and keeps only unique entries.
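To see why the attempt in the question returns nothing useful, note that "> file2" sends grep's output into the file, so nothing travels through the pipe, and sort may read file2 before grep has finished writing it. The sketch below runs the suggested pipeline on a tiny made-up GFF file. The file contents, the names workdir, gff, unique and counts, and the LC_ALL=C setting are all assumptions added for illustration, not part of the assignment data.

```shell
export LC_ALL=C   # assumption: fixed collation so the sort order is reproducible

# Hypothetical miniature GFF; these lines are made up, not real annotation.
workdir=$(mktemp -d)
gff="$workdir/sample.gff"
printf 'chr1\tsrc\tgene\t1\t90\t.\t+\t.\tID=gene-MIR21;Name=MIR21\n'        >  "$gff"
printf 'chr1\tsrc\ttranscript\t1\t90\t.\t+\t.\tID=rna-1;Parent=gene-MIR21\n' >> "$gff"
printf 'chr2\tsrc\tgene\t5\t70\t.\t-\t.\tID=gene-MIR155;Name=MIR155\n'       >> "$gff"
printf 'chr3\tsrc\tgene\t9\t99\t.\t+\t.\tID=gene-MIR17HG;Name=MIR17HG\n'     >> "$gff"
printf 'chr2\tsrc\ttranscript\t5\t70\t.\t-\t.\tID=rna-2;Parent=gene-MIR155\n' >> "$gff"

# Keep everything in the pipe: grep emits one match per line,
# sort -u sorts them and drops the duplicates.
unique=$(grep -o 'gene-MIR[[:alnum:]_]*' "$gff" | sort -u)
printf '%s\n' "$unique"

# Variant: if the number of occurrences is wanted as well,
# sort first and let uniq -c prefix each name with its count.
counts=$(grep -o 'gene-MIR[[:alnum:]_]*' "$gff" | sort | uniq -c)
printf '%s\n' "$counts"
```

On the real data, the only redirection needed is the final one, as in the command above: "| sort -u > file3". Note that uniq only collapses adjacent duplicates, which is why the variant sorts before calling it.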