How can I use grep with a pipe to sort unique lines from a GFF file?

I am taking a fourth-year bioinformatics course. In the current assignment, the prof has given us a GFF file in which all the miRNA genes in the human genome are annotated as gene-MIR. We are supposed to use grep, along with a regular expression and other command-line tools, to generate a list of unique miRNA names in the human genome. It seems fairly straightforward, and I understand how to do most of it, but I am having trouble sorting the output and removing the repeated lines. We are supposed to do all of this in a single command line.

This is the grep command I used to generate a list of gene-MIR names:

grep -Eo "(\gene-MIR)\d*\w*" file.gff

But this only generates a huge list with multiple repeats. So I tried:

grep -Eo "(\gene-MIR)\d*\w*" file.gff > file2 | sort < file2 | uniq -c > file3

But this did not work either. I have tried many variations of the above, but I am unsure of what to do next.

Can anyone offer any help/advice?

CodePudding user response:

You can use

grep -o 'gene-MIR[[:alnum:]_]*' file.gff | sort -u > file3

Details:

-o - outputs only the matched text, one match per line
gene-MIR[[:alnum:]_]* - a regex matching gene-MIR followed by zero or more "word" characters, i.e. letters, digits or underscores (used because \w is not supported universally)
sort -u - sorts the lines and keeps only the unique entries
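
To see the pipeline at work, here is a small sketch run against a made-up three-line sample. The file contents, the temporary directory and the variable names are assumptions for illustration only, not taken from the real assignment file. Note that in the attempt from the question, "> file2" sends grep's output into the file, so nothing flows through the pipe to sort; dropping the intermediate files and letting each command read the previous one's output avoids that.

```shell
# Hypothetical sample standing in for the real GFF file (made-up lines).
tmpdir=$(mktemp -d)
printf 'chr17\t.\tgene\t1\t72\t.\t+\t.\tID=gene-MIR21;Name=MIR21\n'    > "$tmpdir/sample.gff"
printf 'chr17\t.\texon\t1\t72\t.\t+\t.\tParent=gene-MIR21\n'          >> "$tmpdir/sample.gff"
printf 'chr21\t.\tgene\t5\t69\t.\t+\t.\tID=gene-MIR155;Name=MIR155\n' >> "$tmpdir/sample.gff"

# Unique names only: grep writes to the pipe and sort -u removes duplicates.
unique_names=$(grep -o 'gene-MIR[[:alnum:]_]*' "$tmpdir/sample.gff" | sort -u)
printf '%s\n' "$unique_names"

# Names with occurrence counts: uniq -c only merges adjacent duplicate lines,
# so the input has to be sorted first.
name_counts=$(grep -o 'gene-MIR[[:alnum:]_]*' "$tmpdir/sample.gff" | sort | uniq -c)
printf '%s\n' "$name_counts"

# Clean up the temporary sample.
rm -rf "$tmpdir"
```

If you want the counts that uniq -c gives (as in the original attempt), use the second form; if you only need the list of names, sort -u alone is enough.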