I am trying to extract lines that match a specific pattern from a text file. The pattern should start with chr, followed by either a digit appearing at most 2 times or a letter X or Y appearing exactly once, and then exactly one underscore. Input example:
chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593
I am running this command in the console: grep -P "^chr(\d{,2}|X{1}|Y{1})_" input_file.txt
It only returns the lines that start with chrX_ or chrY_, but not chr2_ (I wrote 2, but it could be any digit or digits).
However, if I run grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt (note that I changed {,2} to {1,2}), I get what I expected. I can't figure out why the first version does not work. I thought that in regex you could specify that a pattern is matched at most N times with the syntax {,N}.
Thanks in advance!
CodePudding user response:
Note that you chose the PCRE regex engine for grep by passing the -P option.
Under that engine, \d{,2} does not match zero to two digits: it matches a single digit followed by the literal string {,2}.
See the PCRE documentation:
An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
Also, see the limiting quantifier definition there:
The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second.
Note that the POSIX-style ERE flavor used by grep -E (at least in GNU grep) is not that strict: it allows omitting the minimum threshold value from the limiting quantifier, in which case the minimum is assumed to be 0:
grep -oE '[0-9]{,2}_' <<< "12_ 21"
## => 12_
grep -oP '[0-9]{,2}_' <<< "21_ 1{,2}_"
## => 1{,2}_
Note: I'd advise always specifying the 0 minimum explicitly, since the behavior varies from engine to engine. In the TRE regex flavor, which base R uses as its default regex engine, omitting the zero leads to a bug.
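A minimal sketch of that advice, assuming GNU grep and a POSIX-like shell (the variable name out is only for illustration). With the lower bound written out, the quantifier no longer depends on how the engine treats {,2}; the same {0,2} form is also the one described in the PCRE documentation quoted above:

```shell
# Explicit lower bound: {0,2} is an ordinary interval quantifier,
# so there is no ambiguity about how the braces are parsed.
out=$(printf '%s\n' '12_ 21' | grep -oE '[0-9]{0,2}_')

# Print the only match found in the sample text.
echo "$out"
```

Here the only match is the two digits followed by the underscore, the same result as the {,2} version gives under grep -E.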
CodePudding user response:
The {,2} quantifier is missing its lower bound. Try:
grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt
My first guess was to use egrep instead. This also seems to work:
egrep "^chr([[:digit:]]{1,2}|X{1}|Y{1})_" input_file.txt
CodePudding user response:
-P, --perl-regexp Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
It would seem {,n} is not supported in Perl-compatible regex.
Using grep ERE instead:
$ grep -E 'chr([0-9]{,2}|[XY])_' input_file
chr5_ 16560869
chr12_ 131428407
chr22_ 13191864
chrX_ 96055593
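One thing to keep in mind with the ERE pattern above: {,2} (like {0,2}) also accepts zero digits, and the pattern has no ^ anchor. The following is a minimal sketch of the difference, assuming GNU grep. The last two input lines (chr_ and xchr7_) are hypothetical edge cases added for illustration and are not part of the original file:

```shell
# Sample lines from the question plus two hypothetical edge cases
# (chr_ and xchr7_) that are NOT in the original input.
input='chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593
chr_ 1
xchr7_ 2'

# Zero to two digits, unanchored: also accepts both edge cases.
loose=$(printf '%s\n' "$input" | grep -E 'chr([0-9]{0,2}|[XY])_')

# One to two digits, anchored at line start: rejects both edge cases.
strict=$(printf '%s\n' "$input" | grep -E '^chr([0-9]{1,2}|[XY])_')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

If a chromosome name with no digits should never match, the anchored {1,2} form is the safer choice; if "at most 2" is really meant to include zero, the loose form does what was asked.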