Grep pattern matching at most n times using Perl flag

Time:07-22

I am trying to extract lines matching a specific pattern from a text file. The pattern should start with chr, followed by either at most two digits or exactly one letter X or Y, and then exactly one underscore. Input example:

chr5_   16560869
chrX    46042911
chr12_  131428407
chr22_  13191864
chr5    165608
chrX_   96055593

I am running this command on the console: grep -P "^chr(\d{,2}|X{1}|Y{1})_" input_file.txt. It only returns the lines that start with chrX_ or chrY_, but not the lines that start with chr, one or two digits, and an underscore (for example chr5_ or chr12_).

If I run grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt instead (note the change from {,2} to {1,2}), I get the expected result. I can't figure out why the first version does not work. I thought that in regular expressions you could say "match at most N times" with the syntax {,N}.

Thanks in advance!

CodePudding user response:

Note that the -P option makes grep use the PCRE regex engine.

With this engine, \d{,2} does not match zero to two digits; it matches one digit followed by the literal text {,2}. See the regex demo.

See the PCRE documentation:

An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.

Also, see the limiting quantifier definition there:

The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second.

Note that the POSIX regex flavor, as implemented by grep -E, is less strict: it allows the minimum value to be omitted from the limiting quantifier, in which case it is assumed to be 0:

grep -oE  '[0-9]{,2}_' <<< "12_ 21"
## => 12_

grep -oP  '[0-9]{,2}_' <<< "21_ 1{,2}_"
## => 1{,2}_

See the online demo.
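
Since the treatment of {,2} differs between engines (and might also differ between versions of the library behind grep -P, which is an assumption worth checking locally), a quick probe can show what the installed grep does. This is only a sketch; the variable name pcre_mode is made up for the illustration.

```shell
# Probe how this grep's -P engine reads {,2}.
# If {,2} is a quantifier, "12_" matches ^[0-9]{,2}_$ (two digits, then "_").
# If {,2} is literal text, the pattern needs the characters "{,2}" and fails.
if printf '12_\n' | grep -qP '^[0-9]{,2}_$'; then
  pcre_mode="quantifier"
else
  pcre_mode="literal"
fi
echo "grep -P treats {,2} as: $pcre_mode"
```

If the probe prints "literal", the behavior described in this answer applies and the minimum has to be written out as {0,2}.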

Note

I'd advise always specifying the minimum value 0 explicitly, since the behavior varies from engine to engine. In the TRE regex flavor, which base R uses as its default regex engine, omitting the zero leads to a bug.
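
As a rough illustration of that advice, the sketch below runs the question's sample rows through both engines with the minimum written out as {0,2}. The temporary file and the variable names (input_file, ere_out, pcre_out) are assumptions made for the example, and it presumes a grep built with -P support.

```shell
# Sample rows from the question, written to a temporary file (tab-separated).
input_file=$(mktemp)
printf 'chr5_\t16560869\nchrX\t46042911\nchr12_\t131428407\n' > "$input_file"
printf 'chr22_\t13191864\nchr5\t165608\nchrX_\t96055593\n' >> "$input_file"

# Explicit minimum of 0: the same quantifier spelling works for ERE and PCRE.
ere_out=$(grep -E '^chr([0-9]{0,2}|[XY])_' "$input_file")
pcre_out=$(grep -P '^chr(\d{0,2}|[XY])_' "$input_file")

printf '%s\n' "$ere_out"
if [ "$ere_out" = "$pcre_out" ]; then
  echo "ERE and PCRE agree"
fi

rm -f "$input_file"
```

Both commands should select the same four rows (chr5_, chr12_, chr22_ and chrX_), because {0,2} is a valid quantifier in either flavor.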

CodePudding user response:

The minimum value is missing from {,2}.

Try:

grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt

My first guess was to use egrep instead.

This alternative also seems to work:

egrep "^chr([[:digit:]]{1,2}|X{1}|Y{1})_" input_file.txt
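
A minimal sketch of the same idea spelled with grep -E (egrep is equivalent to grep -E), comparing {1,2} with {0,2}. It assumes the sample rows from the question; the extra chr_ row is a made-up line, added only to show that {1,2} requires at least one digit while {0,2} does not.

```shell
# Sample rows from the question plus one hypothetical row (chr_) with no digits.
input_file=$(mktemp)
printf '%s\n' 'chr5_ 16560869' 'chrX 46042911' 'chr12_ 131428407' \
  'chr22_ 13191864' 'chr5 165608' 'chrX_ 96055593' 'chr_ 1' > "$input_file"

# {1,2}: one or two digits, so the chr_ row is rejected.
strict_out=$(grep -E '^chr([[:digit:]]{1,2}|[XY])_' "$input_file")

# {0,2}: zero to two digits, so the chr_ row is accepted as well.
loose_out=$(grep -E '^chr([[:digit:]]{0,2}|[XY])_' "$input_file")

printf 'strict:\n%s\n' "$strict_out"
printf 'loose:\n%s\n' "$loose_out"

rm -f "$input_file"
```

Which of the two is wanted depends on whether a name with no digits at all should count as a match.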

CodePudding user response:

-P, --perl-regexp Interpret PATTERNS as Perl-compatible regular expressions (PCREs).

It would seem that {,n} is not accepted as a quantifier by the Perl-compatible engine.

Using grep with ERE (-E) instead:

$ grep -E 'chr([0-9]{,2}|[XY])_' input_file
chr5_   16560869
chr12_  131428407
chr22_  13191864
chrX_   96055593