I have the following strings:
This blue fox is [RULE-22]
The quick brown fox is like [RULE-1]
I am trying to find a regex that can help me fetch the number associated with the rule.
So far I have tried the following, but it doesn't work:
/[RULE]*\d/
The above regular expression returns the number for any text enclosed by [], like the following:
this is blue fox [PLAY-22]
returns me 22
this is not working [RULE-44]
returns me 44
whereas I want this regex to return the number only for RULE enclosed by [].
Thanks
CodePudding user response:
Square brackets in a Regexp denote a character class. A character class matches any one of the characters in the class, i.e.
/[abc]/
is equivalent to
/a|b|c/
That means that in your Regexp, this sub-part:
/[RULE]/
is equivalent to
/R|U|L|E/
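The equivalence is easy to check in Ruby. This is only a small sketch: the constant names and the sample strings are made up for illustration and are not part of the question.

```ruby
# A character class matches exactly one character out of the set,
# so /[RULE]/ should behave like /R|U|L|E/.
CLASS_RE       = /[RULE]/
ALTERNATION_RE = /R|U|L|E/

# Hypothetical sample inputs, chosen only for illustration.
SAMPLES = ['RULE', 'LURE', 'E', 'xyz', 'blue Edge'].freeze

SAMPLES.each do |s|
  # String#[] with a Regexp returns the first matched substring, or nil.
  puts "#{s.inspect}: class=#{s[CLASS_RE].inspect} alternation=#{s[ALTERNATION_RE].inspect}"
end
```

Note that the order of the letters inside the class does not matter: 'LURE' matches just as well as 'RULE', and in both cases only a single character is matched.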
The * in your Regexp is the so-called Kleene star, which in a Ruby Regexp means zero or more repetitions of the preceding element.
The \d in your Regexp is an alternative notation for a pre-defined character class. In particular,
/\d/
is equivalent to
/[0-9]/
which is equivalent to
/[0123456789]/
which is equivalent to
/0|1|2|3|4|5|6|7|8|9/
i.e. it matches a single digit from 0 to 9.
Putting it all together, your Regexp matches "any number of Rs, Us, Ls, and Es directly followed by a single digit".
Now, you might ask yourself: why does this match anything in your test data? In your test data, there is an ASCII hyphen (-) before the digit, but that is not mentioned anywhere in the Regexp. There is no R, U, L, or E directly in front of the digit, so why is there a match?
Well, actually, there is "any number of Rs, Us, Ls, and Es" directly in front of the digit, because "any number of" includes zero! When matching a String with a Regexp, you can consider that between any two characters of the String, there is an arbitrary number of empty Strings for the Regexp to match.
So, your Regexp matches the empty String in between the - and the digit with zero repetitions of Rs, Us, Ls, and Es (i.e. with the [RULE]* part of your Regexp) and then matches the first digit with the \d part of your Regexp.
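To see this behavior directly, here is a minimal sketch using the pattern from the question. The helper name first_match and the input 'RULE7' are hypothetical and only serve the illustration.

```ruby
# The pattern from the question: zero or more of R, U, L, E, then one digit.
ORIGINAL_RE = /[RULE]*\d/

# Returns the first substring matched by the original pattern, or nil.
def first_match(text)
  text[ORIGINAL_RE]
end

p first_match('this is blue fox [PLAY-22]')    # zero letters, then the first digit
p first_match('this is not working [RULE-44]') # "RULE" is not part of the match
p first_match('RULE7')                         # letters directly before the digit
```

In the first two cases only a single digit is matched, because the hyphen separates the letters from the digits. Only in the made-up third input do the letters take part in the match.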
What you actually want to match is the exact sequence of characters [ followed by R followed by U followed by L followed by E followed by -, and then you want to follow this sequence of characters with at least one digit and then the exact character ].
So, in order to match an exact sequence of characters, you just write down that sequence of characters. BUT the characters [ and ] have special meaning in a Regexp because they denote a character class. They are so-called metacharacters and thus need to be escaped. In a Ruby Regexp, metacharacters are escaped with a backslash \.
The beginning of our Regexp now looks like this:
/\[RULE-/
After that, we need to match at least one digit. We already know how to match a digit: we can use the character class \d. And we know how to match any number of something by using the Kleene star *. So, if we wanted to match at least one of something, we could match that thing followed by any number of the thing, like this:
/\d\d*/
But actually, there is a specific operator for matching at least one: the + operator. a+ is equivalent to aa*, so we can match a number composed of multiple digits like this:
/\d+/
After that, we only need to match the closing square bracket, which is again a metacharacter and thus needs to be escaped. The whole Regexp thus looks like this:
/\[RULE-\d+\]/
This will match the pattern [RULE-<any integer with digits from 0 to 9>], which is what we want.
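As a quick sanity check, the pattern can be tried against a few inputs. This is a sketch: the helper rule_tag? is a hypothetical name, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
# Literal "[RULE-", at least one digit, literal "]".
RULE_TAG_RE = /\[RULE-\d+\]/

# True when the text contains a well-formed [RULE-<digits>] tag.
def rule_tag?(text)
  RULE_TAG_RE.match?(text)
end

p rule_tag?('this is not working [RULE-44]') # tag present
p rule_tag?('this is blue fox [PLAY-22]')    # wrong word inside the brackets
p rule_tag?('[RULE-]')                       # no digit at all
p rule_tag?('RULE-44')                       # brackets missing
```

Only the first input should be accepted, since the other three each miss one required part of the pattern.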
However, we are not done yet: we don't just want to check whether our String contains the pattern we are looking for somewhere, we also want to know the rule number. So, we have to extract the number somehow.
Let's say our test string is
test = 'this is not working [RULE-44]'
With our current Regexp, when we match the test string, we get back the whole pattern:
re = /\[RULE-\d+\]/
scan = test.scan(re)
#=> ['[RULE-44]']
match = re.match(test)
#=> #<MatchData '[RULE-44]'>
So, we somehow need to tell the Regexp that we don't care about certain parts and do care about others. One way to do this is by using a capturing group for the number. A capturing group is introduced by simply enclosing the part of the Regexp you want to be captured within round parentheses, ( and ):
re = /\[RULE-(\d+)\]/
scan = test.scan(re)
#=> [['44']]
match = re.match(test)
#=> #<MatchData '[RULE-44]' 1: '44'>
As you can see, when using String#scan, we now get a nested Array with one entry, and when using Regexp#match, we get a MatchData object with the global match and one numbered match. We can access the numbered match by indexing the MatchData object with the number of the match:
match[1]
#=> '44'
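The nested Array from String#scan becomes handy when a String contains more than one rule tag. The following is a sketch only: the helper rule_numbers and the sample sentence are made up for illustration.

```ruby
# Collects every number that appears inside a [RULE-<digits>] tag.
# With one capture group, String#scan yields one single-element Array
# per match, so the result is flattened into a plain Array of Strings.
def rule_numbers(text)
  text.scan(/\[RULE-(\d+)\]/).flatten
end

p rule_numbers('see [RULE-1] and [RULE-22], but not [PLAY-3]')
p rule_numbers('no tags here')
```

Tags with a different word, such as the hypothetical [PLAY-3], are skipped, and a String without any tag yields an empty Array.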
We can give the capture group a name:
re = /\[RULE-(?<rule_number>\d+)\]/
match = re.match(test)
#=> #<MatchData "[RULE-44]" rule_number:"44">
This doesn't change the result with String#scan, but with Regexp#match, we now get a much nicer MatchData object and we can access the group by its name:
match[:rule_number]
#=> '44'
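Wrapped into a small helper, the named group could be used as follows. This is a sketch under assumptions: the method name rule_number and the decision to return an Integer (or nil when there is no tag) are mine, not part of the question.

```ruby
RULE_RE = /\[RULE-(?<rule_number>\d+)\]/

# Returns the rule number as an Integer, or nil when the text
# contains no [RULE-<digits>] tag.
def rule_number(text)
  match = RULE_RE.match(text)
  match && match[:rule_number].to_i
end

p rule_number('This blue fox is [RULE-22]')
p rule_number('The quick brown fox is like [RULE-1]')
p rule_number('this is blue fox [PLAY-22]')
```

Regexp#match returns nil when nothing matches, which is why the helper guards the group access before converting the captured digits.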
An alternative to using a capturing group would be to use assertions. An assertion says "this must match", but the assertion does not become a part of the match itself. There are four kinds of assertions: an assertion can be either positive ("must match") or negative ("must not match"), and it can be either a lookahead or a lookbehind (depending on whether you want to assert something after or before the current position).
re = /(?<=\[RULE-)\d+(?=\])/
scan = test.scan(re)
#=> ['44']
match = re.match(test)
#=> #<MatchData '44'>
This looks much nicer, doesn't it? There's one last trick we can use: \K is somewhat similar to a positive lookbehind and basically means "assert that everything before the \K matches and then forget it":
re = /\[RULE-\K\d+(?=\])/
scan = test.scan(re)
#=> ['44']
match = re.match(test)
#=> #<MatchData '44'>
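The three approaches (capture group, lookaround assertions, and \K) should all extract the same number. A minimal sketch to compare them side by side; the VARIANTS table and the extract helper are hypothetical names introduced only for this comparison.

```ruby
VARIANTS = {
  capture_group: /\[RULE-(\d+)\]/,
  lookaround:    /(?<=\[RULE-)\d+(?=\])/,
  keep_out:      /\[RULE-\K\d+(?=\])/
}.freeze

# Returns the extracted digits for the named variant, or nil.
def extract(text, name)
  match = VARIANTS.fetch(name).match(text)
  return nil unless match

  # Only the capture-group variant keeps the brackets in the overall
  # match, so the number has to be read from group 1 there.
  name == :capture_group ? match[1] : match[0]
end

VARIANTS.each_key do |name|
  puts "#{name}: #{extract('this is not working [RULE-44]', name).inspect}"
end
```

Which variant to prefer is mostly a matter of taste: the capture group is the most portable across regex engines, while the other two return the number as the whole match.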
There is one last thing that we could do, depending on exactly what your input data looks like: we could anchor the Regexp to only match at the end of a line or at the end of the String. This makes sure that we don't match a case where [RULE-<number>] appears somewhere in the middle of the text.
There are three different anchors we could use:
$ matches the end of the line,
\z matches the end of the String, and
\Z matches the end of the String, but if the String ends with a newline, then it matches just before the newline.
Of these, the two most useful ones are $ and \Z in my opinion. So, depending on what your input data looks like, it might make sense to use either one of these two Regexps:
re = /\[RULE-\K\d+(?=\]$)/
re = /\[RULE-\K\d+(?=\]\Z)/
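The difference between the two anchors shows up with multi-line input. A small sketch, where the constant names and the sample strings are made up for illustration:

```ruby
END_OF_LINE   = /\[RULE-\K\d+(?=\]$)/
END_OF_STRING = /\[RULE-\K\d+(?=\]\Z)/

# Hypothetical two-line input with a trailing newline.
MULTI = "first [RULE-1]\nsecond [RULE-2]\n".freeze
# Hypothetical input with the tag in the middle of the text.
MIDDLE = 'mid [RULE-3] text'.freeze

p MULTI.scan(END_OF_LINE)    # $ accepts a tag at the end of every line
p MULTI.scan(END_OF_STRING)  # \Z accepts only the tag before the final newline
p MIDDLE[END_OF_LINE]        # a tag in the middle of a line is rejected
```

So $ suits input where each line carries its own tag, while \Z suits input where only a tag at the very end of the whole String should count.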
CodePudding user response:
To find only the digits after [RULE within the square brackets, you could use a capture group:
\[RULE\b[^]\[\d]*(\d+)[^]\[\d]*]
Explanation
\[RULE\b Match [ followed by the word RULE
[^]\[\d]* Optionally match any character except [, ] or a digit
(\d+) Capture group 1, match 1+ digits
[^]\[\d]* Optionally match any character except [, ] or a digit
] Match the closing ]
Or, more precisely for the current example data:
\[RULE-(\d+)]
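In Ruby, the capture group of the more precise pattern can be read in one step with String#[], which accepts a Regexp and a group index. This is a sketch: the helper name rule_digits is hypothetical, and the closing bracket is written escaped here, which is an assumption on my part to keep Ruby from warning about an unescaped ].

```ruby
# Returns the digits captured by group 1, or nil when there is no match.
def rule_digits(text)
  text[/\[RULE-(\d+)\]/, 1]
end

p rule_digits('This blue fox is [RULE-22]')
p rule_digits('this is blue fox [PLAY-22]')
```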