Regex to fetch number from "This blue fox is [SECTION-1]"

Time:08-14

I have the following strings:

This blue fox is [RULE-22]

The quick brown fox is like [RULE-1]

I am trying to find a regex that can help me fetch the number associated with the rule.

So far I have tried the following, but it doesn't work:

/[RULE]*\d/

The above regular expression returns the number for any text enclosed by [], like the following:

this is blue fox [PLAY-22] returns me 22

this is not working [RULE-44] returns me 44

whereas I want this regex to return the number only for RULE enclosed by [].

Thanks

CodePudding user response:

Square brackets in a Regexp denote a character class. A character class matches any of the characters in the class, i.e.

/[abc]/

is equivalent to

/a|b|c/

That means that in your Regexp, this sub-part:

/[RULE]/

is equivalent to

/R|U|L|E/

The * in your Regexp is the so-called Kleene star, which specifically in Ruby Regexp means zero or more repetitions.

The \d in your Regexp is an alternative notation for a pre-defined character class. In particular,

/\d/

is equivalent to

/[0-9]/

which is equivalent to

/[0123456789]/

which is equivalent to

/0|1|2|3|4|5|6|7|8|9/

i.e. it matches a single digit from 0 to 9.
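
As a quick sanity check of these equivalences, the sketch below runs all four notations against the same sample string and collects what each one matches. The sample string and variable names are made up for illustration.

```ruby
# Four notations that should all match a single ASCII digit.
digit_forms = [/\d/, /[0-9]/, /[0123456789]/, /0|1|2|3|4|5|6|7|8|9/]

sample = 'fox-7'

# String#[] with a Regexp returns the first matched substring (or nil).
results = digit_forms.map { |re| sample[re] }

p results  # every entry is the same one-character String
```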

Putting it all together, your Regexp matches "any number of Rs, Us, Ls, and Es directly followed by a single digit".

Now, you might ask yourself: why does this match anything in your test data? In your test data, there is an ASCII hyphen (-) before the digit, but that is not mentioned in the Regexp anywhere. There is no R, U, L, or E directly in front of the digit, so why is there a match?

Well, actually, there is "any number of Rs, Us, Ls, and Es" directly in front of the digit, because "any number of" includes zero! When matching a String with a Regexp, you can consider that between any two characters of the String, there is an arbitrary number of empty Strings for the Regexp to match.

So, your Regexp matches the empty String in between the - and the digit with zero repetitions of Rs, Us, Ls, and Es (i.e. with the [RULE]* part of your Regexp) and then matches the first digit with the \d part of your Regexp.
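
To see this empty-match behaviour in practice, here is a small sketch that runs the original Regexp against two of the sample strings from the question. The variable names are only for illustration.

```ruby
# The original pattern from the question.
original = /[RULE]*\d/

play = 'this is blue fox [PLAY-22]'
rule = 'this is not working [RULE-44]'

# In both strings, [RULE]* matches zero characters right after the hyphen,
# and \d then consumes exactly one digit.
play_match = play[original]
rule_match = rule[original]

p play_match  # a single digit, not the whole number
p rule_match  # likewise a single digit
```

Note that the pattern succeeds on the PLAY string as well, which is exactly the problem described in the question.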

What you actually want to match is the exact sequence of characters [ followed by R followed by U followed by L followed by E followed by -, and then you want to follow this sequence of characters with at least one digit and then the exact character ].

So, in order to match an exact sequence of characters, you just write down that sequence of characters. BUT the characters [ and ] have special meaning in a Regexp because they denote a character class. They are so-called metacharacters and thus need to be escaped. In Ruby Regexp, metacharacters are escaped with a backslash \.

The beginning of our Regexp now looks like this:

/\[RULE-/

After that, we need to match at least one digit. We already know how to match a digit, we can use the character class \d. And we know how to match any number of something by using the Kleene star *. So, if we wanted to match at least one of something, we could match that thing followed by any number of the thing, like this:

/\d\d*/

But actually, there is a specific operator for matching at least one: the + operator. a+ is equivalent to aa*, so we can match a number composed of multiple digits like this:

/\d+/

After that, we only need to match the closing square bracket, which is again a metacharacter and thus needs to be escaped. The whole Regexp thus looks like this:

/\[RULE-\d+\]/

This will match the pattern [RULE-<any integer with digits from 0 to 9>], which is what we want.
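
As a rough check, the sketch below filters a few sample strings with this pattern. The `samples` list is made up from the strings in the question, and `Regexp`-based `String#match?` assumes Ruby 2.4 or later.

```ruby
rule_re = /\[RULE-\d+\]/

samples = [
  'This blue fox is [RULE-22]',
  'The quick brown fox is like [RULE-1]',
  'this is blue fox [PLAY-22]'
]

# Keep only the strings that contain a [RULE-<digits>] tag.
matching = samples.select { |s| s.match?(rule_re) }

puts matching
```

Only the two RULE strings should survive the filter; the PLAY string no longer matches.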

However, we are not done yet: we don't just want to check whether our String contains somewhere the pattern we are looking for, we also want to know the rule number. So, we have to extract the number somehow.

Let's say our test string is

test = 'this is not working [RULE-44]'

With our current Regexp, when we match the test string, we get back the whole pattern:

re = /\[RULE-\d+\]/

scan = test.scan(re)
#=> ['[RULE-44]']

match = re.match(test)
#=> #<MatchData '[RULE-44]'>

Rubular demo

So, we somehow need to tell the Regexp that we don't care about certain parts and do care about others. One way to do this is by using a capturing group for the number. A capturing group is introduced by simply enclosing the part of the Regexp you want to be captured within round parentheses, ( and ):

re = /\[RULE-(\d+)\]/

scan = test.scan(re)
#=> [['44']]

match = re.match(test)
#=> #<MatchData '[RULE-44]' 1: '44'>

Rubular demo

As you can see, when using String#scan, we now get a nested Array with one entry and when using Regexp#match, we get a MatchData object with the global match and one numbered match. We can access the numbered match by indexing the MatchData object with the number of the match:

match[1]
#=> '44'

We can give the capture group a name:

re = /\[RULE-(?<rule_number>\d+)\]/

match = re.match(test)
#=> #<MatchData "[RULE-44]" rule_number:"44">

Rubular demo

This doesn't change the result with String#scan, but with Regexp#match, we now get a much nicer MatchData object and we can access the group by its name:

match[:rule_number]
#=> '44'
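
Putting the named group to work, here is a minimal sketch of a helper that returns the rule number as an Integer. The lambda name `rule_number` and the sample strings are hypothetical, chosen only for this illustration.

```ruby
re = /\[RULE-(?<rule_number>\d+)\]/

# Hypothetical helper: returns the rule number as an Integer,
# or nil when the text contains no [RULE-<digits>] tag.
rule_number = lambda do |text|
  match = re.match(text)
  match && match[:rule_number].to_i
end

first = rule_number.call('This blue fox is [RULE-22]')
other = rule_number.call('this is blue fox [PLAY-22]')

# String#scan returns one Array of captures per match, so flatten
# before converting when a text contains several tags.
several = 'See [RULE-3] and [RULE-15]'.scan(re).flatten.map(&:to_i)

p first, other, several
```

Returning nil for a non-matching string keeps the "no rule found" case distinguishable from any real rule number.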

An alternative to using a capturing group would be to use assertions. An assertion says "this must match", but the assertion does not become part of the match itself. There are four kinds of assertions: an assertion can be either positive ("must match") or negative ("must not match"), and it can be either a lookahead or a lookbehind (depending on whether you want to assert something after or before the current position).

re = /(?<=\[RULE-)\d+(?=\])/

scan = test.scan(re)
#=> ['44']

match = re.match(test)
#=> #<MatchData '44'>

Rubular demo

This looks much nicer, doesn't it? There's one last trick we can use: \K is somewhat similar to a positive lookbehind and basically means "assert that everything before the \K matches and then forget it":

re = /\[RULE-\K\d+(?=\])/

scan = test.scan(re)
#=> ['44']

match = re.match(test)
#=> #<MatchData '44'>

Rubular demo

There is one last thing that we could do, depending on exactly what your input data looks like: we could anchor the Regexp to only match at the end of a line or at the end of the String. This makes sure that we don't match a case where [RULE-<number>] appears somewhere in the middle of the text.

There are three different anchors we could use:

  • $ matches the end of the line,
  • \z matches the end of the String, and
  • \Z matches the end of the String, but if the String ends with a newline, then it matches just before the newline.

Of these, the two most useful ones are $ and \Z in my opinion. So, depending on what your input data looks like, it might make sense to use either one of these two Regexps:

re = /\[RULE-\K\d+(?=\]$)/
re = /\[RULE-\K\d+(?=\]\Z)/

Rubular demo
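
To make the difference between the three anchors concrete, the sketch below tries each of them on a few made-up inputs: a tag in the middle of the text, a tag followed by a trailing newline, and a two-line String. All variable names and sample strings are assumptions for illustration.

```ruby
at_line_end      = /\[RULE-\K\d+(?=\]$)/
at_string_end    = /\[RULE-\K\d+(?=\]\Z)/
at_strict_end    = /\[RULE-\K\d+(?=\]\z)/

middle    = 'see [RULE-7] for details'
trailing  = "see details in [RULE-7]\n"
two_lines = "first [RULE-1]\nsecond [RULE-2]"

# A tag in the middle of the text is rejected by the anchored patterns.
middle_result = middle[at_line_end]

# $ and \Z tolerate a trailing newline; \z does not.
trailing_line   = trailing[at_line_end]
trailing_string = trailing[at_string_end]
trailing_strict = trailing[at_strict_end]

# $ matches at the end of every line, \Z only at the end of the String.
per_line   = two_lines.scan(at_line_end)
per_string = two_lines.scan(at_string_end)

p middle_result, trailing_line, trailing_string, trailing_strict
p per_line, per_string
```

Which anchor fits best depends on whether the input is processed line by line or as one multi-line String.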

CodePudding user response:

To find only the digits after [RULE within the square brackets, you could use a capture group:

\[RULE\b[^]\[\d]*(\d+)[^]\[\d]*]

Explanation

  • \[RULE\b Match [ followed by the word RULE
  • [^]\[\d]* Optionally match any character except [ ] or a digit
  • (\d+) Capture group 1, match 1 or more digits
  • [^]\[\d]* Optionally match any character except [ ] or a digit
  • ] Match the closing ]

Rubular demo

Or more precise for the current example data:

\[RULE-(\d+)]

Regex demo
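
The two patterns differ in how much variation they tolerate between RULE and the digits. The sketch below compares them on a few made-up inputs. The square brackets inside and after the character classes are escaped here only to avoid Ruby's "without escape" warnings; the patterns are intended to be equivalent to the ones above.

```ruby
loose  = /\[RULE\b[^\]\[\d]*(\d+)[^\]\[\d]*\]/
strict = /\[RULE-(\d+)\]/

hyphen = 'this is not working [RULE-44]'
spaced = 'this is not working [RULE #12]'
plural = 'this is not working [RULES-44]'

# String#[] with a capture index returns that group, or nil on no match.
loose_hyphen  = hyphen[loose, 1]
strict_hyphen = hyphen[strict, 1]

# Only the loose pattern accepts other separators after RULE.
loose_spaced  = spaced[loose, 1]
strict_spaced = spaced[strict, 1]

# The word boundary \b stops RULE from matching inside RULES.
loose_plural = plural[loose, 1]

p loose_hyphen, strict_hyphen, loose_spaced, strict_spaced, loose_plural
```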
