Extract a certain number from a character column using regex

Time:05-08

I have a regex problem.

For a long time I've run the code below, which extracts only the number of a statutory provision cited in court verdicts. Until yesterday it worked as intended, but now something has changed... I suspect that R no longer recognizes the §-symbol, but I have no basis for that suspicion. The code is:

dt <- dt %>%
  mutate(parnumber = str_match(char_column, pattern = 'Straffeloven[^§]+§*([0-9]+)')) %>%
  mutate(par = parnumber[,2])

An example of a verdict that is in the char_column is: "Straffeloven (1902) §162".

Another example has the year without parentheses:

"Straffeloven 1902 §162".

Not all rows include the year. Another example is: "Straffeloven §219".

Now the code returns only the last digit of the number after the "§". In the first example above, the new column "par" contains only the number 2, when I want it to contain 162.
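One way this symptom could arise, offered as an assumption rather than a confirmed diagnosis: if the "§" in the data no longer compares equal to the "§" in the pattern (for example after an encoding change), the greedy `[^§]+` runs to the end of the string and backtracks just far enough to leave a single digit for the capture group. The sketch below assumes the quantifiers in the pattern are `+`, uses base R instead of stringr, and uses `#` as a hypothetical stand-in for a section sign that fails to match; `extract_par` is a helper name made up for illustration.

```r
# "\u00a7" is the section sign, written as an escape to avoid encoding issues.
pattern <- "Straffeloven[^\u00a7]+\u00a7*([0-9]+)"

# Hypothetical helper: return the first capture group, or NA if no match.
extract_par <- function(x, pattern) {
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 is the full match, element 2 is the first capture group.
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

ok  <- "Straffeloven (1902) \u00a7162"  # section sign matches the pattern
bad <- "Straffeloven (1902) #162"       # '#' stands in for a non-matching sign

extract_par(ok, pattern)   # "162"
extract_par(bad, pattern)  # "2"
```

If this is what is happening, checking `Encoding(dt$char_column)` and how the script file itself is saved might be a reasonable first step.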

I've tried other variations of the regex, e.g. Straffeloven(\([0-9]+\))?[^0-9]+([0-9]+) or Straffeloven[^0-9]+([0-9]+).

The first one gets the error: '\(' is an unrecognized escape in character string starting "'Straffeloven(\(".

The other one works better, but when the year is included in parentheses, it is the year that is extracted, not the number that follows the §-sign.

Does anyone have an idea for an alternative regex that extracts the whole number that follows the §-sign, in the rows that have the word "Straffeloven", while ignoring the years (which are either 2005 or 1902)?

CodePudding user response:

This works:

vec <- c("Straffeloven (1902) §162", "Straffeloven 1902 §162", "Straffeloven §219", "Not this §219", "Straffeloven nor 219")
strcapture("Straffeloven.*§([0-9]+)\\b", vec, list(par = 0L))
#   par
# 1 162
# 2 162
# 3 219
# 4  NA
# 5  NA
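The pattern works because the greedy `.*` swallows whatever sits between "Straffeloven" and the "§" (a year, parentheses, or nothing), so the capture group only ever sees the digits after the sign. As a minimal sketch of how the result could be attached to a data frame column in base R, assuming a data frame `dt` with a character column `char_column` as in the question:

```r
# Example data mirroring the question; "\u00a7" is the section sign.
dt <- data.frame(
  char_column = c("Straffeloven (1902) \u00a7162",
                  "Straffeloven 1902 \u00a7162",
                  "Straffeloven \u00a7219",
                  "Not this \u00a7219",
                  "Straffeloven nor 219"),
  stringsAsFactors = FALSE
)

pattern <- "Straffeloven.*\u00a7([0-9]+)\\b"

# strcapture() returns a data frame shaped like `proto`;
# rows without a match become NA.
dt$par <- strcapture(pattern, dt$char_column,
                     proto = data.frame(par = integer()))$par

dt$par  # 162 162 219 NA NA
```

In the dplyr pipeline from the question, the same pattern string could presumably be passed to str_match and the second column taken, as the original code already does.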

CodePudding user response:

I guess I'm just old... str_extract should be a good function as well. Just escape your ( with \\( and that should tell R to treat it as a literal parenthesis rather than as regex grouping syntax.

https://regex101.com/

It's a good place to test patterns, with a good cheat sheet alongside to help. It's more regex than R really, but R's syntax requiring double escapes can often be frustrating...
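To make the double-escape point concrete, here is a hedged sketch in base R of a pattern that names the optional year explicitly, with or without parentheses. It assumes the year is always 1902 or 2005 and that exactly one space precedes the "§", which may not hold for every row of the real data.

```r
vec <- c("Straffeloven (1902) \u00a7162",
         "Straffeloven 1902 \u00a7162",
         "Straffeloven \u00a7219",
         "Not this \u00a7219",
         "Straffeloven nor 219")

# "\\(" in the R string becomes "\(" in the regex: a literal parenthesis.
# "(?:...)" groups without capturing, so the number after the sign is group 1.
pattern <- "Straffeloven(?: \\(?(?:1902|2005)\\)?)? \u00a7([0-9]+)"

m <- regmatches(vec, regexec(pattern, vec, perl = TRUE))

# Element 1 is the full match, element 2 the captured number.
par <- vapply(m, function(g) if (length(g) >= 2) as.integer(g[2]) else NA_integer_,
              integer(1), USE.NAMES = FALSE)

par  # 162 162 219 NA NA
```

This is stricter than the `.*§` approach: a row with an unexpected year or extra text before the sign would come back as NA instead of being matched.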
