Problem with regex (check string for certain repetitions)

Time:04-30

I would like to check whether a text contains a) three consonants in a row or b) four identical letters in a row. Can someone please help me with the regular expressions?

library(tidyverse)

df <- data.frame(text = c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf", "another text", "a last one", "sj", "ngbas"))

consonants <- c("q", "w", "r", "t", "z", "p", "s", "d", "f", "g", "h", "k", "l", "m", "n", "b", "x")

df %>% mutate(
         invalid = FALSE, 
         # Length too short
         invalid = ifelse(nchar(text)<3, TRUE, invalid),
         # Contains three consonants in a row: e.g. "ngbas"
         invalid = ifelse(str_detect(text,"???"),  TRUE, FALSE),   # <--- Regex missing
         # More than 3 identical characters in a row: e.g. "flahaaaa" 
         invalid = ifelse(str_detect(text,"???"),  TRUE, FALSE)    # <--- Regex missing
       )

CodePudding user response:

Three consonants in a row (using the consonant list from the question, which omits c, j, v and y):

[qwrtzpsdfghklmnbx]{3}
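
As a quick check, this pattern can be run against the sample strings from the question. This is a minimal sketch; the variable names are only for illustration. Because the character class is lowercase only and omits c, j, v and y, "abcdefg" is not flagged, while "Completely valid" is flagged because of "mpl".

```r
library(stringr)

text <- c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf",
          "another text", "a last one", "sj", "ngbas")

# TRUE where three characters from the class occur in a row
three_consonants <- str_detect(text, "[qwrtzpsdfghklmnbx]{3}")

print(data.frame(text, three_consonants))
# Flagged: "Completely valid" (mpl), "asdf" (sdf), "ngbas" (ngb)
```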

Sequences of length > 3 of a specific char:

([a-z])(\\1){3}
    # The backslash is doubled because it is itself the escape character in R strings.

The latter uses a backreference. The number is the index of the capture group (the expression in parentheses) being referenced. Here group 1 captures a single lowercase Latin letter, and \1 must then match that same letter again, three more times.

For simplicity, these patterns match lowercase letters only; uppercase letters are not handled here.
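
To illustrate the backreference and the case limitation, here is a small sketch. The sample strings are chosen for illustration, and the case-insensitive variant uses stringr's `regex(..., ignore_case = TRUE)`, which is one possible way to handle uppercase input.

```r
library(stringr)

samples <- c("flahaaaa", "blablabla", "FLAHAAAA")

# Lowercase-only pattern: group 1 captures one letter, \\1{3} repeats it
lower_only <- str_detect(samples, "([a-z])\\1{3}")
print(lower_only)
# TRUE FALSE FALSE

# Same pattern, matched case-insensitively
any_case <- str_detect(samples, regex("([a-z])\\1{3}", ignore_case = TRUE))
print(any_case)
# TRUE FALSE TRUE
```

"blablabla" is not flagged in either case: it repeats a three-letter sequence, not a single letter.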

Without backreferences, the solution gets a bit lengthy:

(aaaa|bbbb|cccc|dddd|eeee|ffff|gggg|hhhh|iiii|jjjj|kkkk|llll|mmmm|nnnn|oooo|pppp|qqqq|rrrr|ssss|tttt|uuuu|vvvv|wwww|xxxx|yyyy|zzzz)
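
If backreferences are not an option, the alternation does not have to be typed by hand. A minimal sketch in base R, assuming only the 26 lowercase Latin letters matter (`pattern` is an illustrative name):

```r
# Repeat each lowercase letter four times: "aaaa", "bbbb", ..., "zzzz"
runs <- strrep(letters, 4)

# Join the runs with "|" and wrap them in parentheses
pattern <- paste0("(", paste(runs, collapse = "|"), ")")

print(grepl(pattern, c("flahaaaa", "blablabla")))
# TRUE FALSE
```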

The relevant docs can be found here.

CodePudding user response:

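You don't need a separate length check for the regexes themselves: a string shorter than the required run can never match. Note, though, that dropping the nchar() line also drops the rule that flags strings shorter than three characters (such as "sj") as invalid, so keep it if you still need that rule.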

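Your code has a bug: each ifelse returns FALSE instead of invalid when its condition is not met, so the last ifelse overwrites every earlier result. For example, if the second ifelse is TRUE and the third is FALSE, the output is FALSE. In effect only the last check counts, whereas you want a string to be invalid if any check matches (an OR condition).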
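
The overwrite can be reproduced on two of the sample strings. This is a small sketch with illustrative variable names, using the lowercase-only patterns from the first answer:

```r
library(stringr)

text <- c("asdf", "flahaaaa")

three_consonants <- str_detect(text, "[qwrtzpsdfghklmnbx]{3}")  # TRUE FALSE
four_identical   <- str_detect(text, "([a-z])\\1{3}")           # FALSE TRUE

# Pattern from the question: each step falls back to FALSE,
# so the second assignment discards the first result
overwritten <- ifelse(three_consonants, TRUE, FALSE)
overwritten <- ifelse(four_identical, TRUE, FALSE)
print(overwritten)
# FALSE TRUE  ("asdf" is wrongly reported as valid)

# Combining the checks with OR keeps every earlier hit
accumulated <- three_consonants | four_identical
print(accumulated)
# TRUE TRUE
```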

I have fixed this by passing invalid, rather than FALSE, as the fallback value of each ifelse.

Here is the complete code:

df %>% mutate(
         invalid = FALSE,
         # Contains three consonants in a row: e.g. "ngbas"
         # (falls back to the previous value of invalid instead of FALSE)
         invalid = ifelse(str_detect(text, regex("[BCDFGHJKLMNPQRSTVWXYZ]{3}", ignore_case = TRUE)), TRUE, invalid),
         # More than 3 identical characters in a row: e.g. "flahaaaa"
         invalid = ifelse(str_detect(text, regex("([a-z])\\1{3}", ignore_case = TRUE)), TRUE, invalid)
       )
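
For comparison, the same checks can be combined in a single expression with `|`, which avoids the chained ifelse calls altogether. This is a sketch of one possible variant, not the only way to write it; it keeps the length check from the question, and `result` is an illustrative name.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  text = c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf",
           "another text", "a last one", "sj", "ngbas"),
  stringsAsFactors = FALSE
)

result <- df %>% mutate(
  invalid =
    # Length too short
    nchar(text) < 3 |
    # Three consonants in a row
    str_detect(text, regex("[bcdfghjklmnpqrstvwxyz]{3}", ignore_case = TRUE)) |
    # Four identical letters in a row
    str_detect(text, regex("([a-z])\\1{3}", ignore_case = TRUE))
)

print(result)
# Valid (invalid == FALSE): "blablabla", "another text", "a last one"
```

With the full consonant set, "Completely valid" (mpl) and "abcdefg" (bcd) are flagged as well, and "sj" is flagged only because of the length check.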