Home > Back-end >  How to account for all whitespace (spaces, tabs, new lines) in regexp_substr() method?
How to account for all whitespace (spaces, tabs, new lines) in regexp_substr() method?

Time:02-11

I have the following code snippet in SQL to select the next piece of text after ABC DEF that's of variable length:

SELECT trim('ABC DEF ' FROM regexp_substr(my_field, 'ABC DEF ([^ ] )')) FROM my_table

Sample Data:

'{random text here}           ABC DEF          {my_variable_length_keyword} {random text here}'

Expected Output: {my_variable_length_keyword}

While this works, it only accounts for cases where there is one space after ABC DEF. How would I deal with cases where there are tabs, new lines, or multiple spaces before the next word?

I've tried:

SELECT trim('ABC DEF ' FROM regexp_substr(my_field, 'ABC DEF\s ([^ ] )')) FROM my_table

But this doesn't yield any result.

Can someone please help me out with this? Thank you!

CodePudding user response:

(?<=ABC DEF)\s \S or (?<=ABC DEF)\\s \\S for string literal used in other language.

Trim the leading whitespaces, then you'll get the final output.

Some notes:

  1. (?<=pattern) is positive lookbehind, it works like normal regex but won't be put into match result, place it before strings that you want to match.
  2. in most platform, pattern in lookbehind doesn't support * and . For example, (?<=ABC DEF\s ) is unsupported in most cases, you need to move \s outside to match whitespaces and trim them later, check Important Notes About Lookbehind
  3. Don't know what pattern {my_variable_length_keyword} has though, but it seems it doesn't contain whitespaces, \S can match one non-whitespace character.

CodePudding user response:

You do not need to use any lookarounds, nor do you need to trim the prefix matched with the regex, since with REGEXP_SUBSTR you can specify a capturing group and extract just its contents.

SELECT regexp_substr(my_field, 'ABC\\s DEF\\s (\\S )', 1, 1, 'e') FROM my_table

This is the regex demo.

Details:

  • ABC\s DEF\s (\S ) - ABC, one or more whitespaces, DEF, one or more whitespaces (this text is just matched) and then one or more non-whitespace chars are captured into Group 1 (note you need to double escape backslashes as the single backslash is used to form string escape sequences, and to introduce a literal \ it needs doubling)
  • 1 - start searching from the first char in the string
  • 1 - tells the regex engine to extract the first occurrence of the pattern
  • 'e' - enables extraction of the subexpression (aka capturing group value) from the resulting match.
  • Related