Home > Software design >  Postgresql - Regex to get all words in string without special characters except -
Postgresql - Regex to get all words in string without special characters except -

Time:09-16

Input

Word-Word, Some other words    and this is another word et     another one   

Expected output

Word-Word
Some
other
words
this
is
another
word
another
one

I have a table (t) with many strings like the one showed in the input.

I'm trying to get every word in the sentence but the comas (','), the word 'and', 'et', 'und' and of course every whitespace or sequence of whitespace that may be between words.

Regex that I'm using:

  1. Don't match whitespace \\s
  2. Don't match whitespace as long as special characters ((\b[^\s] \b)((?<=\.\w).)?) - doesn't work in postgres for some reason
  3. Don't match a particular word ^(?!et$|and$|und$) - doesn't work either

Query that I'm running

SELECT word FROM t, 
unnest(regexp_split_to_array(t.word, E'Missing expression')) as word;

CodePudding user response:

You can use an extracting approach here in the following way:

SELECT regexp_matches(
   'Word-Word, Some other words    and this is another word et     another one   ',
   E'\\y(?!(?:et|[ua]nd)\\y)\\w (?:-\\w )*',
   'g');

See the online demo. Regex details:

  • \y - a word boundary
  • (?!(?:et|[ua]nd)\y) - a negative lookahead that fails the match if there is et, und or and as whole words immediately to the right of the current location
  • \w (?:-\w )* - one or more word chars and then zero or more occurrences of - and one or more word char sequences

See the regex demo (converted to PCRE).

  • Related