How to use regexp_count with regexp_substr to output multiple matches per string in SQL (Redshift)?-CodePudding

I have a table containing a column with strings. I want to extract all pieces of text in each string that come immediately after a certain substring. For this minimum reproducible example, let's assume this substring is abc. So I want all subsequent terms after abc.

I'm able to achieve this in cases where there is only 1 abc per row, but my logic fails when there are multiple abcs. I'm also getting the number of substring occurrences, but am having trouble relating that to retrieving all of those occurrences.

My approach/attempt:

I created a temp table that contains the # of successful regex matches in my main string:

CREATE TEMP TABLE match_count AS (
SELECT DISTINCT id, main_txt, regexp_count(main_txt, 'abc (\\S )', 1) AS cnt
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S )', 1) > 0);

My output:

id   main_txt                         cnt
1    wpfwe abc weiofnew abc wieone    2
2    abc weoin                        1
3    abc weoifn abc we abc w          3

To get my final output, I have a query like:

SELECT id, main_txt, regexp_substr(main_txt, 'abc (\\S )', 1, cnt, 'e') AS output
FROM match_count;

My actual final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          w

My expected final output:

id   main_txt                         output
1    wpfwe abc weiofnew abc wieone    weiofnew
1    wpfwe abc weiofnew abc wieone    wieone
2    abc weoin                        weoin
3    abc weoifn abc we abc w          weoifn
3    abc weoifn abc we abc w          we
3    abc weoifn abc we abc w          w

So my code only gets the final match (where the occurrence # = cnt). How can I modify it to include every match?

CodePudding user response：

Unfortunately I don't have access to Redshift to test this, however I have tested it on Oracle (which has similar regexp functions) and it works there. The way this solves the problem is to use a recursive cte to make a list of match numbers for each string (so if there are 2 matches, it generates rows with 1 and 2 in them), these are then joined back to the main table as the occurrence parameter to regexp_substr:

WITH RECURSIVE match_counts(id, match_count) AS (
  SELECT DISTINCT id, regexp_count(main_txt, 'abc (\\S )', 1)
  FROM my_data_source
  WHERE regexp_count(main_txt, 'abc (\\S )', 1) > 0
),
match_nums(id, match_num, match_count) AS (
  SELECT id, 1, match_count
  FROM match_counts
  UNION ALL
  SELECT id, match_num   1, match_count
  FROM match_nums
  WHERE match_num < match_count
)
SELECT m.id, main_txt, regexp_substr(main_txt, 'abc (\\S )', 1, match_num, 'e') AS output
FROM my_data_source m
JOIN match_nums n ON m.id = n.id
ORDER BY m.id, n.match_num

Oracle demo on dbfiddle (Note that Oracle doesn't support the e parameter to regexp_substr so returns the entire match instead of the group`).

CodePudding user response：

The solutions below do not handle the case where main_text has consecutive occurrences of abc consistently.

ex.

wpfwe abc abc abc weiofnew abc wieone

set up

CREATE TABLE test_hal_unnest (id int, main_text varchar (500));
INSERT INTO test_hal_unnest VALUES 
(1, 'wpfwe abc weiofnew abc wieone'),
(2, 'abc weoin'),
(3, 'abc weoifn abc we abc w');

Possible solution by splitting the string into words

assuming you are searching for all words that comes after the word abc in a string, you don't necessarily have to use regex. regex support in redshift is unfortunately not as full featured as postgres or some other databases. for instance, you can't extract all substrings that match a regex pattern to an array, or split a string to an array based on a regex pattern.

steps:

split text to array with delimiter ' '
unnest array with ordinality
look up the previous array element using LAG, ordered by the word index
filter rows where the previous word is abc

the extra columns idx & prev_word are left in the final output to illustrate how the problem is solved. they may be dropped from the final query without issue

WITH text_split AS (
SELECT Id
, main_text
, SPLIT_TO_ARRAY(main_text, ' ') text_arr
FROM test_hal_unnest
)
, text_unnested AS (
SELECT ts.id
, ts.main_text
, ts.text_arr
, CAST(ta as VARCHAR) text_word -- converts super >> text
, idx -- this is the word index
FROM text_split ts
JOIN ts.text_arr ta AT idx 
  ON TRUE
-- ^^ array unnesting happens via joins

)
, with_prevword AS (
SELECT id
, main_text
, idx
, text_word
, LAG(text_word) over (PARTITION BY id ORDER BY idx) prev_word
FROM text_unnested
ORDER BY id, idx
)
SELECT *
FROM with_prevword
WHERE prev_word = 'abc';

output:

 id |           main_text           | idx | text_word | prev_word
---- ------------------------------- ----- ----------- -----------
  1 | wpfwe abc weiofnew abc wieone |   2 | weiofnew  | abc
  1 | wpfwe abc weiofnew abc wieone |   4 | wieone    | abc
  2 | abc weoin                     |   1 | weoin     | abc
  3 | abc weoifn abc we abc w       |   1 | weoifn    | abc
  3 | abc weoifn abc we abc w       |   3 | we        | abc
  3 | abc weoifn abc we abc w       |   5 | w         | abc
(6 rows)

note on unnest array with ordinality

quoting redshift documentation on this topic, since its kind of hidden

Amazon Redshift also supports an array index when iterating over the array using the AT keyword. The clause x AS y AT z iterates over array x and generates the field z, which is the array index.

alternative shorter solution by splitting on `abc`

This problem would be more easily solved with regular expression functionality available in redsfhit if instead of

1, wpfwe abc weiofnew abc wieone

the source data was already split up into multiple rows on abc

1, wpfwe
1, abc weiofnew
1, abc wieone

This solution first expands the source data by splitting on abc. however since split_to_array does not accepts are regular expression pattern, we first inject a delimiter ; before abc, and then split on ;.

Any delimiter will work, as long as it is guaranteed not to be present in column main_text

WITH text_array AS (
SELECT
  id
, main_text
, SPLIT_TO_ARRAY(REGEXP_REPLACE(main_text, 'abc ', ';abc '), ';') array
FROM test_hal_unnest
)
SELECT
  ta.id
, ta.main_text
, REGEXP_SUBSTR(CAST(st AS VARCHAR), 'abc (\\S )', 1, 1, 'e') output
FROM text_array ta
JOIN ta.array st ON TRUE
WHERE st LIKE 'abc%';

set up

Possible solution by splitting the string into words

note on unnest array with ordinality

alternative shorter solution by splitting on abc

alternative shorter solution by splitting on `abc`