I have a table containing a column with strings. I want to extract all pieces of text in each string that come immediately after a certain substring. For this minimum reproducible example, let's assume this substring is abc
. So I want all subsequent terms after abc
.
I'm able to achieve this in cases where there is only 1 abc
per row, but my logic fails when there are multiple abc
s. I'm also getting the number of substring occurrences, but am having trouble relating that to retrieving all of those occurrences.
My approach/attempt:
I created a temp table that contains the # of successful regex matches in my main string:
CREATE TEMP TABLE match_count AS (
SELECT DISTINCT id, main_txt, regexp_count(main_txt, 'abc (\\S )', 1) AS cnt
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S )', 1) > 0);
My output:
id main_txt cnt
1 wpfwe abc weiofnew abc wieone 2
2 abc weoin 1
3 abc weoifn abc we abc w 3
To get my final output, I have a query like:
SELECT id, main_txt, regexp_substr(main_txt, 'abc (\\S )', 1, cnt, 'e') AS output
FROM match_count;
My actual final output:
id main_txt output
1 wpfwe abc weiofnew abc wieone wieone
2 abc weoin weoin
3 abc weoifn abc we abc w w
My expected final output:
id main_txt output
1 wpfwe abc weiofnew abc wieone weiofnew
1 wpfwe abc weiofnew abc wieone wieone
2 abc weoin weoin
3 abc weoifn abc we abc w weoifn
3 abc weoifn abc we abc w we
3 abc weoifn abc we abc w w
So my code only gets the final match (where the occurrence # = cnt
). How can I modify it to include every match?
CodePudding user response:
Unfortunately I don't have access to Redshift to test this, however I have tested it on Oracle (which has similar regexp functions) and it works there. The way this solves the problem is to use a recursive cte to make a list of match numbers for each string (so if there are 2 matches, it generates rows with 1 and 2 in them), these are then joined back to the main table as the occurrence
parameter to regexp_substr
:
WITH RECURSIVE match_counts(id, match_count) AS (
SELECT DISTINCT id, regexp_count(main_txt, 'abc (\\S )', 1)
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S )', 1) > 0
),
match_nums(id, match_num, match_count) AS (
SELECT id, 1, match_count
FROM match_counts
UNION ALL
SELECT id, match_num 1, match_count
FROM match_nums
WHERE match_num < match_count
)
SELECT m.id, main_txt, regexp_substr(main_txt, 'abc (\\S )', 1, match_num, 'e') AS output
FROM my_data_source m
JOIN match_nums n ON m.id = n.id
ORDER BY m.id, n.match_num
Oracle demo on dbfiddle (Note that Oracle doesn't support the e
parameter to regexp_substr
so returns the entire match instead of the group`).
CodePudding user response:
The solutions below do not handle the case where main_text
has consecutive occurrences of abc
consistently.
ex.
wpfwe abc abc abc weiofnew abc wieone
set up
CREATE TABLE test_hal_unnest (id int, main_text varchar (500));
INSERT INTO test_hal_unnest VALUES
(1, 'wpfwe abc weiofnew abc wieone'),
(2, 'abc weoin'),
(3, 'abc weoifn abc we abc w');
Possible solution by splitting the string into words
assuming you are searching for all words that comes after the word abc
in a string, you don't necessarily have to use regex. regex support in redshift is unfortunately not as full featured as postgres or some other databases. for instance, you can't extract all substrings that match a regex pattern to an array, or split a string to an array based on a regex pattern.
steps:
- split text to array with delimiter
' '
- unnest array with ordinality
- look up the previous array element using
LAG
, ordered by the word index - filter rows where the previous word is
abc
the extra columns idx
& prev_word
are left in the final output to illustrate how the problem is solved. they may be dropped from the final query without issue
WITH text_split AS (
SELECT Id
, main_text
, SPLIT_TO_ARRAY(main_text, ' ') text_arr
FROM test_hal_unnest
)
, text_unnested AS (
SELECT ts.id
, ts.main_text
, ts.text_arr
, CAST(ta as VARCHAR) text_word -- converts super >> text
, idx -- this is the word index
FROM text_split ts
JOIN ts.text_arr ta AT idx
ON TRUE
-- ^^ array unnesting happens via joins
)
, with_prevword AS (
SELECT id
, main_text
, idx
, text_word
, LAG(text_word) over (PARTITION BY id ORDER BY idx) prev_word
FROM text_unnested
ORDER BY id, idx
)
SELECT *
FROM with_prevword
WHERE prev_word = 'abc';
output:
id | main_text | idx | text_word | prev_word
---- ------------------------------- ----- ----------- -----------
1 | wpfwe abc weiofnew abc wieone | 2 | weiofnew | abc
1 | wpfwe abc weiofnew abc wieone | 4 | wieone | abc
2 | abc weoin | 1 | weoin | abc
3 | abc weoifn abc we abc w | 1 | weoifn | abc
3 | abc weoifn abc we abc w | 3 | we | abc
3 | abc weoifn abc we abc w | 5 | w | abc
(6 rows)
note on unnest array with ordinality
quoting redshift documentation on this topic, since its kind of hidden
Amazon Redshift also supports an array index when iterating over the array using the AT keyword. The clause x AS y AT z iterates over array x and generates the field z, which is the array index.
alternative shorter solution by splitting on abc
This problem would be more easily solved with regular expression functionality available in redsfhit if instead of
1, wpfwe abc weiofnew abc wieone
the source data was already split up into multiple rows on abc
1, wpfwe
1, abc weiofnew
1, abc wieone
This solution first expands the source data by splitting on abc. however since split_to_array does not accepts are regular expression pattern, we first inject a delimiter ;
before abc
, and then split on ;
.
Any delimiter will work, as long as it is guaranteed not to be present in column main_text
WITH text_array AS (
SELECT
id
, main_text
, SPLIT_TO_ARRAY(REGEXP_REPLACE(main_text, 'abc ', ';abc '), ';') array
FROM test_hal_unnest
)
SELECT
ta.id
, ta.main_text
, REGEXP_SUBSTR(CAST(st AS VARCHAR), 'abc (\\S )', 1, 1, 'e') output
FROM text_array ta
JOIN ta.array st ON TRUE
WHERE st LIKE 'abc%';