Home > Blockchain >  Oracle: Trying to split a string using REGEXP_SUBSTR, CONNECT BY LEVEL functions and getting duplica
Oracle: Trying to split a string using REGEXP_SUBSTR, CONNECT BY LEVEL functions and getting duplica

Time:08-05

I have a table that has a column that holds an html string that may or may not be well formed (tried to use xmltable route and it didn't work) which is why I am trying to use the following sql syntax. I am trying to create a query that uses REGEXP_SUBSTR, LEVEL and CONNECT BY LEVEL based on examples I found online but I am not able to get the results I would expect:

I created an example query below:

with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')

By running this query, I would expect 7 rows to come back:

ID Contents Data_Jump
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112

Instead its coming back with 22 rows.

ID Contents Data_Jump
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114
2 Customer Servic 112
1 Tecnical Servic 113
2 Other Servic 114
2 Tecnical Servic 113
2 Other Servic 114

I don't know where the problem is that I am getting duplicates. On the actual production table there are a lot of rows so the query just spins and never returns. I suspect its because a lot of duplicates are being generated. Can anyone tell me how to fix the above query so that I can try to transport it onto my actual query?

Thanks,

CodePudding user response:

You need to include the ID in the connect-by clause; but for this to wotk you also need to include a non-deterministic function call; e.g.:

AND PRIOR id = id
AND PRIOR dbms_random.value IS NOT NULL

So

with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
)
SELECT ID,
REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, LEVEL, NULL, 1) as contents,
REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, LEVEL, NULL, 1) as data_jump
FROM qry
CONNECT BY LEVEL <= REGEXP_COUNT(html_string, '<a.*?>(.*?).</a>')
AND PRIOR id = id
AND PRIOR dbms_random.value IS NOT NULL
ID CONTENTS DATA_JUMP
1 Client Servic 111
1 Customer Servic 112
1 Tecnical Servic 113
2 Client Servic 111
2 Customer Servic 112
2 Tecnical Servic 113
2 Other Servic 114

db<>fiddle

Or you could use recursive subquery factoring instead of a hierarchical query, which is maybe easier to understand:

WITH rcte (id, html_string, lvl, contents, data_jump) AS (
  SELECT ID, html_string, 1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, 1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
  FROM qry
  UNION ALL
  SELECT ID, html_string, lvl   1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl   1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl   1, NULL, 1)
  FROM rcte
  WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?).</a>', 1, lvl   1, NULL, 1) IS NOT NULL
  OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl   1, NULL, 1) IS NOT NULL
)
SELECT id, contents, data_jump
FROM rcte
ORDER BY id, lvl

db<>fiddle

Or even, preserving the last character of the 'contents' by removing the stray . from the regex as @Gary_W pointed out:

with qry as (
select 1 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' as html_string from dual
union
select 2 as id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' as html_string from dual
),
rcte (id, html_string, lvl, contents, data_jump) AS (
  SELECT ID, html_string, 1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, 1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, 1, NULL, 1)
  FROM qry
  UNION ALL
  SELECT ID, html_string, lvl   1,
    REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl   1, NULL, 1),
    REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl   1, NULL, 1)
  FROM rcte
  WHERE REGEXP_SUBSTR(html_string, '<a.*?>(.*?)</a>', 1, lvl   1, NULL, 1) IS NOT NULL
  OR REGEXP_SUBSTR(html_string, 'data-jump="(.*?)"', 1, lvl   1, NULL, 1) IS NOT NULL
)
SELECT id, contents, data_jump
FROM rcte
ORDER BY id, lvl
ID CONTENTS DATA_JUMP
1 Client Service 111
1 Customer Service 112
1 Tecnical Service 113
2 Client Service 111
2 Customer Service 112
2 Tecnical Service 113
2 Other Service 114

db<>fiddle

CodePudding user response:

Just cross join with a series of integers. Also more portable than CONNECT BY ...

-- your input, don't use in final query ...
WITH qry AS (
SELECT 1 AS id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div>' AS html_string FROM dual
UNION
SELECT 2 AS id, '<div>What would you like to do today?<a href="#" data-jump="111">Client Service</a><a href="#" data-jump="112">Customer Service</a><a href="#" data-jump="113">Tecnical Service</a></div><a href="#" data-jump="114">Other Service</a></div>' AS html_string FROM dual
)
-- end of input; real query starts here, replace following comma with "WITH"
,
-- need a bunch of consecutive integers ...
i(i) AS (
            SELECT 1 FROM dual
  UNION ALL SELECT 2 FROM dual
  UNION ALL SELECT 3 FROM dual
  UNION ALL SELECT 4 FROM dual
)
SELECT
  id
, REGEXP_SUBSTR(html_string,'(?<=data-jump=")\d ',1,i) AS data_jump
, REGEXP_SUBSTR(html_string,'(?<=\"\>)[^<] ',1,i)      AS content
FROM qry CROSS JOIN i
WHERE REGEXP_SUBSTR(html_string,'(?<=data-jump=")\d ',1,i) <> ''
ORDER BY id,i;
-- out  id | data_jump |     content      
-- out ---- ----------- ------------------
-- out   1 | 111       | Client Service
-- out   1 | 112       | Customer Service
-- out   1 | 113       | Tecnical Service
-- out   2 | 111       | Client Service
-- out   2 | 112       | Customer Service
-- out   2 | 113       | Tecnical Service
-- out   2 | 114       | Other Service
  • Related