What does this regular expression mean in Oracle?-CodePudding

SELECT  REGEXP_REPLACE(LISTAGG(A.ID, ',') WITHIN GROUP (ORDER BY A.ID), '([^,] )(,\1) ', '\1')
FROM    TABLE A

I don't know what "\1" means in the above SQL. After creating a list by separating "A.ID" with commas through "LISTAGG", the purpose seems to be to remove the duplicate "A.ID", but I want to know the exact meaning.

For reference, "A.ID" is a NUMBER(4) column type. (e.g. 1111, 2222...)

CodePudding user response：

It means that you are attempting to replace duplicates.

REGEXP_REPLACE(
  comma_separated_list,
  '([^,] )(,\1) ',
  '\1'
)

Then:

([^,] ) will match one-or-more non-comma characters and store the value in a capturing group.
,\1 will match a comma and then the value from that first capturing group.
(,\1) matches a comma and then the value from that first capturing group and matches it all one-or-more times.

However, it does not work reliably. If you have the list 12,123,123,123,123,1234,4444

Then:

the first match will be:
```
12,123,123,123,123,1234,4444
^^^^^
```
at the start and replace it with just 12 giving:
```
123,123,123,123,1234,4444
^^
```
It has already gone wrong as the match did not match a complete element and you have lost the first id value.
the second match will start after the first match will skip the 3, characters and match:
```
123,123,123,123,1234,4444
    ^^^^^^^^^^^^^^^
```
and replace it with 123 giving:
```
123,1234,4444
    ^^^
```
Again, the match was wrong as it did not match a complete element and it is only coincidence that the value output is correct.
and the final match will be:
```
123,1234,4444
       ^^^
```
replacing it with just 4 and giving the output:
```
123,1234444
```
Which is very wrong as you are now missing the id 12 and have an incorrect id of 1234444.

What you should probably be doing is filtering the duplicates before aggregating.

In newer Oracle versions it is simply:

SELECT  LISTAGG(DISTINCT ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    TABLE

or in older versions:

SELECT  LISTAGG(ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    (
  SELECT DISTINCT id FROM TABLE
)

If you did want to use regular expressions (which will be a more inefficient than using DISTINCT) then you can double up the delimiters and ensure you always match complete elements using (,[^,] ,)\1 but then you also need to remove the repeated delimiters after de-duplication (which makes an inefficient solution even more inefficient):

SELECT  TRIM(
          BOTH ',' FROM
          REPLACE(
            REGEXP_REPLACE(
              LISTAGG(',' || ID || ',') WITHIN GROUP (ORDER BY ID),
              '(,[^,] ,)\1 ',
              '\1'
            ),
            ',,',
            ','
          )
        ) AS no_duplicates
FROM    TABLE_NAME

db<>fiddle here