Home > Software engineering >  Reg Exp don't select if more than one group matches (multiple XOR)
Reg Exp don't select if more than one group matches (multiple XOR)

Time:02-26

The data are held in an Oracle 12c database, one row per ICD-10-CM code, with a patient ID (foreign key) like so (note that there could be many other codes, the following are just the ones pertinent to this question):

ID   ICD10CODE
 1   S72.91XB
 1   S72.92XB
 2   S72.211A
 3   S72.414A
 3   S72.415A
 4   S32.509A
 5   S32.301A
 5   S32.821A
 6   S32.421A
 6   S32.422A
 7   S32.421A
 8   S32.421A
 8   S32.509A

The task at hand is to select distinct patients that match only one of the following points (using standard regular expression syntax):

  • Any number of: S32\.1\w\w\w, S32\.2\w\w\w, S32\.3\w\w\w, S32\.5\w\w\w, S32\.6\w\w\w, S32\.7\w\w\w, S32\.8\w\w\w
  • Any number of: S32\.4\w1\w, S32\.4\w3\w, S32\.4\w4\w, S32\.4\w6\w, S32\.4\w7\w, S32\.4\w8\w, S32\.4\w9\w
  • Any number of: S32\.4\w2\w, S32\.4\w3\w, S32\.4\w5\w, S32\.4\w6\w, S32\.4\w7\w, S32\.4\w8\w, S32\.4\w9\w
  • Any number of: S72\.[0-8]\w1\w, S72\.[0-8]\w3\w, S72\.[0-8]\w4\w, S72\.[0-8]\w6\w, S72\.[0-8]\w7\w, S72\.[0-8]\w8\w, S72\.[0-8]\w9\w
  • Any number of: S72\.[0-8]\w2\w, S72\.[0-8]\w3\w, S72\.[0-8]\w5\w, S72\.[0-8]\w6\w, S72\.[0-8]\w7\w, S72\.[0-8]\w8\w, S72\.[0-8]\w9\w
  • Any number of: S72\.91\w\w, S72\.93\w\w, S72\.94\w\w, S72\.96\w\w, S72\.97\w\w, S72\.98\w\w, S72\.99\w\w
  • Any number of: S72\.92\w\w, S72\.93\w\w, S72\.95\w\w, S72\.96\w\w, S72\.97\w\w, S72\.98\w\w, S72\.99\w\w

Any permutation or combination (including repetitions) of codes listed within a bullet are permitted for each patient, but permutations or combinations across rows should occur mutually exclusively for a patient. My method is to apply LISTAGG on GROUP BY ID:

ID  LISTAGG(ICD10CODE, ',')
 1  S72.91XB,S72.92XB
 2  S72.211A
 3  S72.414A,S72.415A
 4  S32.509A
 5  S32.301A,S32.821A
 6  S32.421A,S32.422A
 7  S32.421A
 8  S32.421A,S32.509A

Then filter using this regular expression, (S32\.(([1-3]|[5-8])|(4\w((1|4)|(2|5)|(3)|([5-9]))))\w )|(S72\.(([0-8]\w((1|4)|(2|5)|(3)|([5-9])))|(9((1|4)|(2|5)|(3)|([5-9]))))\w ), which is almost a literal representation of the bullets above. My expression is adapted from the idea in this answer, where it seems that, ((RB\s ) |(JJ\s ) ) automatically selects either "RB" or "JJ", but not both.

I cannot get it to work. The answer should contain only IDs 2, 4, 5, and 7. But, the expression I developed matches all IDs.

What is a solution to this problem?


[Edit] Some more information:

All these S codes above relate to injuries to the bones in the lower extremity: S32 is for fractures of the pelvis (hip bone), S72 is for fractures of the femur (thigh bone). Note that we have two femurs, and two acetabulum (socket of the pelvis where the femur connects). The S32.4 code denotes the acetabulum (the rest of the S32.[1235678]\w{3} series denotes other parts of the pelvis). Right and left femur and acetabulum are denoted by 1|4 or 2|5 in the 6th character, respectively, unless the code starts with S72.9 when those numbers appear in the 5th character.

The patients to be included in the study population should only have one of the bones broken. That means, one of the two femurs, one of the acetabulum, or the pelvis, but not a combination of them. Combinations of fractures of a single bone do not matter. For example, the right single femur can be broken in 10 different places and ways (the knee area, the middle shaft, the head, etc., each generating a different S72.\w[1|4]\w{2} code), and should still be selected.

CodePudding user response:

Option 1:

You can do it with a single regular expression:

SELECT t.id,
       t.icd10codes
FROM   ( SELECT id,
                LISTAGG(icd10code, ',') WITHIN GROUP (ORDER BY icd10code)
                  AS icd10codes
         FROM   table_name
         GROUP BY id
       ) t
WHERE  REGEXP_LIKE(
         t.icd10codes,
             '^(S32\.[1235678]\w\w\w(,|$)) $'
         || '|^(S32\.4\w[1346789]\w(,|$)) $'
         || '|^(S32\.4\w[2356789]\w(,|$)) $'
         || '|^(S72\.[0-8]\w[1346789]\w(,|$)) $'
         || '|^(S72\.[0-8]\w[2356789]\w(,|$)) $'
         || '|^(S72\.9[1346789]\w\w(,|$)) $'
         || '|^(S72\.9[2356789]\w\w(,|$)) $'
       )

Which, for your sample data:

CREATE TABLE table_name (ID, ICD10CODE) AS
SELECT 1, 'S72.91XB' FROM DUAL UNION ALL
SELECT 1, 'S72.92XB' FROM DUAL UNION ALL
SELECT 2, 'S72.211A' FROM DUAL UNION ALL
SELECT 3, 'S72.414A' FROM DUAL UNION ALL
SELECT 3, 'S72.415A' FROM DUAL UNION ALL
SELECT 4, 'S32.509A' FROM DUAL UNION ALL
SELECT 5, 'S32.301A' FROM DUAL UNION ALL
SELECT 5, 'S32.821A' FROM DUAL UNION ALL
SELECT 6, 'S32.421A' FROM DUAL UNION ALL
SELECT 6, 'S32.422A' FROM DUAL UNION ALL
SELECT 7, 'S32.421A' FROM DUAL UNION ALL
SELECT 8, 'S32.421A' FROM DUAL UNION ALL
SELECT 8, 'S32.509A' FROM DUAL;

Outputs:

ID ICD10CODES
2 S72.211A
4 S32.509A
5 S32.301A,S32.821A
7 S32.421A

Option 2:

You can put the regular expressions into a table:

CREATE TABLE matches (id, match) AS
SELECT 1, 'S32\.[1235678]\w\w\w'    FROM DUAL UNION ALL
SELECT 2, 'S32\.4\w[1346789]\w'     FROM DUAL UNION ALL
SELECT 3, 'S32\.4\w[2356789]\w'     FROM DUAL UNION ALL
SELECT 4, 'S72\.[0-8]\w[1346789]\w' FROM DUAL UNION ALL
SELECT 5, 'S72\.[0-8]\w[2356789]\w' FROM DUAL UNION ALL
SELECT 6, 'S72\.9[1346789]\w\w'     FROM DUAL UNION ALL
SELECT 7, 'S72\.9[2356789]\w\w'     FROM DUAL;

Then you can use the query:

SELECT t.id,
       m.id AS match_id,
       LISTAGG(t.icd10code, ',') WITHIN GROUP (ORDER BY t.icd10code)
         AS icd10codes
FROM   table_name t
       LEFT OUTER JOIN matches m
       PARTITION BY (m.id)
       ON (REGEXP_LIKE(t.icd10code, '^' || m.match || '$'))
GROUP BY
       t.id,
       m.id
HAVING
       COUNT(m.match) = COUNT(t.id);

Option 3:

Similar to the first option, but you can put the matches into a table and you can determine which match has been used:

SELECT t.id,
       m.id AS match_id,
       t.icd10codes
FROM   ( SELECT id,
                LISTAGG(icd10code, ',') WITHIN GROUP (ORDER BY icd10code)
                  AS icd10codes
         FROM   table_name
         GROUP BY id
       ) t
       INNER JOIN matches m
       ON (REGEXP_LIKE(t.icd10codes, '^(' || m.match || '(,|$)) $' ))

Options 2 & 3 both output:

ID MATCH_ID ICD10CODES
4 1 S32.509A
5 1 S32.301A,S32.821A
7 2 S32.421A
2 4 S72.211A

Option 4:

You can also get rid of the (slow) regular expressions and use LIKE if you store the matches as:

CREATE TABLE matches (id, match) AS
SELECT 1, 'S32.1___' FROM DUAL UNION ALL
SELECT 1, 'S32.2___' FROM DUAL UNION ALL
SELECT 1, 'S32.3___' FROM DUAL UNION ALL
SELECT 1, 'S32.5___' FROM DUAL UNION ALL
SELECT 1, 'S32.6___' FROM DUAL UNION ALL
SELECT 1, 'S32.7___' FROM DUAL UNION ALL
SELECT 1, 'S32.8___' FROM DUAL UNION ALL
SELECT 2, 'S32.4_1_' FROM DUAL UNION ALL
SELECT 2, 'S32.4_3_' FROM DUAL UNION ALL
SELECT 2, 'S32.4_4_' FROM DUAL UNION ALL
SELECT 2, 'S32.4_6_' FROM DUAL UNION ALL
SELECT 2, 'S32.4_7_' FROM DUAL UNION ALL
SELECT 2, 'S32.4_8_' FROM DUAL UNION ALL
SELECT 2, 'S32.4_9_' FROM DUAL UNION ALL
SELECT 3, 'S32.4_2_' FROM DUAL UNION ALL
SELECT 3, 'S32.4_3_' FROM DUAL UNION ALL
SELECT 3, 'S32.4_5_' FROM DUAL UNION ALL
SELECT 3, 'S32.4_6_' FROM DUAL UNION ALL
SELECT 3, 'S32.4_7_' FROM DUAL UNION ALL
SELECT 3, 'S32.4_8_' FROM DUAL UNION ALL
SELECT 3, 'S32.4_9_' FROM DUAL UNION ALL
SELECT 4, 'S72.' || (LEVEL - 1) || '_1_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 4, 'S72.' || (LEVEL - 1) || '_3_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 4, 'S72.' || (LEVEL - 1) || '_4_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 4, 'S72.' || (LEVEL - 1) || '_6_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 4, 'S72.' || (LEVEL - 1) || '_7_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 4, 'S72.' || (LEVEL - 1) || '_8_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 4, 'S72.' || (LEVEL - 1) || '_9_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 5, 'S72.' || (LEVEL - 1) || '_2_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 5, 'S72.' || (LEVEL - 1) || '_3_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 5, 'S72.' || (LEVEL - 1) || '_5_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 5, 'S72.' || (LEVEL - 1) || '_6_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 5, 'S72.' || (LEVEL - 1) || '_7_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 5, 'S72.' || (LEVEL - 1) || '_8_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 5, 'S72.' || (LEVEL - 1) || '_9_' FROM DUAL CONNECT BY LEVEL <= 9 UNION ALL
SELECT 6, 'S72.91__' FROM DUAL UNION ALL
SELECT 6, 'S72.93__' FROM DUAL UNION ALL
SELECT 6, 'S72.94__' FROM DUAL UNION ALL
SELECT 6, 'S72.96__' FROM DUAL UNION ALL
SELECT 6, 'S72.97__' FROM DUAL UNION ALL
SELECT 6, 'S72.98__' FROM DUAL UNION ALL
SELECT 6, 'S72.99__' FROM DUAL UNION ALL
SELECT 7, 'S72.92__' FROM DUAL UNION ALL
SELECT 7, 'S72.93__' FROM DUAL UNION ALL
SELECT 7, 'S72.95__' FROM DUAL UNION ALL
SELECT 7, 'S72.96__' FROM DUAL UNION ALL
SELECT 7, 'S72.97__' FROM DUAL UNION ALL
SELECT 7, 'S72.98__' FROM DUAL UNION ALL
SELECT 7, 'S72.99__' FROM DUAL;

Then use the query:

SELECT t.id,
       m.id AS match_id,
       LISTAGG(t.icd10code, ',') WITHIN GROUP (ORDER BY t.icd10code)
         AS icd10codes
FROM   table_name t
       LEFT OUTER JOIN matches m
       PARTITION BY (m.id)
       ON (t.icd10code LIKE m.match)
GROUP BY
       t.id,
       m.id
HAVING
       COUNT(m.match) = COUNT(t.id);

db<>fiddle here

CodePudding user response:

Ok, I've added your broken bones codes to the S32 and S72 series.
That's all that needed to be done really.

Feel free to change ,? to (,|$) but don't change anything else.
Let me know if the broken bones codes is about right.

  • S32\.\w[25]\w\w, S32.1\w\w\w, S32.2\w\w\w, S32.3\w\w\w, S32.5\w\w\w, S32.6\w\w\w, S32.7\w\w\w, S32.8\w\w\w
  • S32\.\w[25]\w\w, S32.4\w1\w, S32.4\w3\w, S32.4\w4\w, S32.4\w6\w, S32.4\w7\w, S32.4\w8\w, S32.4\w9\w
  • S32\.\w[25]\w\w, S32.4\w2\w, S32.4\w3\w, S32.4\w5\w, S32.4\w6\w, S32.4\w7\w, S32.4\w8\w, S32.4\w9\w
  • S72\.\w[14]\w\w, S72.[0-8]\w1\w, S72.[0-8]\w3\w, S72.[0-8]\w4\w, S72.[0-8]\w6\w, S72.[0-8]\w7\w, S72.[0-8]\w8\w, S72.[0-8]\w9\w
  • S72\.\w[14]\w\w, S72.[0-8]\w2\w, S72.[0-8]\w3\w, S72.[0-8]\w5\w, S72.[0-8]\w6\w, S72.[0-8]\w7\w, S72.[0-8]\w8\w, S72.[0-8]\w9\w
  • S72\.[14]\w\w\w, S72.91\w\w, S72.93\w\w, S72.94\w\w, S72.96\w\w, S72.97\w\w, S72.98\w\w, S72.99\w\w
  • S72\.[14]\w\w\w, S72.92\w\w, S72.93\w\w, S72.95\w\w, S72.96\w\w, S72.97\w\w, S72.98\w\w, S72.99\w\w

The new regex is

^((S32\.(\w[25]|[1-35-8]\w)\w\w,?) |(S32\.(\w[25]\w|4\w[1346-9])\w,?) |(S32\.(\w[25]\w|4\w[235-9])\w,?) |(S72\.(\w[14]\w|[0-8]\w[1346-9])\w,?) |(S72\.(\w[14]\w|[0-8]\w[235-9])\w,?) |(S72\.([14]\w|9[1346-9])\w\w,?) |(S72\.([14]\w|9[235-9])\w\w,?) )$

https://regex101.com/r/OAHdCO/1

 ^ 
 (
    ( S32 \. ( \w [25] | [1-35-8] \w ) \w\w ,? ) 
  | ( S32 \. ( \w [25] \w | 4 \w [1346-9] ) \w ,? ) 
  | ( S32 \. ( \w [25] \w | 4 \w [235-9] ) \w ,? ) 
  | ( S72 \. ( \w [14] \w | [0-8] \w [1346-9] ) \w ,? ) 
  | ( S72 \. ( \w [14] \w | [0-8] \w [235-9] ) \w ,? ) 
  | ( S72 \. ( [14] \w | 9 [1346-9] ) \w\w ,? ) 
  | ( S72 \. ( [14] \w | 9 [235-9] ) \w\w ,? ) 
 )
 $
  • Related