Home > other >  Oracle SQL - matching different rows by multiple fields
Oracle SQL - matching different rows by multiple fields

Time:08-13

I am trying to solve the matching of different customers, from different entities. Using only SQL select and "with" syntax, no PL/SQL. Below is a simple example. There are three customers, who are each 3 different individuals, but are recorded slightly different in different entities (1,2,3). I am matching them by P1, P2 and P3 field. In addition, P2 and P3 needs to have a cross-field check.

Customer David - only in entity 1 and 2, in both entities has P1 = 100. Easy match. Gets 'G1'

Customer Lloyd - only in entity 1 and 3, matches with cross-field check of P2 = P3. Gets 'G2'

Customer Mark - in all three entities. In entity 1 and 3 they match by P1. But the row in entity 2 matches only with cross-field check with entity 1, P2. This person needs to have 'G3' assigned in all three entities. Thoughest case.

The G, eg. G3243 represents customer's unique group ID. It is the same for one customer inside different entities and different from all other customers' Group IDs.

Any ideas how to do this? Sample table below:

table

sample data:

with 
sample as (
select 1 as entity, 'DAVID' as customer, 100 as P1, 'hjk' as P2, null as P3 from dual
union all
select 2 as entity, 'DAVID' as customer, 100 as P1, 'heeee' as P2, null as P3 from dual
union all
select 1 as entity, 'Lloyd' as customer, null as P1, 'dfe' as P2, null as P3 from dual
union all
select 1 as entity, 'LLOYD' as customer, null as P1, null as P2, 'dfe' as P3 from dual
union all
select 1 as entity, 'MARK' as customer, 300 as P1, 'abc' as P2, null as P3 from dual
union all
select 2 as entity, 'MARC' as customer, 0 as P1, null as P2, 'abc' as P3 from dual
union all
select 3 as entity, 'Mark' as customer, 300 as P1, 'texttt' as P2, null as P3 from dual

)
select *  from sample;

CodePudding user response:

If I got it right the expected outcome should say how many correlations are present in the data for each customer. Column P1 is considered on its own and columns P2 and P3 define posible additional ties if they are cross related. Here is one possible solution using your initial sample data.
1. try to count relations on columns P1, P2 and P3 by creating cte ("sample_counts")

    sample_counts AS
        (
            Select
                s1.ENTITY "ENTITY",
                UPPER(s1.CUSTOMER) "CUSTOMER",
                Max(CASE WHEN s1.P1 Is Not Null THEN 1 ELSE 0 END) "CNT_P1",
                Max((Select Count(*) From sample Where P2 = s1.P3 )) "CNT_P2_P3",
                Max((Select Count(*) From sample Where P3 = s1.P2 )) "CNT_P3_P2",
                Max(Nvl((Select UPPER(CUSTOMER) From sample Where P2 = s1.P3 ), UPPER(s1.CUSTOMER))) "CUSTOMER_2"
            From
                sample s1
            Group By
                UPPER(s1.CUSTOMER), s1.ENTITY
            Order By 
                UPPER(s1.CUSTOMER), s1.ENTITY
        )
/*
    "sample_counts" (cte) resulting dataset

        ENTITY CUSTOMER     CNT_P1  CNT_P2_P3  CNT_P3_P2 CUSTOMER_2
    ---------- -------- ---------- ---------- ---------- ----------
             1 DAVID             1          0          0 DAVID      
             2 DAVID             1          0          0 DAVID      
             1 LLOYD             0          1          1 LLOYD      
             2 MARC              1          1          0 MARK       
             1 MARK              1          0          1 MARK       
             3 MARK              1          0          0 MARK       
*/

The resulting dataset gives us counts of relations (as 1 or 0) over different columns for each customer and did some naming unification in column CUSTOMER_2. Even from this dataset it is obvious that David has just one level of correlations (CNT_P1), that Lloyd has two levels by cross joinning CNT_P2 and CNT_P3 (none on CNT_P1), and finally that Mark (or Marc) has correlation on all three colummns. As dataset is grouped by, in the main SQL we should join it to your sample data in order to get all rows but now with Gx representing the group correlations as G1, G2, G3.
Here is the final SQL with the result:

SELECT
    sc.ENTITY "ENTITY",
    sc.CUSTOMER_2 "CUSTOMER",
    s.P1 "P1",
    s.P2 "P2",
    s.P3 "P3",
    'G' || To_Char(
                    Max(CNT_P1) OVER (PARTITION BY sc.CUSTOMER_2 ORDER BY sc.CUSTOMER_2)  
                    Max(CNT_P2_P3) OVER (PARTITION BY sc.CUSTOMER_2 ORDER BY sc.CUSTOMER_2)   
                    Max(CNT_P3_P2) OVER (PARTITION BY sc.CUSTOMER_2 ORDER BY sc.CUSTOMER_2) 
                  ) "GRP"
FROM
    sample_counts sc
INNER JOIN 
    sample s ON(s.ENTITY = sc.ENTITY And UPPER(s.CUSTOMER) = sc.CUSTOMER)
ORDER BY 
    CUSTOMER_2, ENTITY
--  
--  R e s u l t :
--  
--      ENTITY CUSTOMER         P1 P2     P3  GRP                                     
--  ---------- -------- ---------- ------ --- --------
--           1 DAVID           100 hjk        G1      
--           2 DAVID           100 heeee      G1      
--           1 LLOYD               dfe        G2      
--           1 LLOYD                      dfe G2      
--           1 MARK            300 abc        G3      
--           2 MARK              0        abc G3      
--           3 MARK            300 texttt     G3 

If you want to have the original customer name from your data then just select s.CUSTOMER here (instead of sc.CUSTOMER_2). Try it with your actual data. I realy do hope this would be usefull. Regards...

A D D I T I O N

Please try this code on your data. I added a few more rows (John and Anne) in the sample data and changed the code according to the comments. Hope it could help you now:

with 
    sample as 
        (
            select 1 as entity, 'DAVID' as customer, 100 as P1, 'hjk' as P2, null as P3 from dual   union all
            select 2 as entity, 'DAVID' as customer, 100 as P1, 'heeee' as P2, null as P3 from dual union all
            select 1 as entity, 'JOHN' as customer, 200 as P1, 'hjklm' as P2, null as P3 from dual  union all
            select 2 as entity, 'JOHN' as customer, 200 as P1, 'xdxxdxd' as P2, null as P3 from dual union all
            select 1 as entity, 'Lloyd' as customer, null as P1, 'dfe' as P2, null as P3 from dual  union all
            select 1 as entity, 'LLOYD' as customer, null as P1, null as P2, 'dfe' as P3 from dual  union all
            select 1 as entity, 'ANNE' as customer, null as P1, Null as P2, 'xls' as P3 from dual   union all
            select 2 as entity, 'ANNE' as customer, null as P1, 'xls' as P2, Null as P3 from dual   union all
            select 1 as entity, 'MARK' as customer, 300 as P1, 'abc' as P2, null as P3 from dual    union all
            select 2 as entity, 'MARC' as customer, 0 as P1, null as P2, 'abc' as P3 from dual      union all
            select 3 as entity, 'Mark' as customer, 300 as P1, 'texttt' as P2, null as P3 from dual
        ),
    sample_counts AS
        (
            Select
                ROWNUM "RN", s1.ENTITY "ENTITY", UPPER(s1.CUSTOMER) "CUSTOMER", CASE WHEN s1.P1 Is Not Null THEN 1 ELSE 0 END "CNT_P1",
                Nvl(To_Char(s1.P1), '0') "P1", (Select Count(*) From sample Where P2 = s1.P3 ) "CNT_P2_P3", Nvl(To_Char(s1.P2), '0') "P2",
                (Select Count(*) From sample Where P3 = s1.P2 ) "CNT_P3_P2", Nvl(To_Char(s1.P3), '0') "P3",
                Nvl((Select UPPER(CUSTOMER) From sample Where P2 = s1.P3 ), UPPER(s1.CUSTOMER)) "CUSTOMER_2"
            From  sample s1
            Order By RN
        ),
    p_data AS
        (
            Select RN, CUSTOMER_2, 'P1' "COL", To_Char(P1) "PX", To_Char(CNT_P1) "CNT_X" From sample_counts Union All
            Select RN, CUSTOMER_2, 'P2' "COL", To_Char(P2) "PX", To_Char(CNT_P2_P3) "CNT_X" From sample_counts Union All
            Select RN, CUSTOMER_2, 'P3' "COL", To_Char(P3) "PX", To_Char(CNT_P3_P2) "CNT_X" From sample_counts 
            Order By RN, COL
        ),
    matches AS
        (
            SELECT  p.RN, p.CUSTOMER_2, p.COL, p.PX, p.CNT_X, sc.P1, sc.P2, sc.P3
            FROM    p_data p
            INNER JOIN  sample_counts sc ON(sc.RN = p.RN)
            WHERE   p.CNT_X <> '0'
            ORDER BY p.RN
        )
SELECT DISTINCT
    sc.ENTITY, sc.CUSTOMER_2, sc.P1 "P1", sc.P2 "P2", sc.P3 "P3", 
    'G' || To_Char(Min(m.RN) OVER(PARTITION BY sc.CUSTOMER_2 ORDER BY sc.CUSTOMER_2)) "GRP"
FROM
    sample_counts sc
INNER JOIN
    matches m ON (m.RN = sc.RN)
ORDER BY 
    sc.CUSTOMER_2, sc.ENTITY
--  
--  R e s u l t :
-- 
--      ENTITY CUSTOMER_2 P1                                       P2      P3  GRP                                     
--  ---------- ---------- ---------------------------------------- ------- --- -----------------------------------------
--           1 ANNE       0                                        0       xls G7                                        
--           2 ANNE       0                                        xls     0   G7                                        
--           1 DAVID      100                                      hjk     0   G1                                        
--           2 DAVID      100                                      heeee   0   G1                                        
--           1 JOHN       200                                      hjklm   0   G3                                        
--           2 JOHN       200                                      xdxxdxd 0   G3                                        
--           1 LLOYD      0                                        0       dfe G5                                        
--           1 LLOYD      0                                        dfe     0   G5                                        
--           1 MARK       300                                      abc     0   G9                                        
--           2 MARK       0                                        0       abc G9                                        
--           3 MARK       300                                      texttt  0   G9       

  • Related