Home > other >  PostgreSQL : check if value is in 2 columns and remove it from one of them
PostgreSQL : check if value is in 2 columns and remove it from one of them

Time:11-06

I have two columns ind and tar that both contain arrays.

 ind      tar
{10}      {10}
{6}       {5,6}
{4,5,6}   {5,6}
{5,6}     {5,6}
{7,8}     {11}
{11}      {5,6,7}
{11}      {8}
{9,10}    {6}

I want to find if one value exists in both arrays, and if that's true, I want to keep it only at column ind. For example, at the first row I have the value 10 in both columns. I want to end up with this value only in column ind and leave column tar empty. This is the expected result:

 ind      tar      
{10}      
{6}       {5}
{4,5,6}   
{5,6}     
{7,8}     {11}
{11}      {5,6,7}
{11}      {8}
{9,10}    {6}

How can I do that in PostgreSQL?

So far I only managed to find the common elements, but I don't know how to continue with keeping them only at ind column and remove them from tar column.

with t1 as (
select distinct ind, tar
from table_1
join table_2 using (id) 
limit 50
),
t2 as (
select ind & tar as common_el, ind , tar 
from t1
)
select *
from t2
<iframe name="sif1" sandbox="allow-forms allow-modals allow-scripts" frameborder="0"></iframe>

which results into this:

   common_el   ind        tar
    {10}      {10}       {10}
    {6}       {6}       {5,6}
    {5,6}     {4,5,6}   {5,6}
    {5,6}     {5,6}     {5,6}

CodePudding user response:

The & operator you are using is from the intarray module which also allows you to use - to remove elements in one array from another.

For eg.

select
    ind,
    tar,
    ind & tar as common_el,
    tar - (ind & tar) as  new_tar
from
    table_1
ind tar common_el new_tar
{10} {10} {10} {}
{6} {5,6} {6} {5}
{4,5,6} {5,6} {5,6} {}
{5,6} {5,6} {5,6} {}
{7,8} {11} {} {11}
{11} {5,6,7} {} {5,6,7}
{11} {8} {} {8}
{9,10} {6} {} {6}

View working demo db fiddle here

CodePudding user response:

Introduction:

I decided to tackle this using only PostgreSQL's standard functionality - i.e. without using the INTARRAY extension - nothing wrong with using it, just fancied a challenge this afternoon! There's also a simplified method using INTARRAY functionality - simpler than the accepted answer!

Method 1 (standard array functionality with CASE - fiddle):

SELECT
   (ind && tar) AS overlaps,
   ind, tar,
   ind - tar AS ind_m_tar,
   tar - ind AS tar_m_ind,
  CASE
    WHEN ind = tar THEN ind                        -- 1
    WHEN NOT (ind && tar) THEN ind                 -- 2

    WHEN ind <@ tar 
      AND (CARDINALITY (tar) > CARDINALITY (ind))
        THEN ind                                   -- 3                           
    
    WHEN tar <@ ind 
      AND (CARDINALITY (ind) > CARDINALITY (tar))
        THEN ind                                   -- 4
      
  END AS new_ind,

  CASE
    WHEN ind = tar THEN NULL                       -- 1
    WHEN NOT (ind && tar) THEN tar                 -- 2
    WHEN ind <@ tar 
      AND (CARDINALITY (tar) > CARDINALITY (ind))
        THEN tar - ind                             -- 3     
    WHEN tar <@ ind 
      AND (CARDINALITY (ind) > CARDINALITY (tar))
        THEN NULL                                  -- 4
        
  END AS new_tar
FROM test;

Result:

overlaps     ind      tar   ind_m_tar   tar_m_ind   new_ind new_tar
       t    {10}     {10}          {}          {}      {10}    NULL
       t     {6}    {5,6}          {}         {5}       {6}     {5}
       t {4,5,6}    {5,6}         {4}          {}   {4,5,6}    NULL 
       t   {5,6}    {5,6}          {}          {}     {5,6}    NULL
       f   {7,8}     {11}       {7,8}        {11}     {7,8}    {11}
       f    {11}  {5,6,7}        {11}     {5,6,7}      {11} {5,6,7}
       f    {11}      {8}        {11}         {8}      {11}     {8}
       f  {9,10}      {6}      {9,10}         {6}    {9,10}     {6}

So, we can see that by combining standard ARRAY functions, we can SELECT the necessary fields using a CASE expression - obviously, you can only SELECT the fields that are relevant to you - I left the others in so that the logic could be followed.

Second method (using UNNEST and INTERSECT - fiddle):

First thing is to obtain the values to be eliminated from the tar arrays - we use the often neglected INTERSECT operator to do this.

--
-- Get the values to be deleted from the tar arrays, by ROW_NUMBER()
--

WITH cte1 AS 
(
  SELECT
    ROW_NUMBER() OVER () AS rn,
    UNNEST(ind) AS un_ind,
    UNNEST(tar) AS un_tar
  FROM test
)
SELECT rn, un_ind AS x FROM cte1
INTERSECT
SELECT rn, un_tar FROM cte1
ORDER BY rn, x;

Result:

rn    x
 1   10
 2    6
 3    5
 3    6
 4    5
 4    6

We can see that this is correct - for the 1st ind, tar array, we want to remove the 2nd (i.e. tar) value of 10 - for the second array, we wish to remove the duplicated 6 value, and so on.

So, our next query is:

WITH cte1 AS 
(
  SELECT
    ROW_NUMBER() OVER () AS rn,
    UNNEST(ind) AS un_ind,
    UNNEST(tar) AS un_tar
  FROM test
), 
cte2 AS
(
  SELECT rn, un_ind AS x FROM cte1
  INTERSECT
  SELECT rn, un_tar FROM cte1
  ORDER BY rn, x
),
cte3 AS
(
  SELECT
    rn,
    un_ind,
    CASE
      WHEN (cte1.rn, cte1.un_tar) = 
      (
        SELECT rn, x 
        FROM cte2 
        WHERE rn = cte1.rn 
        AND x = cte1.un_tar
      ) THEN NULL
      ELSE un_tar
    END 
  FROM cte1
)
SELECT * FROM cte3;

Result:

rn  un_ind  un_tar
 1      10    NULL
 2       6       5
 2    NULL    NULL
 3       4    NULL
 3       5    NULL
 3       6    NULL
 4       5    NULL
 4       6    NULL
 5       7      11
 5       8    NULL
 6      11       5
 6    NULL       6
 6    NULL       7
 7      11       8
 8       9       6
 8      10    NULL  

The final step:

WITH cte1 AS 
(
  SELECT
    ROW_NUMBER() OVER () AS rn,
    UNNEST(ind) AS un_ind,
    UNNEST(tar) AS un_tar
  FROM test
), 
cte2 AS
(
  SELECT rn, un_ind AS x FROM cte1
  INTERSECT
  SELECT rn, un_tar FROM cte1
  ORDER BY rn, x
)
SELECT rn, 
  ARRAY_AGG(un_ind ORDER BY un_ind), 
  CASE
    WHEN ARRAY_REMOVE(ARRAY_AGG(un_tar ORDER BY un_tar), NULL) = '{}' THEN NULL
    ELSE ARRAY_REMOVE(ARRAY_AGG(un_tar ORDER BY un_tar), NULL)
  END
FROM
(
  SELECT
    rn,
    un_ind,
    CASE
      WHEN (cte1.rn, cte1.un_tar) = 
      (
        SELECT rn, x 
        FROM cte2 
        WHERE rn = cte1.rn 
        AND x = cte1.un_tar
      ) THEN NULL
      ELSE un_tar
    END 
  FROM cte1
) AS tab
WHERE un_ind IS NOT NULL
GROUP BY rn;

Result:

rn  array_agg   array_remove
 1       {10}           NULL
 2        {6}            {5}
 3    {4,5,6}           NULL
 4      {5,6}           NULL
 5      {7,8}           {11}
 6       {11}            {5}
 7       {11}            {8}
 8     {9,10}            {6}

Et voilà - the desired result!

Method 3 (Simplified INTARRAY - fiddle):

SELECT
  ind,
  CASE
    WHEN (tar - ind) = '{}' THEN NULL
    ELSE (tar - ind)
  END
FROM
  test;

Same results as above!

A performance analysis (for what it's worth) is here - my simplified INTARRAY appears to be the quickest... ( 1 for question that got me thinking!)...

  • Related