Home > Net >  Complicated SQL for update and delete on a large table
Complicated SQL for update and delete on a large table

Time:03-29

I am having a complicated SQL problem in PostgreSQL.

Suppose I have a large table called 'selling_prices'. It contains around 19 million rows. I want to remove some duplicate rows and also update some data. Here is the table structure:

seq customer_co_cd item_sku seliing_tanka_rate updatedate
1 1414343 sku001 0.4 2021-01-18 14:34:48
2 1414343 sku001 0.4 2021-01-18 14:34:48
3 1414343 sku001 0.4 2021-01-16 01:34:48
4 1512333 sku002 0.2 2021-01-16 01:34:48
5 1512333 sku002 0.5 2021-01-16 01:34:48

and so on....

Condition 1: If the customer_co_cd and item_sku and selling_tanka_rate is same update the latest updatedate to '2021/11/12' and delete the other data.

After the SQL table should be like: delete seq(2,3) and update seq 1

seq customer_co_cd item_sku seliing_tanka_rate updatedate
1 1414343 sku001 0.4 2021-11-12 00:00:00

Condition 2: If the (customer_co_cd and item_sku) is same and selling_tanka_rate is different then get the data as group

customer_co_cd item_sku count
1512333 sku002 2

I tried some query using group by but it is slow...

SELECT customer_co_cd, item_sku, COUNT(*) 
FROM selling_prices 
GROUP BY customer_co_cd,item_sku 
HAVING COUNT(*) > 1

I don't know how to query the condition 1. Also what is the efficient way to get condition 2. Keep in mind that there are around 19 million data.

Should I create a script or is there a efficient query I can use.

CodePudding user response:

This should answers your needs (result here)

select * from t1 order by seq;

-- Update first
with t2 as (
  select 
    row_number() over (partition by customer_co_cd,item_sku,seliing_tanka_rate order by customer_co_cd,item_sku,seliing_tanka_rate,seq) as rn,
    lead(seliing_tanka_rate) over (partition by customer_co_cd,item_sku order by customer_co_cd,item_sku,seq) as lead,
    * 
  from t1)
update t1
set updatedate = '20211112'
from t2
where t2.seq = t1.seq and t2.rn = 1
and t2.seliing_tanka_rate = t2.lead;

-- delete to keep the wanted records
with t2 as (select row_number() over (partition by customer_co_cd,item_sku,seliing_tanka_rate order by customer_co_cd,item_sku,seliing_tanka_rate,seq) as rn,* from t1)
delete 
from t1
where seq in (select seq from t2 where rn > 1);

select * from t1 order by seq;

-- Condition 2
with t2 as (
  select *,
  lead(customer_co_cd) over (partition by customer_co_cd,item_sku) as co_cd,
  lead(item_sku) over (partition by customer_co_cd,item_sku) as sku,
  lead(seliing_tanka_rate) over (partition by customer_co_cd,item_sku) as rate
  from t1
  )
select customer_co_cd,item_sku,
count(*) filter (where customer_co_cd = t2.co_cd and item_sku = t2.sku and seliing_tanka_rate <> t2.rate)   1 as count
from t2
group by customer_co_cd,item_sku
having count(*) filter (where customer_co_cd = t2.co_cd and item_sku = t2.sku and seliing_tanka_rate <> t2.rate)   1 > 1

CodePudding user response:

I think Philippe has answered your question but I'll add some slightly different approaches.

create temporary table orders (
    seq serial primary key,
  customer_co_cd int,
  item_sku varchar,
  selling_tanka_rate float,
  updated_at date
);

insert into orders (seq, customer_co_cd, item_sku, selling_tanka_rate, updated_at) values
(1 , 1414343, 'sku001', 0.4, '2021-01-18'),
(2 , 1414343, 'sku001', 0.4, '2021-01-18'),
(3 , 1414343, 'sku001', 0.4, '2021-01-16'),
(4 , 1512333, 'sku002', 0.2, '2021-01-16'),
(5 , 1512333, 'sku002', 0.5, '2021-01-16')
;

with ranked_orders as (
    select 
        orders.*,
      row_number() over(partition by customer_co_cd, item_sku, selling_tanka_rate order by updated_at DESC, seq) as recent_updated_at
   from orders
)
update orders
set updated_at = '2021-11-12' 
from ranked_orders 
where 
    orders.seq = ranked_orders.seq AND 
  ranked_orders.recent_updated_at = 1
;

select * from orders order by seq ASC; 

/*
 seq | customer_co_cd | item_sku | selling_tanka_rate | updated_at
----- ---------------- ---------- -------------------- ------------
   1 |        1414343 | sku001   |                0.4 | 2021-11-12
   2 |        1414343 | sku001   |                0.4 | 2021-01-18
   3 |        1414343 | sku001   |                0.4 | 2021-01-16
   4 |        1512333 | sku002   |                0.2 | 2021-11-12
   5 |        1512333 | sku002   |                0.5 | 2021-11-12
*/

delete from orders 
where orders.updated_at <> '2021-11-12' 

select * from orders order by seq ASC; 

/*
 seq | customer_co_cd | item_sku | selling_tanka_rate | updated_at
----- ---------------- ---------- -------------------- ------------
   1 |        1414343 | sku001   |                0.4 | 2021-11-12
   4 |        1512333 | sku002   |                0.2 | 2021-11-12
   5 |        1512333 | sku002   |                0.5 | 2021-11-12
*/


select t.* from 
( select customer_co_cd, 
         item_sku, 
         count(distinct selling_tanka_rate) as count
  from orders
  group by (customer_co_cd, item_sku)
) as t
where t.count > 1 //  you may want to remove this. Not sure of your exact requirements.

/*
 customer_co_cd | item_sku | count
---------------- ---------- -------
        1512333 | sku002   |     2
*/
  • Related