Home > Software design >  Delete old rows not present in new feed
Delete old rows not present in new feed

Time:10-31

I've a data feed that updates daily and I need to insert it into a table. I'd like to update rows with the same id but with different data, add non existent data and delete rows that don't exist anymore. I'm having problems with the last task, I can't seem to delete the rows from previous runs that don't exist in the new feed anymore.

This is what I have, the "listings" table with some test data:

INSERT INTO public.listings(
id, price, pub_date, rooms, publisher, updated_date)
VALUES 

(1, 100, '2021-05-03', 3, 'mf', '2021-01-01'),
(2, 200, '2021-05-03', 4, 'pmf', '2021-01-01'),
(3, 300, '2021-05-04', 5, 'jorge', '2021-01-01'),
(4, 400, '2021-05-05', 3, 'yuyu', '2021-01-01'),
(5, 500, '2021-05-06', 3, 'papo', '2021-01-01');

A new insert clause (update to the feed):

INSERT INTO listings AS p (id, price, pub_date, rooms, publisher, updated_date)
VALUES
  (1, 125, '2021-04-30', 4, 'mf', '2021-10-30')
, (2, 200, '2021-05-03', 4, 'pmf', '2021-01-01')
, (6, 250, '2021-04-30', 3, 'mf', '2021-10-30')
ON CONFLICT (id)  -- simple clause
DO UPDATE
SET    pub_date = EXCLUDED.pub_date
     , price     = EXCLUDED.price
     , rooms     = EXCLUDED.rooms
     , updated_date = EXCLUDED.updated_date
WHERE (p.pub_date, p.price, p.rooms, p.updated_date) IS DISTINCT FROM
      (EXCLUDED.pub_date, EXCLUDED.price, EXCLUDED.rooms, EXCLUDED.updated_date);

And this is the trigger:

CREATE OR REPLACE FUNCTION trg_prop_backup()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   INSERT INTO old_listings 
   VALUES (OLD.*);
   
   DELETE FROM listings --this sentence generates an error on id duplicity
   WHERE listings.id = OLD.id;
   
   RETURN NEW;
END
$func$;

CREATE TRIGGER prop_listings_upd_bef
BEFORE UPDATE OR DELETE ON listings
FOR EACH ROW EXECUTE FUNCTION trg_prop_backup();

This already works to update the existing rows and insert the new ones, but it doesn't work for the rows that are not existant anymore on the update feed. There's something I'm doing wrong (or not doing) but I can't seem to come up with the solution.

The "old_listings" table contains the rows that doesn't exist on the new feed, because the data is updated or deleted.

The result of running the two insert queries has to be (for the listings table):

  (1, 125, '2021-04-30', 4, 'mf', '2021-10-30')
, (2, 200, '2021-05-03', 4, 'pmf', '2021-01-01')
, (6, 250, '2021-04-30', 3, 'mf', '2021-10-30')

And the old_listings table:

(1, 100, '2021-05-03', 3, 'mf', '2021-01-01'),
(3, 300, '2021-05-04', 5, 'jorge', '2021-01-01'),
(4, 400, '2021-05-05', 3, 'yuyu', '2021-01-01'),
(5, 500, '2021-05-06', 3, 'papo', '2021-01-01');

CodePudding user response:

Here is my suggestion:

  • Add an extra timestamp field series_ts in table listings and set its value to current_timestamp upon insert or update;
  • remove the where clause of your insert on conflict update query so that series_ts will be updated even for equivalent records;
  • delete all listings records whose series_ts is not equal to current_timestamp in the same transaction with the insert query.

In order to do this first create a 'before insert or update for each row' trigger that executes this function:

create function set_listings_series_ts() returns trigger language plpgsql as $$
begin
 new.series_ts := current_timestamp;
 return new;
end; $$;

After you have this in place run:

BEGIN TRANSACTION;

INSERT INTO listings AS p (id, price, pub_date, rooms, publisher, updated_date)
VALUES
  (1, 125, '2021-04-30', 4, 'mf', '2021-10-30')
, (2, 200, '2021-05-03', 4, 'pmf', '2021-01-01')
, (6, 250, '2021-04-30', 3, 'mf', '2021-10-30')
ON CONFLICT (id)  -- simple clause
DO UPDATE
SET    pub_date = EXCLUDED.pub_date
     , price     = EXCLUDED.price
     , rooms     = EXCLUDED.rooms
     , updated_date = EXCLUDED.updated_date; -- your query w/o WHERE

WITH t as 
(
  delete from listings 
  where series_ts < current_timestamp 
  returning *
)
insert into old_listings select * from t; -- delete old data, save in old_listings

COMMIT TRANSACTION;

You may do this w/o a trigger too if you change the 'insert ... update' query.

CodePudding user response:

My suggestion :

First delete all the rows in the table listings, and create the corresponding rows in the table old_listings :

WITH list AS
(  DELETE FROM listings
    WHERE true
  RETURNING id, price, pub_date, rooms, publisher, updated_date
)
INSERT INTO old_listings (id, price, pub_date, rooms, publisher, updated_date)
SELECT id, price, pub_date, rooms, publisher, updated_date
FROM list ;

Then insert the new rows in table listings and delete the corresponding rows in the table old_listing by trigger :

INSERT INTO listings AS p (id, price, pub_date, rooms, publisher, updated_date)
VALUES
  (1, 125, '2021-04-30', 4, 'mf', '2021-10-30')
, (2, 200, '2021-05-03', 4, 'pmf', '2021-01-01')
, (6, 250, '2021-04-30', 3, 'mf', '2021-10-30')

CREATE OR REPLACE FUNCTION trg_del_bacckup()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   DELETE FROM old_listings
   WHERE id = NEW.id;
   RETURN NEW;
END
$func$;

CREATE TRIGGER trg_del_bacckup
AFTER INSERT ON listings
FOR EACH ROW EXECUTE FUNCTION trg_del_bacckup();

PS : you also have to manage the conflicts in the table old_listings when successive feed try to insert rows with ids which already exist

  • Related