Home > Software design >  Proper way of updating a whole table in Redshift, drop table create table vs. truncate insert in
Proper way of updating a whole table in Redshift, drop table create table vs. truncate insert in

Time:10-13

Currently I have many tables for which I have to update the information they hold, sometimes on a daily or weekly basis. So far, I've been doing this by a combination of DROP TABLE IF EXIST some_schema.some_table_name; followed by CREATE TABLE some_schema.some_table_name AS ( SELECT ... FROM ... WHERE ...); and I would like to know what is the "best-practice" or proper way of doing so.

I've read that INSERT operations in Redshift are quite expensive, so I've been avoiding its usage, but maybe the use of TRUNCATE with INSERT is better than dropping and creating.

How can I confirm which option is better?

I've seen this article from Redshift docs, but I'm not sure if it is the best option, since I could have not only to remove records, but keeping and inserting as well.

CodePudding user response:

I think you will need to use the Update command. I understand that drooping a table is a risky move, as you might loose all of your data from your database.

Update some_table_name s set
   s.Id="whatever you want to update",
   s.Name="whatever you want to update",
   s.LastName="whatever you want to update",
   s.OtherTableColumn="whatever you want to update"
From
   some_table_name s

In above code I assumed your table had for columns (1-Id, 2-Name, 3-LastName, 4-OtherTableColumn). If you have more or less columns then I would adjust accordingly.

I would also write a update procedure for this (and each table) so if you need to update somewhat frequently you just use the procedure; I think its quicker. Below would be my procedure:

Create Proc sp_UpdateSome_table_name
   @Id int,
   @name nvarchar(255),
   @lastname nvarchar(255),
   @OtherTableColumn int
AS
BEGIN
Update s some_table_name
       
   s.Name="whatever you want to update",
   s.LastName="whatever you want to update",
   s.OtherTableColumn="whatever you want to update"
From
   some_table_name s
Where
   s.Id=@Id
END

You want to make sure that each column in your table is defined with correct data type in the procedure. For example I assumed above that @Id was int, Name was nvarchar(255) etc. If you want to allow yourself not to enter any data (allowing null) in certain table columns when updating then after the data type you can write Null; for example if you write @Id int Null, then you can update is as null; but if you are not sure what this is, simply ignore this sentence for now.

Once you assured above paragraph is good (data types are correct), then select the entire procedure and then execute (F5). This will store this procedure.

Then I will write the procedure every time you want to update your table shown as below:

Exec sp_UpdateSome_table_name 1,John,Smith,77

If you highlight the above command and execute (f5) it then it will update the table which has Id=1 and it will make the name John, last name Smith and the other column 77 from whatever it was before. If there is no data in the table with Id=1 then you can execute.

Keep in mind the last rows of the codes might not have a comma. The above codes are written correctly, just pointing it out as you might put a comma out of habit.

CodePudding user response:

If your desire is to completely erase a table and replace the data then the general pattern you are following in fine. However, there are a few things you should be doing to make things safer / better.

There are 3 patterns to do this and one is clearly the lowest performance. These are Delete/Insert, Truncate/Insert, and Drop/Insert. Of these Delete/Create/Insert is NOT what you want to do from a performance point of view. This process invalidates all the rows in the table (not delete them) and adds new valid rows. This doubles the size of the table, wasting space, and needs to be vacuumed. The only upside of this approach is that it doesn't have the downsides of the other approaches but this only matters in certain situations. Go down this approach only if you have to.

Truncate/Insert is fast and maintains the same table id as the original table. Because truncate operates on the blocks of the table (unlinking them) it is fast but there is some small overhead in managing all the block links. Since the table definition is unchanged all DDL stays defined and dependent views can keep pointing to the table. The downside with truncate is that it forces a COMMIT to occur which means that until the table is repopulated with new data other users of the database can see an empty table. This can lead to incorrect results during these windows. Not good.

Lastly there is Drop/Create/Insert. This approach is marginally (very slightly and only for large tables) faster than truncate in the ideal case. It just throws away the old blocks. There is some additional cost to setting up the new table (of the same name) so truncate and drop are about the same speed unless the table is large. Since Drop can be inside of a transaction block the empty table won't be seen by third parties (if done correctly). The downside with this approach is that the old table and new table are entirely different tables (different oids) - they just happen to have the same name. This means that any dependent (regular) views will need to be dropped and recreated as well. Also since this table is "going away" the commit of the transaction cannot complete until all uses of the table are complete. This becomes a large problem when someone leaves a transaction open in their bench and goes home for the night. Since the tables needs to be recreated your process needs to know the complete and correct DDL for the table.

Hopefully this gives you some idea of when to use these different approaches. Two things I see that could be better in your current code - 1) You are not using a transaction block (as far as I can tell) so there is a window when others will see that the tables doesn't exists or is empty. This may or may not be important to you but be aware. 2) "Create table As" doesn't define the DDL of the table in performant structure (and possibly incorrectly). You should always specify your permanent tables fully. Sort and Dist keys matter as do varchar lengths, data types etc. This is a time bomb waiting to go off.

  • Related