I've seen similar questions for other DBs but not sure if they apply to Redshift.
I have an update statement like this and need to apply to many columns:
update t1 set col1 = '' where col1 is null;
Right now I'm doing this for 30 columns. Is there a way to do dynamically or from a list?
Can I do in a stored procedure or function?
CodePudding user response:
The problem here is two fold. You want to find the list of text columns AND you don't want to run N updates as this could create a lot of invalid rows, loss of space, and need to vacuum.
This second concern is because Redshift is a columnar database and doesn't really UPDATE rows, it invalidates the existing row and adds a new row. If you are cleaning up NULLs in a lot of rows in a large table this can make for many invalid rows which need to be vacuumed. This is made worse if you are running many UPDATEs, one per column, as each UPDATE could be making a new version (with invalid rows left behind) of the same row with each UPDATE. In this case you want to run only one UPDATE that updates all the columns and all the rows at once is best. (code below)
However, if you are updating only a few rows in the table (low NULL count) then UPDATING only those rows that have NULLs will be better. In this situation you do want add a WHERE clause to prevent making exact copies of rows that don't have NULLs. This is simple enough to do:
Setting up a test case:
create table fred (
col_a varchar(8),
col_b varchar(8),
col_c varchar(8));
insert into fred values
('a', NULL, NULL),
(NULL, 'b', NULL),
(NULL, NULL, 'c');
Now if we want to update all columns for all rows we can run:
update fred set
col_a = nvl(col_a,''),
col_b = nvl(col_b,''),
col_c = nvl(col_c,'')
where col_a is null or col_b is null or col_c is null;
Moving back to the first part of the question - how to generate a column list and dynamically make this query?. I like to have external DDL definition files for Redshift as this makes linking Redshift data and other AWS data easier. One source file from which Redshift DDL and Athena DDL and any other DDL can be generated. But this isn't likely your situation.
Table DDL is accessible in Redshift in a number of systems tables. The easiest to access in this case is likely information_schema.columns. Like:
select column_name, data_type from information_schema.columns where table_schema = 'public' and table_name = 'fred';
This will give you the names and data types for the columns in any table. Since you are setting the value to '' this process will only work for text data types so you may need an additional WHERE clause for the data types you are looking to change.
Now with the column list in hand you will need to create the query above. This is simply looping through the columns and generating the string parts of the SELECT portion and the WHERE portion of the query above. These a large number of ways to do this. I personally like jinja2 for applying lists to templates to produce configured text (like a query). I also like to keep Redshift focused on performing heavy analytic queries and not have it be an operating environment so I lean away from stored procedures. However, there is no reason why this process cannot be coded into a stored procedure if you like. If I needed to run this data cleansing process regularly I'd code it up as part of my ETL (ETL tool issues the compiled query) or a Lambda function. But this is just me.