I have a table in Snowflake where, in several date columns, empty values come through as the default 1900-01-01. I import the data and then manually change these values to NULL in R on my machine. However, since I am dealing with 30M rows, I want to try to do this in Snowflake rather than on my local machine, since it takes forever.
I know there is a replace() function with which I can manually reference each column and replace 1900-01-01 with NULL. However, is there a way to reference all columns with data type DATE and then run this replace() on each of them?
In R we have tidyselect verbs, so in a dataframe we can dynamically reference many columns based on patterns in the column name or on the column type. I'm looking to see if there is something similar in SQL.
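For reference, the local R pattern I mean looks something like this (a sketch on a hypothetical dataframe df):
library(dplyr)
library(lubridate)

# Reference every DATE column at once and NULL out the 1900-01-01 placeholder
df <- df %>%
  mutate(across(where(is.Date), ~ na_if(.x, as.Date("1900-01-01"))))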
CodePudding user response:
Let's do some magic with Python and Snowpark, since it gives us a simple way of dealing with multiple columns, as the question asks.
But first, let's set up a table where we want to replace one value with null:
create or replace table sample_product_data
as
select 'a' a, 'b' b, 'c' c
union all select 'x', 'this is null', 'z'
Then this is a Python stored procedure in Snowflake that will take any value in that table equal to 'this is null' and replace it with a NULL:
create or replace temporary procedure replace_this_is_null()
returns VARIANT
language python
runtime_version=3.8
packages=('snowflake-snowpark-python')
handler='main'
as
$$
import snowflake.snowpark as snowpark

def main(session: snowpark.Session):
    tbn = 'sample_product_data'
    # Replace matching values in any column, then overwrite the source table
    session.table(tbn).replace('this is null', None).write.mode('overwrite').save_as_table(tbn)
    return 'done'
$$;
Then you can call it with call replace_this_is_null() and it will work as expected.
Now, since the question wants to replace a date: just import datetime and, instead of a string, pass datetime.date(1900, 1, 1) as the value to replace.
CodePudding user response:
You can do this in Snowflake using R's tidyverse packages, which you're already familiar with.
The dbplyr package extends the dplyr package to support converting dplyr verbs to their SQL equivalents and executing them in the database; dbplyr supports Snowflake as a backend for in-database execution.
To demonstrate, we'll start with the data example provided by Felipe Hoffa.
library(odbc)
library(DBI)
library(dbplyr)
library(dplyr)
library(lubridate)
# Snowflake Database Connection details
server <- "<your snowflake account here>"  # e.g. "demo43.snowflakecomputing.com"
uid <- "<your user name>"
database <- "<your database>"
schema <- "<your schema>"
warehouse <- "<your virtual warehouse>"
pwd <- "<your password>"
# Obtain ODBC Connection
con <- dbConnect(odbc::odbc(),
                 .connection_string = sprintf(
                   "Driver={Snowflake};server={%s};uid={%s};pwd={%s};database={%s};schema={%s};warehouse={%s}",
                   server, uid, pwd, database, schema, warehouse),
                 timeout = 10)
# Create a tbl referencing Felipe's sample database table in Snowflake
df_product <- tbl(con, "SAMPLE_PRODUCT_DATA")
# First we will get the data to the client R environment to show dplyr
# functionality running on a local dataframe.
(df_product_local <- df_product %>% collect())
#> # A tibble: 2 × 3
#> A B C
#> <chr> <chr> <chr>
#> 1 a b c
#> 2 x this is null z
Now use dplyr verbs to convert the value 'this is null' to NA on the local dataframe:
df_product_local %>% mutate(across(everything(), ~na_if(., 'this is null')))
#> # A tibble: 2 × 3
#> A B C
#> <chr> <chr> <chr>
#> 1 a b c
#> 2 x NA z
and execute the same code, replacing the local dataframe with the tbl that references the Snowflake table:
df_product %>% mutate(across(everything(), ~na_if(., 'this is null')))
#> # Source: SQL [2 x 3]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#> A B C
#> <chr> <chr> <chr>
#> 1 a b c
#> 2 x NA z
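To see what dbplyr actually sends to Snowflake, you can append show_query() to the same pipeline. dbplyr translates na_if() to SQL NULLIF(), so the generated query should look roughly like the following (exact formatting varies by dbplyr version):
df_product %>%
  mutate(across(everything(), ~na_if(., 'this is null'))) %>%
  show_query()
#> <SQL>
#> SELECT
#>   NULLIF("A", 'this is null') AS "A",
#>   NULLIF("B", 'this is null') AS "B",
#>   NULLIF("C", 'this is null') AS "C"
#> FROM "SAMPLE_PRODUCT_DATA"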
And if you want to process the transformation in Snowflake and return the cleaned result to your local R environment for further local processing:
df_product_cleaned <- df_product %>%
mutate(across(everything(), ~na_if(., 'this is null'))) %>%
collect()
head(df_product_cleaned)
#> # A tibble: 2 × 3
#> A B C
#> <chr> <chr> <chr>
#> 1 a b c
#> 2 x NA z
Now let's apply the same approach to the original date problem you have.
# First we create a table with mixed data: character and date columns.
mix_tblname = "SAMPLE_MIXED"
sql_ct <- sprintf("create or replace table %s as
select 'a' a, 'b' b, 'c' c,
'1900-01-01'::DATE x, '2022-08-17'::DATE y, '1900-01-01'::DATE z
union all
select 'x', 'this is null', 'z',
'2022-08-17'::DATE, '1900-01-01'::DATE, '2022-08-15'::DATE",
mix_tblname )
dbExecute(con, sql_ct)
# And reference the new table with a database tbl
df_mixed <- tbl(con, mix_tblname)
df_mixed_local <- df_mixed %>% collect()
# Check the raw data looks OK
head(df_mixed)
#> # Source: SQL [2 x 6]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#> A B C X Y Z
#> <chr> <chr> <chr> <date> <date> <date>
#> 1 a b c 1900-01-01 2022-08-17 1900-01-01
#> 2 x this is null z 2022-08-17 1900-01-01 2022-08-15
The code below fails because we have columns of mixed types, and the non-DATE columns cannot be coerced to a DATE:
df_mixed %>% mutate(across(everything(), ~na_if(., TO_DATE('1900-01-01', 'YYYY-MM-DD'))))
We could instead implicitly convert all columns to character and evaluate the comparison as a character expression:
df_mixed %>% mutate(across(everything(), ~na_if(.,'1900-01-01')))
#> # Source: SQL [2 x 6]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#> A B C X Y Z
#> <chr> <chr> <chr> <date> <date> <date>
#> 1 a b c NA 2022-08-17 NA
#> 2 x this is null z 2022-08-17 NA 2022-08-15
Although this works, it will also match other column types that contain the same value, which you may not want. So we need a way of identifying the DATE columns.
Here's how I can do that on a local dataframe:
df_mixed_local %>% mutate(across(where(~ is.Date(.x)), ~na_if(.,'1900-01-01')))
#> # A tibble: 2 × 6
#> A B C X Y Z
#> <chr> <chr> <chr> <date> <date> <date>
#> 1 a b c NA 2022-08-17 NA
#> 2 x this is null z 2022-08-17 NA 2022-08-15
But it doesn't work for a database tbl. You can see that the SQL generated here is clearly missing the column-wise transformations:
df_mixed %>% mutate(across(where(~ is.Date(.x)), ~na_if(.,'1900-01-01'))) %>% show_query()
#> <SQL>
#> SELECT *
#> FROM "SAMPLE_MIXED"
I tried a few things but couldn't find a TIDY way of filtering on the DATE types, so instead we can get a vector of the date columns from Snowflake's Information Schema:
## Switch session to the Information Schema
dbExecute(con, 'USE SCHEMA INFORMATION_SCHEMA')

dateCols <- tbl(con, 'COLUMNS') %>%
  filter(TABLE_CATALOG == database,
         TABLE_SCHEMA == schema,
         TABLE_NAME == mix_tblname,
         DATA_TYPE == 'DATE') %>%
  arrange(ORDINAL_POSITION) %>%  # sort before select() drops ORDINAL_POSITION
  select(COLUMN_NAME) %>%
  pull()

## Switch session back to our data schema
dbExecute(con, sprintf('USE SCHEMA %s', schema))
Now, using dateCols, we can selectively apply our transformation to only the DATE columns:
df_mixed %>% mutate(across(all_of(dateCols), ~na_if(.,TO_DATE('1900-01-01', 'YYYY-MM-DD'))))
#> # Source: SQL [2 x 6]
#> # Database: Snowflake 6.28.0[SFIELD@Snowflake/SF_TEST]
#> A B C X Y Z
#> <chr> <chr> <chr> <date> <date> <date>
#> 1 a b c NA 2022-08-17 NA
#> 2 x this is null z 2022-08-17 NA 2022-08-15
If anyone finds the TIDY way of applying a DATE data-type filter over the input columns, I'd be interested to see it.
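One possible workaround (an untested sketch) is to collect a single row locally, where tidyselect's where(is.Date) does work, and use it to infer the DATE column names before feeding them back through all_of():
# Pull one row locally so tidyselect can inspect the column types
dateCols <- df_mixed %>%
  head(1) %>%
  collect() %>%
  select(where(is.Date)) %>%
  colnames()

df_mixed %>%
  mutate(across(all_of(dateCols), ~ na_if(., TO_DATE('1900-01-01', 'YYYY-MM-DD'))))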