I am creating a system that pulls data from S3 buckets and Snowflake tables (I also have access to the Snowflake portal). I will be running data quality/data validation checks against this incoming data inside a Databricks notebook. My question is: when I pull this data in, I'll have to stage it somewhere to run those DQ checks. Does it make more sense to stage this data inside the Databricks portal or the Snowflake portal?
Thanks
What I've researched: databricks snowflake stage and architecture
CodePudding user response:
In general, it's a good idea to hold data as close to where it is being processed as possible. If Databricks is going to process the data directly, then stage the data in Databricks; if Databricks is going to push the processing down to Snowflake, then stage the data in Snowflake.
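Since you're running the DQ checks in a Databricks notebook, that usually means landing the incoming data as a table in Databricks and validating it there. As a minimal, library-free sketch of the kind of checks involved (the column names and the specific checks are hypothetical, not from your setup):

```python
# Hypothetical DQ checks run against staged rows (represented here as a
# list of dicts; in a real notebook this would be a staged table/DataFrame).

def run_dq_checks(rows, required_columns):
    """Run basic data-quality checks; returns a dict of check name -> pass/fail."""
    results = {}
    # The staged dataset should not be empty
    results["non_empty"] = len(rows) > 0
    # Every row must have a non-null value for each required column
    results["no_missing_values"] = all(
        row.get(col) is not None for row in rows for col in required_columns
    )
    # Example uniqueness check on a hypothetical "id" column
    ids = [row.get("id") for row in rows]
    results["unique_ids"] = len(ids) == len(set(ids))
    return results

# Example usage with illustrative staged data
staged = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": 7.0},
]
print(run_dq_checks(staged, ["id", "amount"]))
```

The same checks could instead be expressed as SQL and pushed down to Snowflake; the architectural question is only where the data sits while those checks run.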