I am creating a system that pulls data from S3 buckets and Snowflake tables (I also have access to the Snowflake portal). I will be running data quality/data validation checks against this incoming data inside a Databricks notebook. My question is: when I pull this data in, I'll have to stage it somewhere to run those DQ checks. Does it make more sense to stage this data inside the Databricks portal or the Snowflake portal?
Thanks
What I've researched: databricks snowflake stage and architecture
CodePudding user response:
In general, it's a good idea to hold data as close to where it is being processed as possible. If Databricks is going to process the data directly, then stage the data in Databricks; if Databricks is going to push the processing down to Snowflake, then stage the data in Snowflake.
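Since you're running the DQ checks in a Databricks notebook, that usually means landing the incoming data as a table in Databricks and validating it there. As a minimal, library-free sketch of the kind of checks involved (the column names and the specific checks are hypothetical, not from your setup):

```python
# Hypothetical DQ checks run against staged rows (represented here as a
# list of dicts; in a real notebook this would be a staged table/DataFrame).

def run_dq_checks(rows, required_columns):
    """Run basic data-quality checks; returns a dict of check name -> pass/fail."""
    results = {}
    # The staged dataset should not be empty
    results["non_empty"] = len(rows) > 0
    # Every row must have a non-null value for each required column
    results["no_missing_values"] = all(
        row.get(col) is not None for row in rows for col in required_columns
    )
    # Example uniqueness check on a hypothetical "id" column
    ids = [row.get("id") for row in rows]
    results["unique_ids"] = len(ids) == len(set(ids))
    return results

# Example usage with illustrative staged data
staged = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": 7.0},
]
print(run_dq_checks(staged, ["id", "amount"]))
```

The same checks could instead be expressed as SQL and pushed down to Snowflake; the architectural question is only where the data sits while those checks run.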