I'm using DB Connect 9.1.9, and my cluster is a single-node 9.1 LTS cluster (for test purposes).
My data is stored in S3 as a Delta table.
Running the following:
df = spark.sql("select * from <table_name> where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'")
When I run it with DB Connect I get 31 rows; when I run it in a Databricks Workspace, 462.
Of course you can't check those numbers yourself; I just want to understand why there is a difference.
If I remove the condition on runDate, I get matching results on both platforms, so I deduced that the runDate filter is at fault, but I can't figure out why.
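One diagnostic worth running (a sketch, assuming spark is the active session in each environment) is to compare the session time zone on both sides, since timestamp string literals are interpreted in the session time zone:

# Run this both through DB Connect and in a Workspace notebook.
# spark.sql.session.timeZone controls how timestamp literals
# like '2022-01-10 14:00:00' are parsed.
print(spark.conf.get("spark.sql.session.timeZone"))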
The schema:
StructType(List(StructField(id,StringType,false),
StructField(runDate,TimestampType,true)))
I get the same explain plan on both platforms, too.
Did I miss something about timestamp usage?
Update 1: funnily enough, when I put the count() directly inside the SQL (spark.sql("SELECT count(*) ...")), I still get 31. It might be the way db-connect translates the query for the cluster.
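To spell out the comparison (a sketch; <table_name> stays a placeholder and spark is the DB Connect session):

# Counting on the client side after building the DataFrame...
df = spark.sql("select * from <table_name> where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'")
print(df.count())  # 31 through DB Connect

# ...and counting inside the SQL itself return the same number,
# so where the aggregation happens is not the issue.
print(spark.sql("select count(*) from <table_name> where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'").collect()[0][0])  # also 31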
Answer:
The problem was the time zone associated with the Spark session.
Add this right after your Spark session declaration (assuming your dates are stored in UTC):
spark.conf.set("spark.sql.session.timeZone", "UTC")
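Put together, a minimal sketch of the fix (with databricks-connect 9.x installed, the standard pyspark builder returns a session backed by the remote cluster; <table_name> remains a placeholder):

from pyspark.sql import SparkSession

# DB Connect intercepts this call and connects to the configured cluster.
spark = SparkSession.builder.getOrCreate()

# Pin the session time zone so that timestamp literals such as
# '2022-01-10 14:00:00' are parsed the same way as in the Workspace
# (assuming the data is stored in UTC).
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.sql("select * from <table_name> where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'")
print(df.count())  # should now match the Workspace count

Without this setting, the session time zone defaults to the local JVM's time zone, so the same literal filter covers a shifted one-hour window on the client, which explains the smaller row count.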