Home > Blockchain >  DB Connect and workspace notebooks returns different results
DB Connect and workspace notebooks returns different results

Time:02-17

I'm using DB Connect 9.1.9. My cluster version is 9.1LTS with a single node (for test purposes).

My data is stored on a S3 as a delta table.

Running the following:

df = spark.sql("select * from <table_name> where runDate >= '2022-01-10 14:00:00' and runDate <= '2022-01-10 15:00:00'")

When I run it with DB Connect I get: 31. When I run it on a Databricks Workspace: 462.

Of course you can't check that numbers, I just wanted to find why do we have a difference.

If I remove the condition on the runDate, I have good results on both platform. So I deduced that it was the "runDate" fault, but I can't find why.

The schema:

StructType(List(StructField(id,StringType,False), 
                StructField(runDate,TimestampType,true)))

I have the same explain plan on both platform too.

Did I miss something about Timestamp usage ?

Update 1: it is funny, when I'm putting the count() inside the spark.sql("SELECT count(*) ...") directly I still have 31 rows. It might be the way db-connect translate the query to the cluster.

CodePudding user response:

The problem was the timezone associated to the Spark Session.

Add this after your spark session declaration (in case your dates are stored in UTC):

spark.conf.set("spark.sql.session.timeZone", "UTC")
  • Related