I am getting an error in my PySpark code when I use the query below in a spark.sql call. The same query works fine when I run it directly in BigQuery.
df = spark.sql('''
    SELECT h.src, AVG(h.norm_los) AS mbr_avg
    FROM UM h
    WHERE h.orig_dt < date_sub(current_date, INTERVAL 30 DAY)
      AND CAST((DATE_ADD(DATE(h.orig_dt), INTERVAL 18 MONTH)) AS DATE) >= date_sub(current_date, INTERVAL 30 DAY)
      AND h.src IS NOT NULL
    GROUP BY h.src_cumb_id
''')
When I run the same query with PySpark on a Dataproc cluster, Spark converts INTERVAL 30 DAY to interval 4 weeks 2 days and I get the error below.
cannot resolve 'date_sub(current_date(), interval 4 weeks 2 days)' due to data type mismatch: argument 2 requires int type, however, 'interval 4 weeks 2 days' is of calendarinterval type
Note: I am using option("intermediateFormat","orc") as well.
Any help on this would be appreciated.
CodePudding user response:
date_sub in Spark SQL and in BigQuery are different. BigQuery's DATE_SUB(date_expression, INTERVAL int64_expression date_part) accepts an INTERVAL as its second argument, whereas Spark SQL's date_sub takes a plain integer number of days:
date_sub(start_date, num_days) - Returns the date that is num_days before start_date.
Examples:
> SELECT date_sub('2016-07-30', 1);
2016-07-29
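You can confirm the integer signature quickly from PySpark (a minimal sketch, assuming an existing SparkSession named spark as in the question):

# date_sub with an integer number of days works in Spark SQL
spark.sql("SELECT date_sub(current_date(), 30) AS cutoff").show()

# date_sub with a BigQuery-style INTERVAL literal raises the
# data type mismatch error shown in the question:
# spark.sql("SELECT date_sub(current_date(), INTERVAL 30 DAY)").show()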
CodePudding user response:
With a few small changes to the above query, it is now working as expected.
Updated query:
df = spark.sql('''
    SELECT h.src, AVG(h.norm_los) AS mbr_avg
    FROM UM h
    WHERE h.orig_dt < date_sub(current_date, 30)
      AND CAST((add_months(DATE(h.orig_dt), 18)) AS DATE) >= date_sub(current_date, 30)
      AND h.src IS NOT NULL
    GROUP BY h.src_cumb_id
''')
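For reference, the same date logic can also be expressed with the DataFrame API instead of spark.sql. This is only a sketch, assuming UM is already registered as a table or view and using the column names from the query above (the non-aggregated src column is omitted since it is not part of the GROUP BY):

from pyspark.sql import functions as F

cutoff = F.date_sub(F.current_date(), 30)   # 30 days before today, integer offset

um = spark.table("UM")                       # assumes UM is a registered table/view
df = (
    um.where(F.col("orig_dt") < cutoff)
      .where(F.add_months(F.to_date("orig_dt"), 18) >= cutoff)  # DATE_ADD(..., INTERVAL 18 MONTH) equivalent
      .where(F.col("src").isNotNull())
      .groupBy("src_cumb_id")
      .agg(F.avg("norm_los").alias("mbr_avg"))
)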