I am getting an error in my PySpark code when I use the query below in a spark.sql call. The same query works fine when I run it directly in BigQuery.
df = spark.sql('''
    SELECT h.src, AVG(h.norm_los) AS mbr_avg
    FROM UM h
    WHERE h.orig_dt < date_sub(current_date, INTERVAL 30 DAY)
      AND CAST((DATE_ADD(DATE(h.orig_dt), INTERVAL 18 MONTH)) AS DATE) >= date_sub(current_date, INTERVAL 30 DAY)
      AND h.src IS NOT NULL
    GROUP BY h.src_cumb_id
''')
When I run the same query with PySpark on a Dataproc cluster, Spark converts INTERVAL 30 DAY to interval 4 weeks 2 days and I get the error below.
cannot resolve 'date_sub(current_date(), interval 4 weeks 2 days)' due to data type mismatch: argument 2 requires int type, however, 'interval 4 weeks 2 days' is of calendarinterval type
Note: I am using option("intermediateFormat","orc") as well.
Any help on this would be appreciated.
CodePudding user response:
date_sub in Spark SQL and in BigQuery are different. BigQuery's DATE_SUB(date_expression, INTERVAL int64_expression date_part) accepts an INTERVAL as its second argument, whereas Spark SQL's date_sub takes a plain integer number of days:
date_sub(start_date, num_days) - Returns the date that is num_days before start_date.
Examples:
> SELECT date_sub('2016-07-30', 1);
2016-07-29
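You can confirm the integer signature quickly from PySpark (a minimal sketch, assuming an existing SparkSession named spark as in the question):

# date_sub with an integer number of days works in Spark SQL
spark.sql("SELECT date_sub(current_date(), 30) AS cutoff").show()

# date_sub with a BigQuery-style INTERVAL literal raises the
# data type mismatch error shown in the question:
# spark.sql("SELECT date_sub(current_date(), INTERVAL 30 DAY)").show()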
CodePudding user response:
With a few small changes to the above query, it is now working as expected.
Updated query:
df = spark.sql('''
    SELECT h.src, AVG(h.norm_los) AS mbr_avg
    FROM UM h
    WHERE h.orig_dt < date_sub(current_date, 30)
      AND CAST((add_months(DATE(h.orig_dt), 18)) AS DATE) >= date_sub(current_date, 30)
      AND h.src IS NOT NULL
    GROUP BY h.src_cumb_id
''')
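For reference, the same date logic can also be expressed with the DataFrame API instead of spark.sql. This is only a sketch, assuming UM is already registered as a table or view and using the column names from the query above (the non-aggregated src column is omitted since it is not part of the GROUP BY):

from pyspark.sql import functions as F

cutoff = F.date_sub(F.current_date(), 30)   # 30 days before today, integer offset

um = spark.table("UM")                       # assumes UM is a registered table/view
df = (
    um.where(F.col("orig_dt") < cutoff)
      .where(F.add_months(F.to_date("orig_dt"), 18) >= cutoff)  # DATE_ADD(..., INTERVAL 18 MONTH) equivalent
      .where(F.col("src").isNotNull())
      .groupBy("src_cumb_id")
      .agg(F.avg("norm_los").alias("mbr_avg"))
)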