I want to add a new column based on an IF condition that checks whether a value appears in the result of a subquery. For example:
SELECT
  request_id,
  charge_click_cnt,
  IF(
    uuid IN (
      SELECT deviceid
      FROM t1
      WHERE dt BETWEEN '20210908' AND '20210915'
    ),
    'shop_user',
    'non_shop_user'
  ) AS shop_user
FROM t2
But it fails with an error like the following:
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class#failAnalysis:41 org.apache.spark.sql.catalyst.analysis.Analyzer#failAnalysis:91 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1#apply:316
So I am wondering whether there is a better way to set a conditional value like the code above.
Thanks in advance.
CodePudding user response:
If Spark allows a CASE statement with an EXISTS clause, the query below should work.
SELECT request_id,
charge_click_cnt,
CASE WHEN EXISTS(SELECT 1
FROM t1
               WHERE dt BETWEEN '20210908' AND '20210915'
                 AND uuid = deviceid) THEN 'shop_user'
ELSE 'non_shop_user'
END
FROM t2;
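Whether Spark accepts EXISTS inside a CASE expression depends on the Spark version, but the per-row semantics the query relies on can be sketched with plain Scala collections. The row types and sample values below are hypothetical stand-ins for t1 and t2, chosen only to illustrate the membership check:

```scala
// Hypothetical in-memory stand-ins for t1 and t2.
case class T1Row(deviceid: String, dt: String)
case class T2Row(requestId: String, chargeClickCnt: Long, uuid: String)

val t1 = Seq(T1Row("dev-1", "20210910"), T1Row("dev-2", "20210920"))
val t2 = Seq(T2Row("r1", 3L, "dev-1"), T2Row("r2", 5L, "dev-9"))

// EXISTS(...) is true when at least one t1 row in the date window
// matches the current t2 row's uuid; yyyyMMdd strings compare
// correctly with plain lexicographic ordering.
def shopUser(row: T2Row): String =
  if (t1.exists(r => r.dt >= "20210908" && r.dt <= "20210915" && row.uuid == r.deviceid))
    "shop_user"
  else
    "non_shop_user"

val labels = t2.map(r => r.requestId -> shopUser(r))
```

Here `labels` pairs each request with its label, mirroring the CASE expression's output column.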
CodePudding user response:
If you want to try the Spark DataFrame API, you can do something like this:
import org.apache.spark.sql.functions._

// Device ids seen in the date window, tagged with 1 so matches can be
// detected after the join. distinct() avoids duplicating t2 rows when
// a device appears more than once in t1.
val df1 = spark.sql("select deviceid as uuid, 1 as tag from t1 where dt between '20210908' and '20210915'").distinct()
val df2 = spark.sql("select request_id, charge_click_cnt, uuid from t2")

// After a left join, tag is 1 for matched rows and null otherwise.
val resultDf = df2.join(df1, Seq("uuid"), "left")
  .withColumn("shop_user", when(col("tag") === 1, "shop_user").otherwise("non_shop_user"))
  .drop("tag")
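The left-join-plus-tag idea above can be sketched with plain Scala collections, which makes it easy to see why unmatched rows end up as 'non_shop_user'. The device ids and requests below are illustrative, not from the original tables:

```scala
// t1 side after filtering: distinct device ids in the window, each tagged 1.
val taggedDevices: Map[String, Int] = Map("dev-1" -> 1, "dev-3" -> 1)

// t2 side: (request_id, charge_click_cnt, uuid) rows.
val requests = Seq(("r1", 3L, "dev-1"), ("r2", 5L, "dev-9"))

// Left join on uuid: a matched row carries Some(1), an unmatched row None,
// just like the tag column being 1 or null after the DataFrame left join.
val result = requests.map { case (requestId, cnt, uuid) =>
  val label = taggedDevices.get(uuid) match {
    case Some(1) => "shop_user"
    case _       => "non_shop_user"
  }
  (requestId, cnt, label)
}
```

The `when(col("tag") === 1, ...).otherwise(...)` expression plays the role of the pattern match here: null tags fall through to the `otherwise` branch.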