I am using the code below:
from pyspark.sql.functions import expr, when
df_logical = df_parcial.groupBy("customer_id", "person_id").agg(
    when(expr("bool_and(is_online_store)"), "Online")
    .when(expr("bool_and(!is_online_store)"), "Offline")
    .when(expr("bool_and(is_online_store)").isNull(), None)
    .otherwise("Hybrid")
    .alias("type_person")
)
And the rules I have are as follows:
- If a PersonId has one or more True values and no False values, then Online
- If a PersonId has one or more False values and no True values, then Offline
- If a PersonId has at least one False and at least one True, then Hybrid
But when I deployed the code to production, the following error appeared:
Undefined function: 'bool_and'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0
How can I get around this error?
Original table:
customer | PersonId | is_online_store |
---|---|---|
afabd2d2 | 4 | true |
afabd2d2 | 8 | true |
afabd2d2 | 3 | true |
afabd2d2 | 2 | false |
afabd2d2 | 4 | false |
Expected table:
customer | PersonId | type_person |
---|---|---|
afabd2d2 | 4 | Hybrid |
afabd2d2 | 8 | Online |
afabd2d2 | 3 | Online |
afabd2d2 | 2 | Offline |
Answer:
You get that error because the bool_and function is only available from Spark 3.0 onward, so your production environment is presumably running an older version. You can achieve the same result with a conditional count, like this:
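from pyspark.sql import functions as F
df_logical = df_parcial.groupBy("customer", "PersonId").agg(
    # Every row in the group is true -> Online
    F.when(
        F.count(F.when(F.col("is_online_store"), 1)) == F.count("*"), "Online"
    )
    # Every row in the group is false -> Offline
    .when(
        F.count(F.when(~F.col("is_online_store"), 1)) == F.count("*"), "Offline"
    )
    # Anything else (a mix of true and false) -> Hybrid
    .otherwise("Hybrid")
    .alias("type_person")
)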
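To sanity-check the conditional-count rules without a cluster, here is a minimal plain-Python sketch of the same logic applied to the sample rows from the question. The helper names `classify` and `classify_rows` are hypothetical and exist only for this illustration; this is not Spark code. It also assumes that a null flag counts toward the group total but matches neither branch, which is how the count comparison above should behave, so a group containing a null falls through to "Hybrid" rather than the None the question's third branch asked for.

```python
from collections import defaultdict

# Sample rows from the question: (customer, PersonId, is_online_store)
rows = [
    ("afabd2d2", 4, True),
    ("afabd2d2", 8, True),
    ("afabd2d2", 3, True),
    ("afabd2d2", 2, False),
    ("afabd2d2", 4, False),
]


def classify(flags):
    """Label one group by comparing per-value counts with the group size."""
    total = len(flags)                              # plays the role of count("*")
    n_true = sum(1 for f in flags if f is True)     # conditional count of true
    n_false = sum(1 for f in flags if f is False)   # conditional count of false
    if n_true == total:
        return "Online"
    if n_false == total:
        return "Offline"
    return "Hybrid"


def classify_rows(rows):
    """Group the flags by (customer, PersonId) and label each group."""
    groups = defaultdict(list)
    for customer, person_id, flag in rows:
        groups[(customer, person_id)].append(flag)
    return {key: classify(flags) for key, flags in groups.items()}


result = classify_rows(rows)
for (customer, person_id), label in sorted(result.items()):
    print(customer, person_id, label)
```

PersonId 4 has both a true and a false row, so it is the only group labelled Hybrid, which matches the expected table in the question. If groups that contain nulls need a separate label, an extra branch would have to be added before the fall-through.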