Undefined function bool_and in Pyspark


I am using the code below:

df_logical = df_parcial.groupBy("customer_id", "person_id").agg(
when(expr("bool_and(is_online_store)"), "Online")
.when(expr("bool_and(!is_online_store)"), "Offline")
.when(expr("bool_and(is_online_store)").isNull(), None)
.otherwise("Hybrid").alias("type_person"))

And the rules I have are as follows:

  • If a PersonId group has at least 1 True and no False, then Online
  • If a PersonId group has at least 1 False and no True, then Offline
  • If a PersonId group has at least 1 False AND at least 1 True, then Hybrid
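
To make the rules concrete, here is how I would express them in plain Python, outside of Spark (the classify helper is just for illustration; it only covers the three cases above):

```python
def classify(flags):
    """Classify a group of is_online_store values per the three rules."""
    has_true = any(f is True for f in flags)
    has_false = any(f is False for f in flags)
    if has_true and not has_false:
        return "Online"
    if has_false and not has_true:
        return "Offline"
    return "Hybrid"

# PersonId 4 has both a true and a false row, so it is Hybrid
print(classify([True, False]))  # Hybrid
print(classify([True, True]))   # Online
print(classify([False]))        # Offline
```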

But when I deployed the code to production, the following error appeared:

Undefined function: 'bool_and'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0

How can I get around this error?

Original table:

customer  PersonId  is_online_store
afabd2d2  4         true
afabd2d2  8         true
afabd2d2  3         true
afabd2d2  2         false
afabd2d2  4         false

Expected table:

customer  PersonId  type_person
afabd2d2  4         Hybrid
afabd2d2  8         Online
afabd2d2  3         Online
afabd2d2  2         Offline

CodePudding user response:

You get that error because the bool_and function is only available since Spark 3.0. On earlier versions you can achieve the same result with conditional counts, like this:

from pyspark.sql import functions as F

df_logical = df_parcial.groupBy("customer", "PersonId").agg(
    F.when(
        # count of true rows equals the group size -> every row is true
        F.count(F.when(F.col("is_online_store") == "true", 1)) == F.count("*"), "Online"
    ).when(
        # count of false rows equals the group size -> every row is false
        F.count(F.when(F.col("is_online_store") == "false", 1)) == F.count("*"), "Offline"
    ).otherwise("Hybrid").alias("type_person")
)
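
Worth noting: bool_and over a boolean column is equivalent to taking its minimum (and bool_or to its maximum), because false sorts before true. So on Spark < 3.0 you could likely also use F.min("is_online_store") / F.max("is_online_store") in place of the conditional counts (assuming the column is a real boolean type). The equivalence, illustrated in plain Python:

```python
# min over booleans behaves like bool_and, max like bool_or,
# because False sorts before True.
groups = {
    "all_true": [True, True, True],    # -> Online
    "all_false": [False, False],       # -> Offline
    "mixed": [True, False, True],      # -> Hybrid
}
for name, flags in groups.items():
    assert min(flags) == all(flags)  # bool_and equivalent
    assert max(flags) == any(flags)  # bool_or equivalent
    label = "Online" if min(flags) else ("Offline" if not max(flags) else "Hybrid")
    print(name, label)
```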