how to count field with condition by spark

Time:04-07

I have a DataFrame with an enum field named A (values are 0 or 1) and another field B. I would like to implement the scenario below:

if `B` is null:
   count(when `A` is 0) and set a column name `xx` 
   count(when `A` is 1) and set a column name `yy` 
if `B` is not null:
   count(when `A` is 0) and set a column name `zz` 
   count(when `A` is 1) and set a column name `mm` 

How can I do this with Spark in Scala?

CodePudding user response:

This will probably be updated

As far as I understand, this is a grouping scenario. At first I thought only one case could happen (for instance, B is null and A is 0, since you're renaming columns), but since you want both count(A is 0) AND count(A is 1), this looks like a grouping scenario; let me know if I'm wrong. So, based on my assumptions:

// count() on a grouped DataFrame takes no column argument
val originalResult = df.groupBy(col("B"), col("A")).count()
// your renaming logic here, for instance:
// query your DataFrame to find the count of A when B is null and A is 0
val bIsNullAndAIs0Count: Long = ??? // query result
val finalDF = if (bIsNullAndAIs0Count > 0)
  originalResult.withColumnRenamed("count", "xx")
else ??? // some other logic
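A runnable sketch of this grouping approach might look like the following. This is only an illustration under assumed inputs: the SparkSession setup and the sample data (A as the 0/1 enum, B as a nullable string) are hypothetical, not from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouping-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: A is the 0/1 enum, B is nullable
val df = List(
  (0, Option.empty[String]),
  (1, None),
  (0, Some("hello")),
  (1, Some("world"))
).toDF("A", "B")

// One row per (B is null?, A) combination with its count
df.groupBy(col("B").isNull.as("bIsNull"), col("A"))
  .count()
  .show(false)
```

Grouping on `col("B").isNull` rather than on B itself collapses all non-null B values into one group, which matches the question's null / not-null split.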

CodePudding user response:

It's possible to conditionally populate columns in this way; however, the final output DataFrame needs a consistent schema.

Assuming all of the scenarios you detailed are possible in one DataFrame, I would suggest creating each of the four columns: "xx", "yy", "zz" and "mm" and conditionally populating them.

In the below example I've populated the values with either "found" or "", primarily to make it easy to see where the values are populated. Using true and false here, or another enum, would likely make more sense in the real world.

Starting with a DataFrame (since you didn't specify the type of "B", I have gone for an Option[String] (nullable) for this example):

// requires a SparkSession named `spark` in scope
import spark.implicits._

val df = List(
    (0, None),
    (1, None),
    (0, Some("hello")),
    (1, Some("world"))
).toDF("A", "B")

df.show(false)

gives:

+---+-----+
|A  |B    |
+---+-----+
|0  |null |
|1  |null |
|0  |hello|
|1  |world|
+---+-----+

and to create the columns:

import org.apache.spark.sql.functions.{col, when}

df
    .withColumn("xx", when(col("B").isNull && col("A") === 0, "found").otherwise(""))
    .withColumn("yy", when(col("B").isNull && col("A") === 1, "found").otherwise(""))
    .withColumn("zz", when(col("B").isNotNull && col("A") === 0, "found").otherwise(""))
    .withColumn("mm", when(col("B").isNotNull && col("A") === 1, "found").otherwise(""))
    .show(false)

gives:

+---+-----+-----+-----+-----+-----+
|A  |B    |xx   |yy   |zz   |mm   |
+---+-----+-----+-----+-----+-----+
|0  |null |found|     |     |     |
|1  |null |     |found|     |     |
|0  |hello|     |     |found|     |
|1  |world|     |     |     |found|
+---+-----+-----+-----+-----+-----+
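If what you actually want is the four counts themselves rather than per-row flags, one hedged alternative (a sketch, not part of the answer above) is a single conditional aggregation. The SparkSession setup and sample data here are assumed for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("conditional-counts")
  .getOrCreate()
import spark.implicits._

// Same shape of sample data as the example above
val df = List(
  (0, Option.empty[String]),
  (1, None),
  (0, Some("hello")),
  (1, Some("world"))
).toDF("A", "B")

// count(when(cond, ...)) counts only rows matching cond, because
// `when` without `otherwise` yields null and count ignores nulls.
// With this sample data each of the four counts is 1.
df.agg(
  count(when(col("B").isNull    && col("A") === 0, true)).as("xx"),
  count(when(col("B").isNull    && col("A") === 1, true)).as("yy"),
  count(when(col("B").isNotNull && col("A") === 0, true)).as("zz"),
  count(when(col("B").isNotNull && col("A") === 1, true)).as("mm")
).show(false)
```

This produces one row with columns xx, yy, zz, and mm, which matches the column names requested in the question.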