I have a DataFrame with duplicate rows in Column A that have different values in Column B.
Example of my data:
| Column A | Column B |
| -------- | -------- |
| APPLE | RED |
| APPLE | GREEN |
| GRAPE | BLACK |
| BANANA | RED |
| BANANA | BLUE |
| BANANA | GREEN |
| BANANA | GREEN |
I want to count the distinct values in Column B, grouped and sorted by Column A.
Expected output:
| Column A | Column B |
| -------- | -------- |
| APPLE | 2 |
| GRAPE | 1 |
| BANANA | 3 |
Any pointers on how to approach this problem? Either PySpark or SQL can be used.
CodePudding user response:
This should work:
```python
from pyspark.sql.functions import count_distinct, col

df \
    .groupBy('Column A') \
    .agg(count_distinct(col('Column B')).alias('Column B'))
```
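Since the question also asks to sort by Column A, appending an `orderBy` covers that (a minimal sketch, assuming alphabetical order is acceptable). Note that `count_distinct` was added in Spark 3.2; on older versions use `countDistinct` instead:

```python
from pyspark.sql.functions import count_distinct, col

# Group by Column A, count the distinct Column B values, then sort the result
df \
    .groupBy('Column A') \
    .agg(count_distinct(col('Column B')).alias('Column B')) \
    .orderBy('Column A')
```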
CodePudding user response:
Alternative with a window function:
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

df = spark.createDataFrame(
    [("APPLE", "RED"),
     ("APPLE", "GREEN"),
     ("GRAPE", "BLACK"),
     ("BANANA", "RED"),
     ("BANANA", "BLUE"),
     ("BANANA", "GREEN"),
     ("BANANA", "GREEN")],
    ["col_a", "col_b"]
)

# Rank col_b values within each col_a partition; duplicates share a rank,
# so the maximum dense_rank per group equals the distinct count.
windowSpec = Window.partitionBy("col_a").orderBy("col_b")

(df.withColumn("dense_rank", dense_rank().over(windowSpec))
   .groupBy("col_a")
   .max("dense_rank")
   .show(truncate=False))
```
Here are the results:
```
+------+---------------+
|col_a |max(dense_rank)|
+------+---------------+
|APPLE |2              |
|BANANA|3              |
|GRAPE |1              |
+------+---------------+
```
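Since the question allows SQL as well, the same result can be produced with Spark SQL (a minimal sketch, assuming the DataFrame above is registered as a temporary view; the view name `fruits` is illustrative):

```python
# Register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("fruits")

spark.sql("""
    SELECT col_a,
           COUNT(DISTINCT col_b) AS distinct_col_b
    FROM fruits
    GROUP BY col_a
    ORDER BY col_a
""").show(truncate=False)
```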