I have a DataFrame with duplicate rows in Column A that have different values in Column B.
Example of my data:
| Column A | Column B |
| -------- | -------- |
| APPLE | RED |
| APPLE | GREEN |
| GRAPE | BLACK |
| BANANA | RED |
| BANANA | BLUE |
| BANANA | GREEN |
| BANANA | GREEN |
I want to count the distinct values in Column B, grouped and sorted by Column A.
Expected output:
| Column A | Column B |
| -------- | -------- |
| APPLE | 2 |
| GRAPE | 1 |
| BANANA | 3 |
Any pointers on how to approach this problem? Either PySpark or SQL can be used.
CodePudding user response:
This should work:
```python
from pyspark.sql.functions import count_distinct, col

df \
    .groupBy('Column A') \
    .agg(count_distinct(col('Column B')).alias('Column B'))
```
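Since the question also asks to sort by Column A, appending an `orderBy` covers that (a minimal sketch, assuming alphabetical order is acceptable). Note that `count_distinct` was added in Spark 3.2; on older versions use `countDistinct` instead:

```python
from pyspark.sql.functions import count_distinct, col

# Group by Column A, count the distinct Column B values, then sort the result
df \
    .groupBy('Column A') \
    .agg(count_distinct(col('Column B')).alias('Column B')) \
    .orderBy('Column A')
```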
CodePudding user response:
Alternative with a window function:
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

df = spark.createDataFrame(
    [("APPLE", "RED"),
     ("APPLE", "GREEN"),
     ("GRAPE", "BLACK"),
     ("BANANA", "RED"),
     ("BANANA", "BLUE"),
     ("BANANA", "GREEN"),
     ("BANANA", "GREEN")],
    ["col_a", "col_b"]
)

# Rank col_b values within each col_a partition; duplicates share a rank,
# so the maximum dense_rank per group equals the distinct count.
windowSpec = Window.partitionBy("col_a").orderBy("col_b")

(df.withColumn("dense_rank", dense_rank().over(windowSpec))
   .groupBy("col_a")
   .max("dense_rank")
   .show(truncate=False))
```
Here are the results:
```
+------+---------------+
|col_a |max(dense_rank)|
+------+---------------+
|APPLE |2              |
|BANANA|3              |
|GRAPE |1              |
+------+---------------+
```
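Since the question allows SQL as well, the same result can be produced with Spark SQL (a minimal sketch, assuming the DataFrame above is registered as a temporary view; the view name `fruits` is illustrative):

```python
# Register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("fruits")

spark.sql("""
    SELECT col_a,
           COUNT(DISTINCT col_b) AS distinct_col_b
    FROM fruits
    GROUP BY col_a
    ORDER BY col_a
""").show(truncate=False)
```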