How to filter by count after groupby in Pyspark dataframe?


I have a pyspark dataframe like this.

data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.show()
+---+----+
| id|name|
+---+----+
|  1|   a|
|  2|   a|
|  3|   b|
|  4|   a|
+---+----+

I group this dataframe by the name column.

df.groupBy("name").count().show()
+----+-----+
|name|count|
+----+-----+
|   a|    3|
|   b|    1|
+----+-----+

Now, after grouping the dataframe, I want to keep only the names whose count is lower than 3. For example, I am looking to get something like this:

+----+-----+
|name|count|
+----+-----+
|   b|    1|
+----+-----+

CodePudding user response:

Try this:

from pyspark.sql import functions as F

data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.groupBy("name").count().where(F.col('count') < 3).show()

F is just an alias for the functions module; you can use any identifier you want, but by convention it is usually written as F or func.
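
As a side note (not part of the original answer): on a DataFrame, filter is an alias of where, and you can also give the aggregated column an explicit name with agg/alias instead of relying on the default "count" column. A minimal self-contained sketch along those lines (the cnt alias is just illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data, ["id", "name"])

# filter() is an alias of where(), so this is equivalent to the answer above
df.groupBy("name").count().filter(F.col("count") < 3).show()

# or aggregate with an explicit alias instead of the default "count" column
df.groupBy("name").agg(F.count("id").alias("cnt")).where(F.col("cnt") < 3).show()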

result:

+----+-----+
|name|count|
+----+-----+
|   b|    1|
+----+-----+