Home > Mobile >  how I can use count and groupby at the same time?
how I can use count and groupby at the same time?

Time:08-27

I want to count and groupby based on count, how I can do that?

For example I want to groupby col1 on rows whose count is less than 3.

Col1          col2
"sal"         "salis"
"sal"         "salmon"
"bar"         "baran"
"dar"         "darya"
"sal"         "sald"
"bar"         "bard"

result

groupby   collect
"bar"      ["baran","bard"]
"dar"      ["darya"]

I know the following code gives me count of each element:

df1_us2.groupBy('col1').count().orderBy('count',ascending=False)

and use the following for groupby. but how I can combine these two?

df1_us2.groupBy('col1').agg(collect_list('col2').alias('collect'))

CodePudding user response:

First groupby and aggregate col2 using collect_list then filter the rows where size of list is less than 3

(
    df1_us2
    .groupby('col1')
    .agg(F.collect_list('col2').alias('col2'))
    .where(F.size('col2') < 3)
)

 ----- ----------------- 
| col1|             col2|
 ----- ----------------- 
|"bar"|["baran", "bard"]|
|"dar"|        ["darya"]|
 ----- ----------------- 
  • Related