I want to count and groupby based on count, how I can do that?
For example I want to groupby
col1
on rows whose count
is less than 3.
Col1 col2
"sal" "salis"
"sal" "salmon"
"bar" "baran"
"dar" "darya"
"sal" "sald"
"bar" "bard"
result
groupby collect
"bar" ["baran","bard"]
"dar" ["darya"]
I know the following code gives me count of each element:
df1_us2.groupBy('col1').count().orderBy('count',ascending=False)
and use the following for groupby. but how I can combine these two?
df1_us2.groupBy('col1').agg(collect_list('col2').alias('collect'))
CodePudding user response:
First groupby
and aggregate col2
using collect_list
then filter the rows where size
of list is less than 3
(
df1_us2
.groupby('col1')
.agg(F.collect_list('col2').alias('col2'))
.where(F.size('col2') < 3)
)
----- -----------------
| col1| col2|
----- -----------------
|"bar"|["baran", "bard"]|
|"dar"| ["darya"]|
----- -----------------