Pandas to Pyspark environment


newlist = []
for column in new_columns:
    # rows where consecutive values in this column increase by exactly 1
    count12 = new_df.loc[new_df[column].diff() == 1]
    # count the number of rows in each group
    new_df2 = new_df2.groupby(['my_id', 'friend_id', 'family_id', 'colleage_id']).apply(len)

There is no option available in PySpark for getting the length (row count) of each group.

How can we achieve the same thing in PySpark?

Thanks in advance.

CodePudding user response:

Essentially, apply(len) here is just an aggregation that counts the elements of each group produced by groupby. You can do the same thing with basic PySpark syntax:

import pyspark.sql.functions as F

(df
    .groupBy('my_id', 'friend_id', 'family_id', 'colleage_id')
    .agg(F.count('*'))   # number of rows in each group, same as apply(len)
    .show()
)
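If you don't need the generic agg form, GroupedData.count() gives the same result with the column already named count. A minimal, self-contained sketch; the SparkSession setup and sample rows below are illustrative assumptions, only the column names come from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data using the column names from the question
df = spark.createDataFrame(
    [(1, 10, 100, 1000), (1, 10, 100, 1000), (2, 20, 200, 2000)],
    ['my_id', 'friend_id', 'family_id', 'colleage_id'],
)

# Shorthand equivalent of .agg(F.count('*')): one row per group, with a `count` column
df.groupBy('my_id', 'friend_id', 'family_id', 'colleage_id').count().show()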