I have a very large dataset (160k rows). I want to analyse each subset of rows that share the same ID, and I only care about subsets that are at least 30 rows long.
What approach should I use?
I did the same task in R as follows (from what I can tell, this can't be translated directly to PySpark): sort the data in ascending order, then check whether the next row has the same ID as the current one; if yes, n = n + 1; if no, run my analysis and save the results. Rinse and repeat over the whole length of the data frame.
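Roughly, in Python terms, what I did in R was something like the sketch below (names are illustrative only; analyse stands for my actual analysis and rows for a list of (ID, value) pairs already sorted by ID):

def analyse(values):
    # placeholder for the real per-ID analysis, e.g. a mean
    return sum(values) / len(values)

def analyse_per_id(rows, min_rows=30):
    results = {}
    current_id, block = None, []
    for row_id, value in rows + [(None, None)]:  # sentinel flushes the last block
        if row_id == current_id:
            block.append(value)                  # same ID as before: keep accumulating
        else:
            if current_id is not None and len(block) >= min_rows:
                results[current_id] = analyse(block)
            current_id, block = row_id, [value]
    return results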
CodePudding user response:
One easy method is to group by 'ID' and collect the columns that are needed for your analysis.
If you need just one column (this assumes pyspark.sql.functions is imported as F):
grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))
If you need multiple columns, you can use struct:
grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
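Since you only care about IDs that occur at least 30 times, you can drop the smaller groups right after the aggregation; a minimal sketch, reusing the grouped_df from above:

from pyspark.sql import functions as F

# size() gives the length of the collected list, i.e. the number of rows per ID
grouped_df = grouped_df.filter(F.size("for_analysis") >= 30)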
Then, once you have your data collected per ID, you can use a UDF to perform your more elaborate analysis:
grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))
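As a sketch of what such a UDF might look like (analyse_udf here is only a placeholder that averages a single numeric column, as in the first example; with the struct variant you would read fields like value['column_m'] instead, and the return type must match whatever your analysis produces):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def analyse_udf(values):
    # 'values' is the collected list for one ID; replace this mean with your real analysis
    return float(sum(values)) / len(values) if values else None

grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis'))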