How to analyse rows with the same ID in PySpark?


I have a very large dataset (160k rows). I want to analyse each subset of rows that share the same ID, but I only care about subsets that are at least 30 rows long.

What approach should I use?

I did the same task in R as follows (and as far as I can tell, this approach can't be translated to PySpark): sort in ascending order, then check whether the next row has the same ID as the current one; if yes, n = n + 1; if no, run my analysis and save the results. Rinse and repeat for the whole length of the data frame.
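
For context, the bookkeeping this R loop does by hand (counting consecutive rows with the same ID) can be expressed declaratively in PySpark with a window count. A minimal sketch, with made-up data and assuming the grouping column is named ID:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input: an ID column plus whatever value columns exist.
    df = spark.createDataFrame(
        [(1, 10.0), (1, 11.5), (2, 9.3)],
        ["ID", "value"],
    )

    # Count the rows in each ID partition, then keep IDs with >= 30 rows.
    group_size = F.count("*").over(Window.partitionBy("ID"))
    large_groups = df.withColumn("group_size", group_size).filter("group_size >= 30")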

CodePudding user response:

One easy method is to group by 'ID' and collect the columns you need for your analysis (the snippets below assume from pyspark.sql import functions as F; a combined, runnable sketch follows the list).

  1. If you need just one column:

    grouped_df = original_df.groupby('ID').agg(F.collect_list("column_m").alias("for_analysis"))

  2. If you need multiple columns, you can use struct:

    grouped_df = original_df.groupby('ID').agg(F.collect_list(F.struct("column_m", "column_n", "column_o")).alias("for_analysis"))
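
Putting the pieces above together, here is a minimal runnable sketch. The data is made up, the column names match the snippets, and F.size adds the at-least-30-rows filter from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data shaped like the question's: an ID plus value columns.
    original_df = spark.createDataFrame(
        [("a", 1.0, 2.0, 3.0), ("a", 1.5, 2.5, 3.5), ("b", 9.0, 8.0, 7.0)],
        ["ID", "column_m", "column_n", "column_o"],
    )

    grouped_df = (
        original_df
        .groupby("ID")
        .agg(F.collect_list(F.struct("column_m", "column_n", "column_o"))
             .alias("for_analysis"))
        # Keep only the IDs that occur at least 30 times, per the question.
        .filter(F.size("for_analysis") >= 30)
    )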
    

Then, once you have your data collected per ID, you can use a UDF to perform your analysis:

    grouped_df = grouped_df.withColumn('analysis_result', analyse_udf('for_analysis', ...))
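
A minimal sketch of what such a UDF could look like. The name analyse_udf is kept from above; the mean-of-column_m body is a hypothetical stand-in for your real computation:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Hypothetical analysis: average column_m over the structs collected per ID.
    @F.udf(returnType=DoubleType())
    def analyse_udf(rows):
        values = [r["column_m"] for r in rows]
        return sum(values) / len(values) if values else None

    grouped_df = grouped_df.withColumn("analysis_result", analyse_udf("for_analysis"))

If the groups are large, collecting them into arrays can strain memory; GroupedData.applyInPandas is a common alternative that hands each group to Python as a whole pandas DataFrame without a collect_list step.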