So, I have a dataset with some repeated data, which I need to remove. For some reason, the row I need to keep is always the second one for each date:
--> df_apps
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 1000 | 5000
2021-01-10 | FACEBOOK | 20000 | 900000
2021-02-10 | FACEBOOK | 9000 | 72000
2021-01-11 | FACEBOOK | 4000 | 2000
2021-01-11 | FACEBOOK | 40000 | 85000
2021-02-11 | FACEBOOK | 1000 | 2000
In pandas, it'd be as simple as df_apps_grouped = df_apps.groupby('DATE').nth(1)
and I'd get the result below:
--> df_apps_grouped
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 20000 | 900000
2021-01-11 | FACEBOOK | 40000 | 85000
But for one specific project I must use PySpark, and I can't get this result with it. Could you please help me with this?
Thanks!
CodePudding user response:
What you are looking for is row_number applied over a window partitioned by DATE and ordered by DATE. However, because both rows of a given date share the same DATE value and because of the distributed nature of Spark, we can't guarantee that during ordering
2021-01-10 | FACEBOOK | 1000 | 5000
will always come before
2021-01-10 | FACEBOOK | 20000 | 900000
I would suggest including a line number if you are reading from a file, and ordering based on that line number. Refer here for how to achieve this in Spark.
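As a minimal sketch of that idea (assuming the data comes from a CSV file; the apps.csv path and the LINE_NO column name are mine, not from the question), zipWithIndex attaches each row's position in the order it was read, before any shuffle, and the window is then ordered by that position:
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input file; adjust the options/schema to your data.
raw = spark.read.csv('apps.csv', header=True, inferSchema=True)

# Attach a stable line number: zipWithIndex numbers rows in the order they
# were read (within each partition, in partition order), before any shuffle.
indexed = (raw.rdd
              .zipWithIndex()
              .map(lambda pair: tuple(pair[0]) + (pair[1],))
              .toDF(raw.columns + ['LINE_NO']))

# Keep the second row per DATE: row_number is 1-based, so 2 matches pandas nth(1).
w = Window.partitionBy('DATE').orderBy('LINE_NO')
df_apps_grouped = (indexed
                   .withColumn('row_n', F.row_number().over(w))
                   .filter(F.col('row_n') == 2)
                   .drop('row_n', 'LINE_NO'))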
CodePudding user response:
You'll want to do:
from pyspark.sql import Window, functions as F

# Ideally order by a column that actually distinguishes the rows within a date.
w = Window.partitionBy('DATE').orderBy('DATE')
# row_number is 1-based, so 2 corresponds to pandas nth(1) (the second row).
df = df.withColumn('row_n', F.row_number().over(w)).filter('row_n == 2')
Because of Spark's distributed nature the rows within each date come back in no guaranteed order, and the "second" row might be different the next time you query. This is why the window needs an order by on a column that actually distinguishes the rows (the date alone can't break ties); that is what makes sure you get the same result every time.
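If there is no natural ordering column, one variant of the approach above (the seq column name is my own, not from the answer) is to tag each row with monotonically_increasing_id() before anything shuffles the data; the ids are not consecutive, but they increase with the original read order, so the window can order by them:
from pyspark.sql import Window, functions as F

# Tag rows before any shuffle; ids grow with partition index and with the
# position inside each partition, so they track the original read order.
df_tagged = df_apps.withColumn('seq', F.monotonically_increasing_id())

w = Window.partitionBy('DATE').orderBy('seq')
df_apps_grouped = (df_tagged
                   .withColumn('row_n', F.row_number().over(w))
                   .filter(F.col('row_n') == 2)   # second row per DATE, like nth(1)
                   .drop('row_n', 'seq'))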