So, I have a dataset with some repeated data, which I need to remove. For some reason, the row I need to keep is always the second one for each date:
--> df_apps
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 1000 | 5000
2021-01-10 | FACEBOOK | 20000 | 900000
2021-02-10 | FACEBOOK | 9000 | 72000
2021-01-11 | FACEBOOK | 4000 | 2000
2021-01-11 | FACEBOOK | 40000 | 85000
2021-02-11 | FACEBOOK | 1000 | 2000
In pandas, it'd be as simple as df_apps_grouped = df_apps.groupby('DATE').nth(1)
and I'd get the result below:
--> df_apps_grouped
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 20000 | 900000
2021-01-11 | FACEBOOK | 40000 | 85000
But for one specific project I must use PySpark, and I can't get this result with it. Could you please help me with this?
Thanks!
CodePudding user response:
What you are looking for is row_number applied over a window partitioned by DATE and ordered by DATE. However, because both rows of a given date share the same DATE value and because of the distributed nature of Spark, we can't guarantee that during ordering
2021-01-10 | FACEBOOK | 1000 | 5000
will always come before
2021-01-10 | FACEBOOK | 20000 | 900000
I would suggest including a line number if you are reading from a file, and ordering based on that line number. Refer here for how to achieve this in Spark.
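As a minimal sketch of that idea (assuming the data comes from a CSV file; the apps.csv path and the LINE_NO column name are mine, not from the question), zipWithIndex attaches each row's position in the order it was read, before any shuffle, and the window is then ordered by that position:
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input file; adjust the options/schema to your data.
raw = spark.read.csv('apps.csv', header=True, inferSchema=True)

# Attach a stable line number: zipWithIndex numbers rows in the order they
# were read (within each partition, in partition order), before any shuffle.
indexed = (raw.rdd
              .zipWithIndex()
              .map(lambda pair: tuple(pair[0]) + (pair[1],))
              .toDF(raw.columns + ['LINE_NO']))

# Keep the second row per DATE: row_number is 1-based, so 2 matches pandas nth(1).
w = Window.partitionBy('DATE').orderBy('LINE_NO')
df_apps_grouped = (indexed
                   .withColumn('row_n', F.row_number().over(w))
                   .filter(F.col('row_n') == 2)
                   .drop('row_n', 'LINE_NO'))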
CodePudding user response:
You'll want to do:
from pyspark.sql import Window, functions as F

# Ideally order by a column that actually distinguishes the rows within a date.
w = Window.partitionBy('DATE').orderBy('DATE')
# row_number is 1-based, so 2 corresponds to pandas nth(1) (the second row).
df = df.withColumn('row_n', F.row_number().over(w)).filter('row_n == 2')
Because of Spark's distributed nature the rows within each date come back in no guaranteed order, and the "second" row might be different the next time you query. This is why the window needs an order by on a column that actually distinguishes the rows (the date alone can't break ties); that is what makes sure you get the same result every time.
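If there is no natural ordering column, one variant of the approach above (the seq column name is my own, not from the answer) is to tag each row with monotonically_increasing_id() before anything shuffles the data; the ids are not consecutive, but they increase with the original read order, so the window can order by them:
from pyspark.sql import Window, functions as F

# Tag rows before any shuffle; ids grow with partition index and with the
# position inside each partition, so they track the original read order.
df_tagged = df_apps.withColumn('seq', F.monotonically_increasing_id())

w = Window.partitionBy('DATE').orderBy('seq')
df_apps_grouped = (df_tagged
                   .withColumn('row_n', F.row_number().over(w))
                   .filter(F.col('row_n') == 2)   # second row per DATE, like nth(1)
                   .drop('row_n', 'seq'))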