How to remove the 0s in the id_sum column by a sequence from 1 to n in pyspark dataframe

Time:02-02

I have the following PySpark DataFrame df_model:

id_client  id_sku
1111       4444
1111       4444
2222       6666
2222       6666
3333       777

And I use this code to generate the id_sum column:

t = df_model.collect()[0][0]
w = Window.partitionBy('id_client').rowsBetween(Window.unboundedPreceding, 0).orderBy('col')
df = df_model.withColumn(
    'id_sum',
    f.sum(f.when(f.col('id_client') != t, 1).otherwise(0)).over(w)
)

and my output is:

id_client  id_sku  id_sum
1111       4444    0
1111       4444    0
2222       6666    1
2222       6666    2
3333       777     1

But I want to obtain the following result:

id_client  id_sku  id_sum
1111       4444    1
1111       4444    2
2222       6666    1
2222       6666    2
3333       777     1

My question is: what's wrong with the code?

Actually I'm trying to use a window function, and my code is like this:

t = df_model.collect()[0][0]
w = Window.partitionBy('id_client').rowsBetween(Window.unboundedPreceding, 0).orderBy('id_sku')
df = df_model.withColumn(
    'id_sum',
    f.sum(f.when(f.col('id_client') != t, 1).otherwise(0)).over(w)
)
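To see why this produces zeros for the first client, here is a plain-Python simulation (not Spark) of the same logic: t is the id_client of the very first row, so the when() condition is False for every row of that client, and its running sum never leaves 0. The rows list below is the sample data from the question.

```python
# Simulate the questioner's window expression in plain Python.
rows = [(1111, 4444), (1111, 4444), (2222, 6666), (2222, 6666), (3333, 777)]

t = rows[0][0]  # like df_model.collect()[0][0] -> 1111

result = []
running = {}  # per-partition (per id_client) running sum
for id_client, id_sku in rows:
    # when(id_client != t, 1).otherwise(0), summed over the partition so far
    running[id_client] = running.get(id_client, 0) + (1 if id_client != t else 0)
    result.append((id_client, id_sku, running[id_client]))

print(result)
# [(1111, 4444, 0), (1111, 4444, 0), (2222, 6666, 1), (2222, 6666, 2), (3333, 777, 1)]
```

This reproduces exactly the unwanted output above: the zeros come from comparing against a single collected value rather than numbering rows within each partition.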

CodePudding user response:

You can try like this:

from pyspark.sql.functions import row_number

w = Window.partitionBy('id_client').orderBy('id_sku')
df = df_model.withColumn("id_sum", row_number().over(w))

The row_number() window function assigns a sequential row number, starting from 1, to each row within its window partition.
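As a plain-Python sketch (not Spark) of what row_number() over a window partitioned by id_client and ordered by id_sku computes: each partition gets its own counter that restarts at 1. The rows list is the sample data from the question.

```python
from itertools import groupby

rows = [(1111, 4444), (1111, 4444), (2222, 6666), (2222, 6666), (3333, 777)]

result = []
# Sorting the tuples orders by id_client then id_sku; groupby splits per client.
for _, group in groupby(sorted(rows), key=lambda r: r[0]):
    for n, (id_client, id_sku) in enumerate(group, start=1):
        result.append((id_client, id_sku, n))

print(result)
# [(1111, 4444, 1), (1111, 4444, 2), (2222, 6666, 1), (2222, 6666, 2), (3333, 777, 1)]
```

This matches the desired output in the question. Note that row_number() needs only partitioning and ordering; rank-style functions like row_number() do not take a custom rowsBetween frame, which is why the window here drops that clause.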
