I have the following pyspark dataframe df_model:
id_client | id_sku |
---|---|
1111 | 4444 |
1111 | 4444 |
2222 | 6666 |
2222 | 6666 |
3333 | 777 |
And I use this code to generate the frequency column id_sum:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

t = df_model.collect()[0][0]
w = Window.partitionBy('id_client').rowsBetween(Window.unboundedPreceding, 0).orderBy('col')
df = df_model.withColumn('id_sum',
                         f.sum(f.when(f.col('id_client') != t, 1)
                                .otherwise(0))
                          .over(w))
And my output is:
id_client | id_sku | id_sum |
---|---|---|
1111 | 4444 | 0 |
1111 | 4444 | 0 |
2222 | 6666 | 1 |
2222 | 6666 | 2 |
3333 | 777 | 1 |
But I want to obtain the following result:
id_client | id_sku | id_sum |
---|---|---|
1111 | 4444 | 1 |
1111 | 4444 | 2 |
2222 | 6666 | 1 |
2222 | 6666 | 2 |
3333 | 777 | 1 |
My question is: what's wrong with the code?
Actually, I'm trying to use a window function, and my code is like this:
t = df_model.collect()[0][0]
w = Window.partitionBy('id_client').rowsBetween(Window.unboundedPreceding, 0).orderBy('id_sku')
df = df_model.withColumn('id_sum',
                         f.sum(f.when(f.col('id_client') != t, 1)
                                .otherwise(0))
                          .over(w))
CodePudding user response:
You can try something like this:
df_model.withColumn("id_sum", row_number().over(w))
The row_number() window function assigns a sequential row number, starting from 1, to each row within its window partition.
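Putting it together, here is a minimal sketch. The sample data is recreated from the question's table, and the window w is partitioned by id_client and ordered by id_sku, as in the question's second snippet:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question's df_model
df_model = spark.createDataFrame(
    [(1111, 4444), (1111, 4444), (2222, 6666), (2222, 6666), (3333, 777)],
    ['id_client', 'id_sku'])

# row_number() only needs a partition and an ordering; no rowsBetween frame is required
w = Window.partitionBy('id_client').orderBy('id_sku')

df = df_model.withColumn('id_sum', f.row_number().over(w))
df.show()

With this window, each id_client partition gets row numbers 1, 2, ..., which matches the desired id_sum column; neither the rowsBetween frame nor the collect() of the first id_client is needed.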