I have the following pyspark dataframe df_model:
id_client | id_sku |
---|---|
1111 | 4444 |
1111 | 4444 |
2222 | 6666 |
2222 | 6666 |
3333 | 777 |
And I use this code to generate the frequency column id_sum:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

t = df_model.collect()[0][0]
w = Window.partitionBy('id_client').rowsBetween(Window.unboundedPreceding, 0).orderBy('col')
df = df_model.withColumn('id_sum',
                         f.sum(f.when(f.col('id_client') != t, 1)
                                .otherwise(0))
                          .over(w))
And my output is:
id_client | id_sku | id_sum |
---|---|---|
1111 | 4444 | 0 |
1111 | 4444 | 0 |
2222 | 6666 | 1 |
2222 | 6666 | 2 |
3333 | 777 | 1 |
But I want to obtain the following result:
id_client | id_sku | id_sum |
---|---|---|
1111 | 4444 | 1 |
1111 | 4444 | 2 |
2222 | 6666 | 1 |
2222 | 6666 | 2 |
3333 | 777 | 1 |
My question is: what's wrong with the code?
Actually, I'm trying to use a window function, and my code is like this:
t = df_model.collect()[0][0]
w = Window.partitionBy('id_client').rowsBetween(Window.unboundedPreceding, 0).orderBy('id_sku')
df = df_model.withColumn('id_sum',
                         f.sum(f.when(f.col('id_client') != t, 1)
                                .otherwise(0))
                          .over(w))
CodePudding user response:
You can try something like this:
df_model.withColumn("id_sum", row_number().over(w))
The row_number() window function assigns a sequential row number, starting from 1, to each row within its window partition.
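Putting it together, here is a minimal sketch. The sample data is recreated from the question's table, and the window w is partitioned by id_client and ordered by id_sku, as in the question's second snippet:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample data matching the question's df_model
df_model = spark.createDataFrame(
    [(1111, 4444), (1111, 4444), (2222, 6666), (2222, 6666), (3333, 777)],
    ['id_client', 'id_sku'])

# row_number() only needs a partition and an ordering; no rowsBetween frame is required
w = Window.partitionBy('id_client').orderBy('id_sku')

df = df_model.withColumn('id_sum', f.row_number().over(w))
df.show()

With this window, each id_client partition gets row numbers 1, 2, ..., which matches the desired id_sum column; neither the rowsBetween frame nor the collect() of the first id_client is needed.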