create a new column with unique value respective to other value in pyspark


I have a DataFrame with a column item_id.

Below is the sample DataFrame:

+-----------+
|item_id    |
+-----------+
|     BA2C31|
|     BA2C31|
|     B4D456|
|     B4D456|
|     EDJJ88|
+-----------+

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(0, 'BA2C31'),
     (1, 'BA2C31'),
     (2, 'B4D456'),
     (3, 'B4D456'),
     (4, 'EDJJ88')],
    ['id', 'item_id'])

I need to create a column whose value is unique per item_id: rows with the same item_id should get the same value, and different item_ids should get different values. This is what I tried:

from pyspark.sql.functions import col, sha2

# sha2() takes the bit length as its second argument
df.withColumn("u_id", sha2(col("item_id"), 256)).show(10, False)

Desired output:

+-----------+----+
|item_id    |u_id|
+-----------+----+
|     BA2C31| 101|
|     BA2C31| 101|
|     B4D456| 102|
|     B4D456| 102|
|     EDJJ88| 103|
+-----------+----+

I am using withColumn, but I am not getting the desired output.
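As an aside, the sha2 approach does give one stable value per item_id (identical strings always hash to the same digest); it just produces 64-character hex strings rather than small integers like the desired output. The property can be sketched in plain Python with the standard hashlib module:

```python
import hashlib

items = ["BA2C31", "BA2C31", "B4D456", "B4D456", "EDJJ88"]

# Equal inputs always produce equal SHA-256 digests, so each
# item_id maps to one stable (but long) hex value.
digests = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in items]

assert digests[0] == digests[1]   # same item_id  -> same digest
assert digests[0] != digests[2]   # different id  -> different digest
assert len(digests[0]) == 64      # hex string, not a small integer
```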

CodePudding user response:

You can use dense_rank():

from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Note: a Window with orderBy but no partitionBy pulls all rows into
# a single partition, which Spark warns about on large data.
df.withColumn("u_id", dense_rank().over(Window.orderBy("item_id")) + 100).show()

+---+-------+----+
| id|item_id|u_id|
+---+-------+----+
|  2| B4D456| 101|
|  3| B4D456| 101|
|  0| BA2C31| 102|
|  1| BA2C31| 102|
|  4| EDJJ88| 103|
+---+-------+----+
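Outside Spark, the same dense-rank-plus-offset idea can be illustrated in plain Python: rank the sorted distinct item_ids and add 100. (This is a sketch of the logic, not Spark's internals; note that 'B4D456' sorts before 'BA2C31' because '4' < 'A', which is why it gets 101 in the output above.)

```python
items = ["BA2C31", "BA2C31", "B4D456", "B4D456", "EDJJ88"]

# dense_rank over the sorted distinct values, offset by 100:
# rank 1 -> 101, rank 2 -> 102, rank 3 -> 103
rank = {v: i + 101 for i, v in enumerate(sorted(set(items)))}

u_ids = [rank[v] for v in items]
print(u_ids)  # [102, 102, 101, 101, 103]
```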