I have a dataframe with a column item_id.
Below is the sample dataframe:
+-------+
|item_id|
+-------+
| BA2C31|
| BA2C31|
| B4D456|
| B4D456|
| EDJJ88|
+-------+
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(0, 'BA2C31'),
(1, 'BA2C31'),
(2, 'B4D456'),
(3, 'B4D456'),
(4, 'EDJJ88')],
['id', 'item_id'])
I need to create a column with a unique value per item_id; rows with the same item_id should share the same value.
from pyspark.sql.functions import col, sha2, concat
df.withColumn("u_id", sha2(col("item_id"), 256)).show(10, False)
Desired output:
+-------+----+
|item_id|u_id|
+-------+----+
| BA2C31| 101|
| BA2C31| 101|
| B4D456| 102|
| B4D456| 102|
| EDJJ88| 103|
+-------+----+
I am using withColumn, but I am not getting the desired output.
Answer:
You can use dense_rank() over a window ordered by item_id and add an offset:
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

df.withColumn("u_id", dense_rank().over(Window.orderBy("item_id")) + 100).show()
+---+-------+----+
| id|item_id|u_id|
+---+-------+----+
|  2| B4D456| 101|
|  3| B4D456| 101|
|  0| BA2C31| 102|
|  1| BA2C31| 102|
|  4| EDJJ88| 103|
+---+-------+----+
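
One caveat: Window.orderBy with no partitionBy moves every row onto a single partition, so Spark will log a performance warning on large data. Below is a minimal sketch of a variant (same column names as above; the + 100 offset is only there to match the desired output) that ranks just the distinct item_ids and joins the mapping back:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# rank only the distinct item_ids (a much smaller dataset than df itself),
# then join the item_id -> u_id mapping back onto the original dataframe
mapping = (df.select("item_id").distinct()
             .withColumn("u_id", row_number().over(Window.orderBy("item_id")) + 100))

df.join(mapping, on="item_id", how="left").show()

The single-partition window still exists here, but only over the distinct item_ids. If the IDs do not need to be sequential, a deterministic hash per item_id also works without any ranking, e.g. the sha2 call from the question written as sha2(col("item_id"), 256).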