Spark SQL: How to assign the same ID for a group of rows


I have a dataset with users and their actions:

User Action
John logged in
John did smth
John logged out
John logged in
John did smth
John logged out
Patric logged in
Patric did smth
Patric logged out

I want to assign a session_id for every action between logging in and out:

User Action Session_id
John logged in 1
John did smth 1
John logged out 1
John logged in 2
John did smth 2
John logged out 2
Patric logged in 3
Patric did smth 3
Patric logged out 3

I think I should use the lag window function, but I can't figure out how exactly.

CodePudding user response:

This is one of the solutions that does not use lag or lead. Let us assume that ds is our main dataset.
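For reference, the input can be reproduced with a minimal sketch like this (it assumes an existing SparkSession named spark; the column names user and action are taken from the question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// ds is a var because the steps below reassign it
var ds = Seq(
  ("John", "logged in"), ("John", "did smth"), ("John", "logged out"),
  ("John", "logged in"), ("John", "did smth"), ("John", "logged out"),
  ("Patric", "logged in"), ("Patric", "did smth"), ("Patric", "logged out")
).toDF("user", "action")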

ds = ds.withColumn("id", row_number().over(Window.orderBy("user").partitionBy("user")))

Then we declare these window variables for the sake of clarity:

val sumWindow = Window.partitionBy("user").orderBy("id")
// No partitionBy here, so Spark warns about moving all data to a single partition;
// that is fine for a small dataset but can be a bottleneck on a large one.
val sessionWindow = Window.orderBy("user", "inter")

Then we create a cumulative count per user, ordered by id, and finally create the session column:

ds = ds
  .withColumn("inter",
    // increases by 1 on every "logged in" row, i.e. at the start of each session
    sum(when(col("action").equalTo("logged in"), 1).otherwise(0)).over(sumWindow)
  )
  .withColumn("session", dense_rank().over(sessionWindow))

Final output:

+------+----------+---+-----+-------+
|user  |action    |id |inter|session|
+------+----------+---+-----+-------+
|John  |logged in |1  |1    |1      |
|John  |did smth  |2  |1    |1      |
|John  |logged out|3  |1    |1      |
|John  |logged in |4  |2    |2      |
|John  |did smth  |5  |2    |2      |
|John  |logged out|6  |2    |2      |
|Patric|logged in |1  |1    |3      |
|Patric|did smth  |2  |1    |3      |
|Patric|logged out|3  |1    |3      |
+------+----------+---+-----+-------+

I kept inter and id in the output so you can see what happens; you can drop them afterwards. Hope this is useful, good luck!
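As an aside, since the question mentions lag: the inter column can also be computed with lag by flagging the rows where a new session starts. This is only a sketch under the same assumptions, reusing the id column as the per-user ordering:

val byUser = Window.partitionBy("user").orderBy("id")

ds = ds.withColumn("inter",
  sum(
    // a session starts on the user's first row or right after a "logged out"
    when(lag("action", 1).over(byUser).isNull
      || lag("action", 1).over(byUser).equalTo("logged out"), 1).otherwise(0)
  ).over(byUser)
)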
