I have a dataset with users and their actions:
User | Action |
---|---|
John | logged in |
John | did smth |
John | logged out |
John | logged in |
John | did smth |
John | logged out |
Patric | logged in |
Patric | did smth |
Patric | logged out |
I want to assign a session_id for every action between logging in and out:
User | Action | Session_id |
---|---|---|
John | logged in | 1 |
John | did smth | 1 |
John | logged out | 1 |
John | logged in | 2 |
John | did smth | 2 |
John | logged out | 2 |
Patric | logged in | 3 |
Patric | did smth | 3 |
Patric | logged out | 3 |
I think I should use lag window function, but I can't figure out how exactly.
CodePudding user response:
This is one of the solutions without using lag
or lead
; let us assume that ds
is our main dataset; first, we have to create a unique ID:
ds = ds.withColumn("id", row_number().over(Window.orderBy("user").partitionBy("user")))
Then we declare these window variables for the sake of clearance:
val sumWindow = Window.partitionBy("user").orderBy("id")
val sessionWindow = Window.orderBy("user", "inter")
Then, we create a cumulative count per user
ordering by id
, then finally, create the session
column:
ds = ds
.withColumn("inter",
sum(when(col("action").equalTo("logged in"), 1).otherwise(0)).over(sumWindow)
)
.withColumn("session", dense_rank().over(sessionWindow))
Final output:
------ ---------- --- ----- -------
|user |action |id |inter|session|
------ ---------- --- ----- -------
|John |logged in |1 |1 |1 |
|John |did smth |2 |1 |1 |
|John |logged out|3 |1 |1 |
|John |logged in |4 |2 |2 |
|John |did smth |5 |2 |2 |
|John |logged out|6 |2 |2 |
|Patric|logged in |1 |1 |3 |
|Patric|did smth |2 |1 |3 |
|Patric|logged out|3 |1 |3 |
------ ---------- --- ----- -------
I kept inter
and id
so you can see what happens, hope this is useful, good luck!