How to shift a column based on other columns in pyspark


I need to shift Col4 based on columns Col2 and Col3 in a PySpark dataframe: Col4's value should shift whenever Col2 changes between consecutive rows. Col3 mainly tracks new sequences of Col2 values, and Col1 should also partition the final output. The output should look like the shift_col4 column below.

ID  Col1 Col2 Col3 Col4 shift_col4
1    1   10   1    4     null
2    1   11   1    8     4
3    1   12   1   12     8
4    1   1    2   16     12
5    1   3    2   20     16
4    2   1    1   16     null
5    2   4    1   20     16
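
For anyone who wants to reproduce this, here is a minimal sketch of the sample dataframe built from the table above (the SparkSession setup is assumed; shift_col4 is the desired output, not an input):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question's table
df = spark.createDataFrame(
    [(1, 1, 10, 1, 4), (2, 1, 11, 1, 8), (3, 1, 12, 1, 12),
     (4, 1, 1, 2, 16), (5, 1, 3, 2, 20),
     (4, 2, 1, 1, 16), (5, 2, 4, 1, 20)],
    ['ID', 'Col1', 'Col2', 'Col3', 'Col4'])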

CodePudding user response:

I'm not sure whether you only need to check that Col2 changes in order to shift (push) Col4 (more info would help), but the code below pushes the value of Col4 forward every time Col2 changes.

from pyspark.sql import Window
import pyspark.sql.functions as f

# Within each Col1 partition (ordered by ID), take the previous row's Col4
# whenever Col2 differs from the previous row's Col2; otherwise leave null.
# On the first row of a partition, lag() returns null, the comparison is null,
# and when() treats that as false, so shift_4 is null there as expected.
w = Window.partitionBy('Col1').orderBy('ID')
df = df.withColumn('shift_4', f.when(f.lag('Col2').over(w) != f.col('Col2'), f.lag('Col4').over(w)))
df.show()
+---+----+----+----+----+-------+
| ID|Col1|Col2|Col3|Col4|shift_4|
+---+----+----+----+----+-------+
|  1|   1|  10|   1|   4|   null|
|  2|   1|  11|   1|   8|      4|
|  3|   1|  12|   1|  12|      8|
|  4|   1|   1|   2|  16|     12|
|  5|   1|   3|   2|  20|     16|
|  4|   2|   1|   1|  16|   null|
|  5|   2|   4|   1|  20|     16|
+---+----+----+----+----+-------+

If you need to keep track of Col2 sequences and only shift at the break between them, you can build another window expression that detects the breaks (with rank, or a running sum of change flags) and include it in the logic of the when function. This question might help: Counting consecutive occurrences of a specific value in PySpark. A sketch of that idea follows.
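
A minimal sketch of the flag-and-running-sum approach from that linked question, assuming the dataframe above (the col2_changed and col2_grp column names are made up for illustration): flag each row where Col2 differs from the previous row, then take a running sum of the flags so every consecutive run of the same Col2 value gets its own group id.

from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy('Col1').orderBy('ID')

# 1 where Col2 differs from the previous row, 0 otherwise (the first row of a
# partition gets 0, since lag() is null there and when() treats null as false).
df = df.withColumn('col2_changed',
                   f.when(f.lag('Col2').over(w) != f.col('Col2'), 1).otherwise(0))

# Running sum of the flags: each consecutive run of the same Col2 value within
# a Col1 partition shares one group id.
df = df.withColumn('col2_grp', f.sum('col2_changed').over(w))

With col2_grp in place, you can feed it into the when condition so Col4 is only pushed forward at the start of a new group rather than on every row.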
