Create a column with value from another column based on condition-CodePudding

I have a dataframe:

 --- --- --- ------ 
| id|foo|bar|rownum|
 --- --- --- ------ 
|  1|123|123|     1|
|  2|000|236|     1|
|  2|236|236|     2|
|  2|000|236|     3|
|  3|333|234|     1|
|  3|444|444|     2|
 --- --- --- ------

I want to add a column match which will hold the rownum where foo==bar, like:

 --- --- --- ------ ---- 
| id|foo|bar|rownum|match
 --- --- --- ------ ---- 
|  A|123|123|     1|   1|
|  B|000|236|     1|   2|
|  B|236|236|     2|   2|
|  B|000|236|     3|   2|
|  R|333|234|     1|   2|
|  R|444|444|     2|   2|
 --- --- --- ------ ----

I tried this:

df_grp2 = df_grp2.withColumn('match',when(F.col('foo')==F.col('bar'), F.col('rownum')))

CodePudding user response：

Try using window functions.

from pyspark.sql import functions as F, Window as W

df_grp2 = spark.createDataFrame(
    [(1, '123', '123', 1),
     (2, '000', '236', 1),
     (2, '236', '236', 2),
     (2, '000', '236', 3),
     (3, '333', '234', 1),
     (3, '444', '444', 2)],
    ['id', 'foo', 'bar', 'rownum']
)

df_grp2 = df_grp2.withColumn(
    'match',
    F.first(F.when(F.col('foo') == F.col('bar'), F.col('rownum')), True).over(W.partitionBy('id'))
)

df_grp2.show()
#  --- --- --- ------ ----- 
# | id|foo|bar|rownum|match|
#  --- --- --- ------ ----- 
# |  1|123|123|     1|    1|
# |  2|000|236|     1|    2|
# |  2|236|236|     2|    2|
# |  2|000|236|     3|    2|
# |  3|333|234|     1|    2|
# |  3|444|444|     2|    2|
#  --- --- --- ------ -----

CodePudding user response：

use this :

df['match'] = df.loc[df['foo'] == df['bar']]['rownum']

But if they are not match it returns 'NAN'

 --- --- --- ------ ---- 
| id|foo|bar|rownum|match
 --- --- --- ------ ---- 
|  A|123|123|     1|   1|
|  B|000|236|     1| NAN|
|  B|236|236|     2|   2|
|  B|000|236|     3| NAN|
|  R|333|234|     1| NAN|
|  R|444|444|     2|   2|
 --- --- --- ------ ----