I have a dataframe:
--- --- --- ------
| id|foo|bar|rownum|
--- --- --- ------
| 1|123|123| 1|
| 2|000|236| 1|
| 2|236|236| 2|
| 2|000|236| 3|
| 3|333|234| 1|
| 3|444|444| 2|
--- --- --- ------
I want to add a column match
which will hold the rownum
where foo==bar
, like:
--- --- --- ------ ----
| id|foo|bar|rownum|match
--- --- --- ------ ----
| A|123|123| 1| 1|
| B|000|236| 1| 2|
| B|236|236| 2| 2|
| B|000|236| 3| 2|
| R|333|234| 1| 2|
| R|444|444| 2| 2|
--- --- --- ------ ----
I tried this:
df_grp2 = df_grp2.withColumn('match',when(F.col('foo')==F.col('bar'), F.col('rownum')))
CodePudding user response:
Try using window functions.
from pyspark.sql import functions as F, Window as W
df_grp2 = spark.createDataFrame(
[(1, '123', '123', 1),
(2, '000', '236', 1),
(2, '236', '236', 2),
(2, '000', '236', 3),
(3, '333', '234', 1),
(3, '444', '444', 2)],
['id', 'foo', 'bar', 'rownum']
)
df_grp2 = df_grp2.withColumn(
'match',
F.first(F.when(F.col('foo') == F.col('bar'), F.col('rownum')), True).over(W.partitionBy('id'))
)
df_grp2.show()
# --- --- --- ------ -----
# | id|foo|bar|rownum|match|
# --- --- --- ------ -----
# | 1|123|123| 1| 1|
# | 2|000|236| 1| 2|
# | 2|236|236| 2| 2|
# | 2|000|236| 3| 2|
# | 3|333|234| 1| 2|
# | 3|444|444| 2| 2|
# --- --- --- ------ -----
CodePudding user response:
use this :
df['match'] = df.loc[df['foo'] == df['bar']]['rownum']
But if they are not match it returns 'NAN'
--- --- --- ------ ----
| id|foo|bar|rownum|match
--- --- --- ------ ----
| A|123|123| 1| 1|
| B|000|236| 1| NAN|
| B|236|236| 2| 2|
| B|000|236| 3| NAN|
| R|333|234| 1| NAN|
| R|444|444| 2| 2|
--- --- --- ------ ----