I have a dataframe with a column named 'Year'. I want to check whether consecutive rows of that column match, and update another column, 'FLAG', with the value 100 wherever a row's Year equals the previous row's Year.
df_prod
Year FLAG
2020 None
2020 None
2019 None
2021 None
2021 None
2022 None
Expected Output:
Year FLAG
2019 None
2020 None
2020 100
2021 None
2021 100
2022 None
CodePudding user response:
The following snippet, which uses a window function, should do that for you:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag, when

df = spark.createDataFrame([(2020, None), (2020, None), (2019, None), (2021, None), (2021, None), (2022, None)], "Year: int, FLAG: int")

# Order all rows by Year; with no partitionBy columns, Spark moves everything
# into a single partition, which is fine for small data but won't scale
window = Window.partitionBy().orderBy("Year")

# Set FLAG to 100 when a row's Year equals the previous row's Year,
# otherwise leave it null (when() with no otherwise() defaults to null)
df.withColumn("FLAG", when(col("Year") == lag(col("Year")).over(window), 100)).show()
+----+----+
|Year|FLAG|
+----+----+
|2019|null|
|2020|null|
|2020| 100|
|2021|null|
|2021| 100|
|2022|null|
+----+----+
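If you're on plain pandas rather than Spark, the same idea works with `shift()`, which plays the role of `lag()`. A minimal sketch, using made-up sample data mirroring the question:

```python
import pandas as pd

# Hypothetical sample data matching the question's input
df = pd.DataFrame({"Year": [2020, 2020, 2019, 2021, 2021, 2022],
                   "FLAG": [None] * 6})

# Sort so duplicate years are adjacent, then compare each Year
# with the previous row's Year and flag the matches with 100
df = df.sort_values("Year", ignore_index=True)
df.loc[df["Year"].eq(df["Year"].shift()), "FLAG"] = 100

print(df)
```

`shift()` returns the column displaced by one row (the first row becomes NaN, so it never matches), which gives the same null/100 pattern as the Spark output above.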