I have the dataframe provided below. I need to create a new column with label as 'New'
for the first occurrence of a product if it appeared only in the year 2021, and 'old'
otherwise. I am using pyspark.
Input:
desc year
0 a 2019
1 a 2020
2 a 2020
3 b 2020
4 b 2019
5 b 2021
6 c 2021
7 a 2021
8 c 2021
9 e 2020
Output:
desc year Label
0 a 2019 old
1 a 2020 old
2 a 2020 old
3 b 2020 old
4 b 2019 old
5 b 2021 old
6 c 2021 New
7 a 2021 old
8 c 2021 New
9 e 2020 old
If the product appeared for the first time in the year 2021 with no history then we need to label the product as New.
Data:
df = spark.createDataFrame(
[
('a','2019')
,('a','2020')
,('a','2020')
,('b','2020')
,('b','2019')
,('b','2021')
,('c','2021')
,('a','2021')
,('c','2021')
,('e','2020')
], ['desc', 'year'])
CodePudding user response:
You could compute the min of year
over a window partitioned by product (desc
), and then check that the result is equal to 2021
. Finally, to get the desired label instead of a boolean, you may use the when
function.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
win = Window.partitionBy('desc')
df.withColumn("New", f.min(f.col("year")).over(win) == "2021")\
.withColumn("New", f.when(f.col("New"), "New").otherwise("old"))\
.show()
---- ---- ---
|desc|year|New|
---- ---- ---
| a|2019|old|
| a|2020|old|
| a|2020|old|
| a|2021|old|
| b|2020|old|
| b|2019|old|
| b|2021|old|
| c|2021|New|
| c|2021|New|
| e|2020|old|
---- ---- ---
CodePudding user response:
I think I was answered the same question but cannot find. Anyway, you can use the first
function to find the first year and just ckeck whether it is 2021
or not.
from pyspark.sql import functions as f
from pyspark.sql import Window
w = Window.partitionBy('desc')
df.withColumn('label', f.when(f.first('year').over(w) == f.lit('2021'), f.lit('new')).otherwise(f.lit('old'))) \
.show()
---- ---- -----
|desc|year|label|
---- ---- -----
| e|2020| old|
| c|2021| new|
| c|2021| new|
| b|2020| old|
| b|2019| old|
| b|2021| old|
| a|2019| old|
| a|2020| old|
| a|2020| old|
| a|2021| old|
---- ---- -----