Home > Back-end >  How to identify first appearance of items in Pyspark
How to identify first appearance of items in Pyspark

Time:12-01

I have the dataframe provided below. I need to create a new column with label as 'New' for the first occurrence of a product if it appeared only in the year 2021, and 'old' otherwise. I am using pyspark.

Input:

desc  year
0  a  2019
1  a  2020
2  a  2020
3  b  2020
4  b  2019
5  b  2021
6  c  2021
7  a  2021
8  c  2021
9  e  2020

Output:

   desc  year    Label
0  a  2019      old
1  a  2020      old
2  a  2020      old
3  b  2020      old
4  b  2019      old
5  b  2021      old
6  c  2021      New
7  a  2021      old
8  c  2021      New
9  e  2020      old

If the product appeared for the first time in the year 2021 with no history then we need to label the product as New.

Data:

df = spark.createDataFrame(
  [
     ('a','2019')
    ,('a','2020')
    ,('a','2020')
    ,('b','2020')
    ,('b','2019')
    ,('b','2021')
    ,('c','2021')
    ,('a','2021')
    ,('c','2021')
    ,('e','2020')
  ], ['desc', 'year'])

CodePudding user response:

You could compute the min of year over a window partitioned by product (desc), and then check that the result is equal to 2021. Finally, to get the desired label instead of a boolean, you may use the when function.

from pyspark.sql.window import Window
import pyspark.sql.functions as f
win = Window.partitionBy('desc')
df.withColumn("New", f.min(f.col("year")).over(win) == "2021")\
  .withColumn("New", f.when(f.col("New"), "New").otherwise("old"))\
  .show()
 ---- ---- --- 
|desc|year|New|
 ---- ---- --- 
|   a|2019|old|
|   a|2020|old|
|   a|2020|old|
|   a|2021|old|
|   b|2020|old|
|   b|2019|old|
|   b|2021|old|
|   c|2021|New|
|   c|2021|New|
|   e|2020|old|
 ---- ---- --- 

CodePudding user response:

I think I was answered the same question but cannot find. Anyway, you can use the first function to find the first year and just ckeck whether it is 2021 or not.

from pyspark.sql import functions as f
from pyspark.sql import Window
w = Window.partitionBy('desc')

df.withColumn('label', f.when(f.first('year').over(w) == f.lit('2021'), f.lit('new')).otherwise(f.lit('old'))) \
  .show()

 ---- ---- ----- 
|desc|year|label|
 ---- ---- ----- 
|   e|2020|  old|
|   c|2021|  new|
|   c|2021|  new|
|   b|2020|  old|
|   b|2019|  old|
|   b|2021|  old|
|   a|2019|  old|
|   a|2020|  old|
|   a|2020|  old|
|   a|2021|  old|
 ---- ---- ----- 
  • Related