Create boolean if all the months in a year are included in a column - Pyspark

Time:03-16

I want to add a boolean column that is True when the dates for a given id cover all 12 months of a year.

example:

id      date
a   2021-01-01
a   2021-02-01
...
a   2021-12-01
b   2021-02-01
b   2021-04-01

The expected output would look like:

id     date        full_year
a   2021-01-01        yes
a   2021-02-01        yes
...                   ...
a   2021-12-01        yes
b   2021-02-01         no
b   2021-04-01         no

CodePudding user response:

Imports:

from pyspark.sql import functions as F, Window as W

Code:

# Window over each id and calendar year
w = W.partitionBy("id", F.year("date"))
out = (
    sdf.withColumn("date", F.to_date("date"))
    # number of distinct year-months seen for this id and year
    .withColumn(
        "CountYearMonth",
        F.size(F.collect_set(F.date_format("date", "yyyyMM")).over(w)),
    )
    .withColumn("full_year", F.when(F.col("CountYearMonth") == 12, "yes").otherwise("No"))
    .drop("CountYearMonth")
)

Logic:

  1. Create a window (w) partitioned by id and the year of the date column
  2. Convert the date column to an actual date type (skip this if it is already a date)
  3. Collect the set of distinct yyyyMM values over the window (w) and take its size
  4. If the size == 12, assign yes, else assign No
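The four steps above can also be traced without Spark. The following is a minimal plain-Python sketch of the same grouping logic, handy for checking expectations on a few rows. The function name `flag_full_years` and the tuple-based row layout are illustrative assumptions, not part of the answer.

```python
from collections import defaultdict
from datetime import date


def flag_full_years(rows):
    """Return (id, date, full_year) tuples; full_year is "yes" when that
    id has all 12 months present in the date's calendar year, else "No"."""
    # Steps 1 and 3: group by (id, year) and collect the distinct months.
    months_seen = defaultdict(set)
    for row_id, d in rows:
        months_seen[(row_id, d.year)].add(d.month)
    # Step 4: a group is complete when it holds 12 distinct months.
    return [
        (row_id, d, "yes" if len(months_seen[(row_id, d.year)]) == 12 else "No")
        for row_id, d in rows
    ]


# Hypothetical sample mirroring the question: "a" has every month, "b" has two.
rows = [("a", date(2021, m, 1)) for m in range(1, 13)]
rows += [("b", date(2021, 2, 1)), ("b", date(2021, 4, 1))]
result = flag_full_years(rows)
```

Each row keeps its original position, and every row of a group gets the same flag, which is what the window function does in the Spark version.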

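Alternatively, you can replace the size of collect_set with approx_count_distinct. Note that it returns an estimate rather than an exact count, so the collect_set version remains the exact option: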

w = W.partitionBy("id", F.year("date"))
out = (
    sdf.withColumn("date", F.to_date("date"))
    .withColumn(
        "CountYearMonth",
        F.approx_count_distinct(F.date_format("date", "yyyyMM")).over(w),
    )
    .withColumn("full_year", F.when(F.col("CountYearMonth") == 12, "yes").otherwise("No"))
    .drop("CountYearMonth")
)

Sample output:

+---+----------+---------+
|id |date      |full_year|
+---+----------+---------+
|a  |2021-01-31|yes      |
|a  |2021-02-28|yes      |
|a  |2021-03-31|yes      |
|a  |2021-04-30|yes      |
|a  |2021-05-31|yes      |
|a  |2021-06-30|yes      |
|a  |2021-07-31|yes      |
|a  |2021-08-31|yes      |
|a  |2021-09-30|yes      |
|a  |2021-10-31|yes      |
|a  |2021-11-30|yes      |
|a  |2021-12-31|yes      |
|a  |2022-01-31|No       |
|a  |2022-02-28|No       |
|a  |2022-03-31|No       |
|a  |2022-04-30|No       |
|b  |2021-01-31|No       |
|b  |2021-02-28|No       |
|b  |2021-03-31|No       |
|b  |2021-04-30|No       |
|b  |2021-05-31|No       |
|b  |2021-06-30|No       |
+---+----------+---------+