PySpark Data Frame, Convert Full Month Name to Int and then concat with year column and Day number to build start/end of month dates


I am working with PySpark. I have a PySpark dataframe with a Year column and a Month column, where the month is fully spelled out.

Ex:

Year Month
2022 January
2021 December

I want to make two new date columns (StartOfMonth and EndOfMonth). In order to do so I need to: a) convert "January" to an integer (how do I do that?); b) concatenate it with the Year and the day number 1 to generate a StartOfMonth column, e.g. 2022-01-01; c) concatenate the Year, the integer version of the month, and the last day number for that month, e.g. 2022-01-31.

Is there a way I can do this with a function like unix_timestamp and/or from_unixtime, or some other functionality?

sourceDF.withColumn("mnth_num",from_unixtime(unix_timestamp(col("Month"),'%B'),'MM')).show()

I tried the above to convert the month to an integer, but it did not work.

CodePudding user response:

There are a bunch of functions in pyspark.sql.functions that can get you there! Note that Spark uses Java-style datetime patterns, so the full month name is 'MMMM' rather than the Python strftime code '%B' you tried. One possibility is the following:

from pyspark.sql.functions import from_unixtime, unix_timestamp, col, concat_ws, lit, last_day

df = spark.createDataFrame([("2022", "January"), ("2021", "December")], ["Year", "Month"])

# Parse the full month name ('MMMM') into a timestamp, then format it back
# as a two-digit month number ('MM'); month_name therefore holds "01", "12", etc.
df2 = df.withColumn("month_name", from_unixtime(unix_timestamp(col("Month"), 'MMMM'), 'MM')) \
        .withColumn("start_date", concat_ws("-", col("Year"), col("month_name"), lit("01"))) \
        .withColumn("end_date", last_day(col("start_date")))

df2.show()
+----+--------+----------+----------+----------+
|Year|   Month|month_name|start_date|  end_date|
+----+--------+----------+----------+----------+
|2022| January|        01|2022-01-01|2022-01-31|
|2021|December|        12|2021-12-01|2021-12-31|
+----+--------+----------+----------+----------+

So we used:

  • from_unixtime and unix_timestamp to parse the month name and reformat it as a two-digit month number
  • concat_ws and lit to concatenate the string columns with each other (adding a literal "01" string for the day)
  • last_day to calculate the last day of the month of our start_date
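
If you would rather skip the unix_timestamp round-trip, a possible alternative is to parse the year and month name directly with to_date. This is only a sketch and assumes Spark 3.x, where to_date fills in day 1 when the pattern omits the day-of-month:

from pyspark.sql.functions import to_date, concat_ws, col, last_day

# Parse e.g. "2022 January" straight into a date; the missing day defaults to 1
df3 = df.withColumn("start_date", to_date(concat_ws(" ", col("Year"), col("Month")), "yyyy MMMM")) \
        .withColumn("end_date", last_day(col("start_date")))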

Hope this helps!

CodePudding user response:

You can create a dataframe containing all months with their respective order, join it to your dataframe, then use Spark's last_day function to get the last day of the month:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, last_day, lit, lower

spark = SparkSession.builder.master("local[*]").getOrCreate()

data = [
    ["2022", "January"],
    ["2021", "December"],
]
df = spark.createDataFrame(data).toDF("year", "month")

months_data = [["january", "01"], ["february", "02"], ["march", "03"], ["april", "04"], ["may", "05"],
               ["june", "06"], ["july", "07"], ["august", "08"], ["september", "09"], ["october", "10"],
               ["november", "11"], ["december", "12"]]
monthsDf = spark.createDataFrame(months_data).toDF("month_lib", "month_number")

# Join on the lower-cased month name, then build the first-of-month date
result = df.join(monthsDf, lower(df.month) == lower(monthsDf.month_lib), "left") \
    .withColumn("start_date", concat(col("year"), lit("-"), col("month_number"), lit("-01"))) \
    .withColumn("end_of_month", last_day(col("start_date"))) \
    .select("year", "month", "start_date", "end_of_month")

result.show()
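
For the sample data this should show something like:

+----+--------+----------+------------+
|year|   month|start_date|end_of_month|
+----+--------+----------+------------+
|2022| January|2022-01-01|  2022-01-31|
|2021|December|2021-12-01|  2021-12-31|
+----+--------+----------+------------+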

CodePudding user response:

I am not aware of any native PySpark functionality that would allow that, but you can create a mapping for it. Also, if you are only going to use it for concatenation with the year and day, you don't need an integer. You can do something like this:

from pyspark.sql.functions import when, concat, col, lit

df = df.withColumn(
    "MonthValue",
    when(col("Month") == "January", "01")
    .when(col("Month") == "February", "02")
    .when(col("Month") == "March", "03")
    .when(col("Month") == "April", "04")
    .when(col("Month") == "May", "05")
    .when(col("Month") == "June", "06")
    .when(col("Month") == "July", "07")
    .when(col("Month") == "August", "08")
    .when(col("Month") == "September", "09")
    .when(col("Month") == "October", "10")
    .when(col("Month") == "November", "11")
    .when(col("Month") == "December", "12")
    .otherwise(None)
)

Then you can concatenate it like this:

df = df.withColumn(
    "YearMonth",
    # lit("-") is required here: concat takes Column arguments, and a bare
    # string would be interpreted as a column name
    concat(col("Year"), lit("-"), col("MonthValue"))
)
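
If you also need the StartOfMonth and EndOfMonth columns from the question, you can finish with the same last_day trick as in the other answers. A minimal sketch building on the YearMonth column above:

from pyspark.sql.functions import last_day

# Append a literal day 1, then let last_day resolve the month's final day
df = df.withColumn("StartOfMonth", concat(col("YearMonth"), lit("-01"))) \
       .withColumn("EndOfMonth", last_day(col("StartOfMonth")))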