I am working with PySpark. I have a PySpark dataframe with a Year column and a Month column, where the month is fully spelled out.
Ex:

| Year | Month    |
|------|----------|
| 2022 | January  |
| 2021 | December |
I want to make two new date columns (StartOfMonth and EndOfMonth). In order to do so I need to: a) convert January to an integer — how do I do that? b) concatenate it with the Year and the day number 1 to generate a StartOfMonth value, e.g. 2022-01-01; c) concatenate the Year, the integer version of the month, and the last day number for that month, e.g. 2022-01-31.
Is there a way I can do this with a function like `unix_timestamp` and/or `from_unixtime`, or some other functionality?
```python
sourceDF.withColumn("mnth_num", from_unixtime(unix_timestamp(col("Month"), '%B'), 'MM')).show()
```

I have tried the above for converting the month to an integer, but that did not work.
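For reference, a plain-Python illustration (not Spark code) of why that attempt fails: `%B` is a C/strftime-style directive, which is what Python's `datetime.strptime` understands, whereas Spark's `unix_timestamp` expects Java-style datetime patterns, where the full month name is written `MMMM`:

```python
from datetime import datetime

# %B is a strftime/strptime directive (Python/C), not a Spark pattern letter.
# Parsing "January" with %B yields month number 1 in plain Python:
month_number = datetime.strptime("January", "%B").month
print(month_number)  # -> 1

# In Spark's Java-style patterns, the equivalent of %B is 'MMMM'.
```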
CodePudding user response:
There's a bunch of functions in `pyspark.sql.functions` that can get you there! One possibility is the following:
```python
from pyspark.sql.functions import from_unixtime, unix_timestamp, col, concat_ws, lit, last_day

df = spark.createDataFrame([("2022", "January"), ("2021", "December")], ["Year", "Month"])

df2 = df.withColumn("month_name", from_unixtime(unix_timestamp(col("Month"), 'MMMM'), 'MM')) \
    .withColumn("start_date", concat_ws("-", col("Year"), col("month_name"), lit("01"))) \
    .withColumn("end_date", last_day(col("start_date")))

df2.show()
```
```
+----+--------+----------+----------+----------+
|Year|   Month|month_name|start_date|  end_date|
+----+--------+----------+----------+----------+
|2022| January|        01|2022-01-01|2022-01-31|
|2021|December|        12|2021-12-01|2021-12-31|
+----+--------+----------+----------+----------+
```
So we used:

- `from_unixtime` and `unix_timestamp` to turn the month name into its two-digit number
- `concat_ws` and `lit` to concatenate the string columns against each other (adding a `lit("01")` literal string for the day)
- `last_day` to calculate the last day in the month of our `start_date`

Hope this helps!
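As a sanity check outside Spark (a plain-Python sketch, not part of the pipeline above), the last-day logic that `last_day` performs can be reproduced with the stdlib `calendar` module:

```python
import calendar

def end_of_month(year: int, month: int) -> str:
    # calendar.monthrange returns (weekday_of_first_day, number_of_days_in_month)
    last = calendar.monthrange(year, month)[1]
    return f"{year:04d}-{month:02d}-{last:02d}"

print(end_of_month(2022, 1))   # -> 2022-01-31
print(end_of_month(2021, 12))  # -> 2021-12-31
```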
CodePudding user response:
You can create a dataframe containing all months with their respective order, join it to your dataframe, then use the `last_day` function in Spark to get the last day:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, concat, col, lit, last_day

spark = SparkSession.builder.master("local[*]").getOrCreate()

data = [
    ["2022", "January"],
    ["2021", "December"],
]
df = spark.createDataFrame(data).toDF("year", "month")

months_data = [["january", "01"], ["february", "02"], ["march", "03"], ["april", "04"],
               ["may", "05"], ["june", "06"], ["july", "07"], ["august", "08"],
               ["september", "09"], ["october", "10"], ["november", "11"], ["december", "12"]]
monthsDf = spark.createDataFrame(months_data).toDF("month_lib", "month_number")

result = df.join(monthsDf, lower(df.month) == lower(monthsDf.month_lib), "left") \
    .withColumn("start_date", concat(col("year"), lit("-"), col("month_number"), lit("-01"))) \
    .withColumn("end_of_month", last_day(col("start_date"))) \
    .select("year", "month", "start_date", "end_of_month")

result.show()
```
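The lookup-table idea above can be mirrored in plain Python (hedged: this only sketches the join logic outside Spark, using the stdlib `calendar.month_name` list instead of a hand-written table):

```python
import calendar

# Build the same month-name -> two-digit-number mapping the join table encodes.
# calendar.month_name[0] is an empty string, so it is skipped.
month_number = {name.lower(): f"{i:02d}" for i, name in enumerate(calendar.month_name) if name}

def start_date(year: str, month: str) -> str:
    # Lowercase the input month, matching the case-insensitive join above.
    return f"{year}-{month_number[month.lower()]}-01"

print(start_date("2022", "January"))   # -> 2022-01-01
print(start_date("2021", "December"))  # -> 2021-12-01
```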
CodePudding user response:
I am not aware of any native PySpark functionality that would do that conversion directly, but you can create a mapping for it. Also, if you are only going to use it for concatenation with the year and day, you don't need an integer. You can do something like this:
```python
from pyspark.sql.functions import when, concat, col, lit

df = df.withColumn(
    "MonthValue",
    when(col("Month") == "January", "01")
    .when(col("Month") == "February", "02")
    .when(col("Month") == "March", "03")
    .when(col("Month") == "April", "04")
    .when(col("Month") == "May", "05")
    .when(col("Month") == "June", "06")
    .when(col("Month") == "July", "07")
    .when(col("Month") == "August", "08")
    .when(col("Month") == "September", "09")
    .when(col("Month") == "October", "10")
    .when(col("Month") == "November", "11")
    .when(col("Month") == "December", "12")
    .otherwise(None)
)
```

Then you can concatenate it like this:

```python
df = df.withColumn(
    "YearMonth",
    # lit("-") makes the separator a literal; a bare "-" would be read as a column name
    concat(col("Year"), lit("-"), col("MonthValue"))
)
```