Home > database >  Only keep the month of a date column
Only keep the month of a date column

Time:07-26

I'm using PySpark, and I want to only keep the month of my date column.

This is the format of my column for example: "2022-07-25" (string format). I want to create a new column into my dataset with just "07" for this example.

How can I do that?

CodePudding user response:

F.date_format("date_col", "MM")
from pyspark.sql import functions as F
df = spark.createDataFrame([("2022-07-25",)], ["date_col"])

df = df.withColumn("date_col", F.date_format("date_col", "MM"))

df.show()
#  -------- 
# |date_col|
#  -------- 
# |07      |
#  -------- 

CodePudding user response:

You can additionally use month from pyspark & lpad if required

Data Preparation

s = StringIO("""
date_str
2022-03-01 13:59:50
2022-05-20 13:45:50
2022-06-21 16:59:50
2022-10-22 20:59:50
""")

df = pd.read_csv(s,delimiter=',')

sparkDF = sql.createDataFrame(df)\
             .withColumn('date_parsed',F.to_timestamp(F.col('date_str'), 'yyyy-MM-dd HH:mm:ss'))\
             .drop('date_str')

sparkDF.show()

 ------------------- 
|        date_parsed|
 ------------------- 
|2022-03-01 13:59:50|
|2022-05-20 13:45:50|
|2022-06-21 16:59:50|
|2022-10-22 20:59:50|
 ------------------- 

Extract Month

sparkDF = sparkDF.withColumn("date_month", F.month("date_parsed"))

sparkDF.show()

 ------------------- ---------- 
|        date_parsed|date_month|
 ------------------- ---------- 
|2022-03-01 13:59:50|         3|
|2022-05-20 13:45:50|         5|
|2022-06-21 16:59:50|         6|
|2022-10-22 20:59:50|        10|
 ------------------- ---------- 
  • Related