I'm using PySpark, and I want to only keep the month of my date column.
This is the format of my column for example: "2022-07-25" (string format). I want to create a new column into my dataset with just "07" for this example.
How can I do that?
CodePudding user response:
F.date_format("date_col", "MM")
from pyspark.sql import functions as F
df = spark.createDataFrame([("2022-07-25",)], ["date_col"])
df = df.withColumn("date_col", F.date_format("date_col", "MM"))
df.show()
# --------
# |date_col|
# --------
# |07 |
# --------
CodePudding user response:
You can additionally use month from pyspark & lpad if required
Data Preparation
s = StringIO("""
date_str
2022-03-01 13:59:50
2022-05-20 13:45:50
2022-06-21 16:59:50
2022-10-22 20:59:50
""")
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df)\
.withColumn('date_parsed',F.to_timestamp(F.col('date_str'), 'yyyy-MM-dd HH:mm:ss'))\
.drop('date_str')
sparkDF.show()
-------------------
| date_parsed|
-------------------
|2022-03-01 13:59:50|
|2022-05-20 13:45:50|
|2022-06-21 16:59:50|
|2022-10-22 20:59:50|
-------------------
Extract Month
sparkDF = sparkDF.withColumn("date_month", F.month("date_parsed"))
sparkDF.show()
------------------- ----------
| date_parsed|date_month|
------------------- ----------
|2022-03-01 13:59:50| 3|
|2022-05-20 13:45:50| 5|
|2022-06-21 16:59:50| 6|
|2022-10-22 20:59:50| 10|
------------------- ----------