Home > OS >  Combine month and year columns to create date column
Combine month and year columns to create date column

Time:11-23

I have a pyspark dataframe:

Location    Month   Year
USA         1       2020
USA         1       2021
CAN         1       2020
CAN         2       2020
CAN         1       2021
CAN         2       2021

I need to combine Month and Year columns to create Date column. My expected dataframe should be:

Location    Month   Year    Date
USA         1       2020    1/1/2020
USA         1       2021    1/1/2021    
CAN         1       2020    1/1/2020
CAN         2       2020    2/1/2020
CAN         1       2021    1/1/2021
CAN         2       2021    2/1/2021

All the dates should be starting on 1st of every month. I have tried this code:

df = df.select(concat_ws('/',df.month,1,df.year).alias('Month') \
                  ,df["*"])

and getting this error:

Invalid argument, not a string or column

Both Month and Year columns are of long datatype

CodePudding user response:

You get the Invalid argument, not a string or column because argument 1 in your concat_ws('/', df.month, 1, df.year) is neither a column or a string (string that should be the name of a column). You can correct it by using lit built-in function, as follows:

from pyspark.sql import functions as F

df = df.select(F.concat_ws('/', df.month, F.lit(1), df.year).alias('Month'), df["*"])

CodePudding user response:

You could also use withColumn and concat.

Example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
data = [
    {"Location": "USA", "Month": 1, "Year": 2020},
    {"Location": "USA", "Month": 2, "Year": 2020},
]
df = spark.createDataFrame(data=data)
df = df.withColumn(
    "Date", F.concat(F.col("Month"), F.lit("/"), F.lit("1"), F.lit("/"), F.col("Year"))
)

Result:

root
 |-- Location: string (nullable = true)
 |-- Month: long (nullable = true)
 |-- Year: long (nullable = true)
 |-- Date: string (nullable = true)

 -------- ----- ---- --------                                                   
|Location|Month|Year|Date    |
 -------- ----- ---- -------- 
|USA     |1    |2020|1/1/2020|
|USA     |2    |2020|2/1/2020|
 -------- ----- ---- -------- 

CodePudding user response:

Or using make_date function (available since Spark 3):

import pyspark.sql.functions as F

df1 = df.withColumn(
    "Date",
    F.date_format(F.expr("make_date(Year, Month, 1)"), "MM/dd/yyyy")
)
  • Related