I have a pyspark dataframe:
Location Month Year
USA 1 2020
USA 1 2021
CAN 1 2020
CAN 2 2020
CAN 1 2021
CAN 2 2021
I need to combine Month and Year columns to create Date column. My expected dataframe should be:
Location Month Year Date
USA 1 2020 1/1/2020
USA 1 2021 1/1/2021
CAN 1 2020 1/1/2020
CAN 2 2020 2/1/2020
CAN 1 2021 1/1/2021
CAN 2 2021 2/1/2021
All the dates should be starting on 1st of every month. I have tried this code:
df = df.select(concat_ws('/',df.month,1,df.year).alias('Month') \
,df["*"])
and getting this error:
Invalid argument, not a string or column
Both Month and Year columns are of long datatype
CodePudding user response:
You get the Invalid argument, not a string or column
because argument 1
in your concat_ws('/', df.month, 1, df.year)
is neither a column or a string (string that should be the name of a column). You can correct it by using lit
built-in function, as follows:
from pyspark.sql import functions as F
df = df.select(F.concat_ws('/', df.month, F.lit(1), df.year).alias('Month'), df["*"])
CodePudding user response:
You could also use withColumn
and concat
.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
data = [
{"Location": "USA", "Month": 1, "Year": 2020},
{"Location": "USA", "Month": 2, "Year": 2020},
]
df = spark.createDataFrame(data=data)
df = df.withColumn(
"Date", F.concat(F.col("Month"), F.lit("/"), F.lit("1"), F.lit("/"), F.col("Year"))
)
Result:
root
|-- Location: string (nullable = true)
|-- Month: long (nullable = true)
|-- Year: long (nullable = true)
|-- Date: string (nullable = true)
-------- ----- ---- --------
|Location|Month|Year|Date |
-------- ----- ---- --------
|USA |1 |2020|1/1/2020|
|USA |2 |2020|2/1/2020|
-------- ----- ---- --------
CodePudding user response:
Or using make_date
function (available since Spark 3):
import pyspark.sql.functions as F
df1 = df.withColumn(
"Date",
F.date_format(F.expr("make_date(Year, Month, 1)"), "MM/dd/yyyy")
)