Home > Software design >  Pyspark: how to fix 'could not parse datatype: interval' error
Pyspark: how to fix 'could not parse datatype: interval' error

Time:12-15

I'm trying to add a new column to a pyspark df by substracting the values of two existing columns.

I already had a date_of_birth column available, so I inserted a current_date column with the following code:

import datetime
currentdate = "14-12-2021"
day,month,year = currentdate.split('-')
today = datetime.date(int(year),int(month),int(day))

df= df.withColumn("current_date", lit(today))

Displaying my df confirms that it worked. Looks a little something like this:

id date_of_birth current_date
01 1995-01-01 2021-12-2021
02 1987-02-16 2021-12-2021

I inserted the age column by substracting the values of date_of_birth and current_date.

df = df.withColumn('age', (df['current_date'] - df['date_of_birth ']))

Cell runs without a problem.

Here's where I'm stuck:

Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:

'could not parse datatype: interval'

I used df.types() to check what's happening, and apparently my newly inserted age column is of interval type.

How can I fix this?

Is there a way to display the age in years (int) in this particular scenario?

PS: both the date_of_birth and current_date cols have a date type.

CodePudding user response:

Solved it. Mike's comment helped tons. Thank you!

Here's how I solved it:


# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))

# update data with current_date() function
df  = df .withColumn("current_date", f.current_date())

# insert new column age with dummy data (in this case, 1s)
df  = df .withColumn("age", lit(1))

# update data with months_between() function, divide by 12 to obtain years.
df  = df  .withColumn("age", f.months_between(df.current_date, df  .date_of_birth)/12)

# round and cast as interger to get rid of decimals
df  = df  .withColumn("age", f.round(df["age"]).cast('integer'))

CodePudding user response:

Would use one of the pyspark functions for calculating difference between dates.

pyspark.sql.functions.datediff

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.datediff.html

pyspark.sql.functions.months_between

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.months_between.html

  • Related