Home > Software engineering >  PySpark Convert Column<> to Value
PySpark Convert Column<> to Value

Time:05-31

I'd like to know how to get the value of a calculation done using functions such as date_add, datediff, date_sub, etc. The actual value of it in a variable.

As an example:

start_date = '2022-03-06'
end_date = '2022-03-01'

date_lag = datediff(to_date(lit(start_date)), to_date(lit(end_date))) 

If I run date_lag, the output is: Column<'datediff(to_date(2022-03-06), to_date(2022-03-01))'>.

The expected output would be 5.

I was told by a coworker, I'd have to create a dataframe, apply the column expression and then apply a collect to get the value, but I was hoping there would be a simpler way to do it.

CodePudding user response:

You have used PySpark functions datediff, to_date, lit. They all return a column data type. Columns (also the results of your calculations) do not exist unless you add them to a dataframe AND return the dataframe in some way.

So, your colleague was correct telling that first you need to create a dataframe (which will hold your column) and then, since you want your value in a variable, you will have to tell which record from that column you want to take (this can be done using either collect, head, take, first,..)

Creating a dataframe with 3 records and adding your column to it:

from pyspark.sql import functions as F

start_date = '2022-03-06'
end_date = '2022-03-01'
date_lag = F.datediff(F.to_date(F.lit(start_date)), F.to_date(F.lit(end_date)))

df = spark.range(3).select(
    date_lag.alias('column_name')
)
df.show()
#  ----------- 
# |column_name|
#  ----------- 
# |          5|
# |          5|
# |          5|
#  ----------- 

Any of the following lines will write the top row's value of your column into a variable.

date_lag_var = df.head().column_name
date_lag_var = df.first().column_name
date_lag_var = df.take(1)[0].column_name
date_lag_var = df.limit(1).collect()[0].column_name

CodePudding user response:

you can easily do it using python

>>> start_date = '2022-03-06'
>>> end_date = '2022-03-01'

>>> str_d1=start_date.split("-")[0] "/" start_date.split("-")[1] "/" start_date.split("-")[2]
>>> str_d1
'2022/03/06'
>>> str_d2=end_date.split("-")[0] "/" end_date.split("-")[1] "/" end_date.split("-")[2]
>>> str_d2
'2022/03/01'

>>> d1 = datetime.strptime(str_d1, "%Y/%m/%d")
>>> d2 = datetime.strptime(str_d2, "%Y/%m/%d")

>>> delta = d1-d2
>>> delta.days
5
  • Related