I'd like to know how to get the value of a calculation done using functions such as date_add
, datediff
, date_sub
, etc. The actual value of it in a variable.
As an example:
start_date = '2022-03-06'
end_date = '2022-03-01'
date_lag = datediff(to_date(lit(start_date)), to_date(lit(end_date)))
If I run date_lag
, the output is: Column<'datediff(to_date(2022-03-06), to_date(2022-03-01))'>
.
The expected output would be 5
.
I was told by a coworker, I'd have to create a dataframe, apply the column expression and then apply a collect
to get the value, but I was hoping there would be a simpler way to do it.
CodePudding user response:
You have used PySpark functions datediff
, to_date
, lit
. They all return a column data type. Columns (also the results of your calculations) do not exist unless you add them to a dataframe AND return the dataframe in some way.
So, your colleague was correct telling that first you need to create a dataframe (which will hold your column) and then, since you want your value in a variable, you will have to tell which record from that column you want to take (this can be done using either collect
, head
, take
, first
,..)
Creating a dataframe with 3 records and adding your column to it:
from pyspark.sql import functions as F
start_date = '2022-03-06'
end_date = '2022-03-01'
date_lag = F.datediff(F.to_date(F.lit(start_date)), F.to_date(F.lit(end_date)))
df = spark.range(3).select(
date_lag.alias('column_name')
)
df.show()
# -----------
# |column_name|
# -----------
# | 5|
# | 5|
# | 5|
# -----------
Any of the following lines will write the top row's value of your column into a variable.
date_lag_var = df.head().column_name
date_lag_var = df.first().column_name
date_lag_var = df.take(1)[0].column_name
date_lag_var = df.limit(1).collect()[0].column_name
CodePudding user response:
you can easily do it using python
>>> start_date = '2022-03-06'
>>> end_date = '2022-03-01'
>>> str_d1=start_date.split("-")[0] "/" start_date.split("-")[1] "/" start_date.split("-")[2]
>>> str_d1
'2022/03/06'
>>> str_d2=end_date.split("-")[0] "/" end_date.split("-")[1] "/" end_date.split("-")[2]
>>> str_d2
'2022/03/01'
>>> d1 = datetime.strptime(str_d1, "%Y/%m/%d")
>>> d2 = datetime.strptime(str_d2, "%Y/%m/%d")
>>> delta = d1-d2
>>> delta.days
5