My table loaded in PySpark has a "Date" column with data like the following:
Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
1/3/2012 | 59.97 | 61.06 | 59.87 | 60.33 | 12668800 | 52.61923 |
1/4/2012 | 60.21 | 60.35 | 59.47 | 59.71 | 9593300 | 52.07848 |
1/5/2012 | 59.35 | 59.62 | 58.37 | 59.42 | 12768200 | 51.82554 |
1/6/2012 | 59.42 | 59.45 | 58.87 | 59 | 8069400 | 51.45922 |
How do I calculate the difference, in days, between the max and the min of that column (so, in the example above, I need the difference in days between 1/6/2012 and 1/3/2012)?
CodePudding user response:
Test data:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2012-01-03',),('2013-02-03',),('2011-11-29',)], ['Date']).select(F.col('Date').cast('date'))
df.show()
# +----------+
# |      Date|
# +----------+
# |2012-01-03|
# |2013-02-03|
# |2011-11-29|
# +----------+
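Note: in the question the Date values look like strings such as 1/3/2012. If that is the case for your column, you would presumably need to parse them into a real date type first; a minimal sketch, assuming a month/day/year pattern and a column named Date:
from pyspark.sql import functions as F

# Parse M/d/yyyy strings into a proper date column (assumes the column is called 'Date')
df = df.withColumn('Date', F.to_date(F.col('Date'), 'M/d/yyyy'))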
This will create a new dataframe containing the difference in days:
df_diff = df.groupBy().agg(F.datediff(F.max('Date'), F.min('Date')).alias('diff'))
df_diff.show()
# +----+
# |diff|
# +----+
# | 432|
# +----+
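The same aggregation can also be written with the agg shorthand, which is equivalent to groupBy().agg() over the whole dataframe:
df_diff = df.agg(F.datediff(F.max('Date'), F.min('Date')).alias('diff'))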
# If you need the difference in a variable:
v = df_diff.head().diff
print(v)
# 432
And this will add a new column to your existing df:
df = df.withColumn('diff', F.expr('datediff(max(Date) over(), min(Date) over())'))
df.show()
# +----------+----+
# |      Date|diff|
# +----------+----+
# |2012-01-03| 432|
# |2013-02-03| 432|
# |2011-11-29| 432|
# +----------+----+
CodePudding user response:
Supposing your dataframe df has only the column Date in date format, you can do the following:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Aggregate functions need a window when used in withColumn;
# an empty partitionBy() makes the window span the whole dataframe
w = Window.partitionBy()

(df.withColumn('Max_Date', F.max(F.col('Date')).over(w))
   .withColumn('Min_Date', F.min(F.col('Date')).over(w))
   .withColumn('Diff_days', F.datediff(F.col('Max_Date'), F.col('Min_Date')))
   .drop('Date').dropDuplicates())
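If you assign the chained expression above to a variable, say a hypothetical result, then on the test dataframe from the first answer result.show() would give roughly:
result.show()
# +----------+----------+---------+
# |  Max_Date|  Min_Date|Diff_days|
# +----------+----------+---------+
# |2013-02-03|2011-11-29|      432|
# +----------+----------+---------+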
At this link you can find more examples of SQL functions for PySpark: https://sparkbyexamples.com/spark/spark-sql-functions/