My table loaded in PySpark has a "Date" column with data like the following:
Date | Open | High | Low | Close | Volume | Adj Close |
---|---|---|---|---|---|---|
1/3/2012 | 59.97 | 61.06 | 59.87 | 60.33 | 12668800 | 52.61923 |
1/4/2012 | 60.21 | 60.35 | 59.47 | 59.71 | 9593300 | 52.07848 |
1/5/2012 | 59.35 | 59.62 | 58.37 | 59.42 | 12768200 | 51.82554 |
1/6/2012 | 59.42 | 59.45 | 58.87 | 59 | 8069400 | 51.45922 |
How do I calculate the difference, in days, between the max and the min of that column (so, in the example above, I need the difference in days between 1/6/2012 and 1/3/2012)?
CodePudding user response:
Test data:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2012-01-03',),('2013-02-03',),('2011-11-29',)], ['Date']).select(F.col('Date').cast('date'))
df.show()
# +----------+
# |      Date|
# +----------+
# |2012-01-03|
# |2013-02-03|
# |2011-11-29|
# +----------+
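Note: in the question the Date values look like strings such as 1/3/2012. If that is the case for your column, you would presumably need to parse them into a real date type first; a minimal sketch, assuming a month/day/year pattern and a column named Date:
from pyspark.sql import functions as F

# Parse M/d/yyyy strings into a proper date column (assumes the column is called 'Date')
df = df.withColumn('Date', F.to_date(F.col('Date'), 'M/d/yyyy'))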
This will create a new dataframe containing the difference in days:
df_diff = df.groupBy().agg(F.datediff(F.max('Date'), F.min('Date')).alias('diff'))
df_diff.show()
# +----+
# |diff|
# +----+
# | 432|
# +----+
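The same aggregation can also be written with the agg shorthand, which is equivalent to groupBy().agg() over the whole dataframe:
df_diff = df.agg(F.datediff(F.max('Date'), F.min('Date')).alias('diff'))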
# If you need the difference in a variable:
v = df_diff.head().diff
print(v)
# 432
And this will add a new column to your existing df:
df = df.withColumn('diff', F.expr('datediff(max(Date) over(), min(Date) over())'))
df.show()
# +----------+----+
# |      Date|diff|
# +----------+----+
# |2012-01-03| 432|
# |2013-02-03| 432|
# |2011-11-29| 432|
# +----------+----+
CodePudding user response:
Supposing your dataframe df has only the column Date in date format, you can do the following:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Aggregate functions need a window when used in withColumn;
# an empty partitionBy() makes the window span the whole dataframe
w = Window.partitionBy()

(df.withColumn('Max_Date', F.max(F.col('Date')).over(w))
   .withColumn('Min_Date', F.min(F.col('Date')).over(w))
   .withColumn('Diff_days', F.datediff(F.col('Max_Date'), F.col('Min_Date')))
   .drop('Date').dropDuplicates())
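If you assign the chained expression above to a variable, say a hypothetical result, then on the test dataframe from the first answer result.show() would give roughly:
result.show()
# +----------+----------+---------+
# |  Max_Date|  Min_Date|Diff_days|
# +----------+----------+---------+
# |2013-02-03|2011-11-29|      432|
# +----------+----------+---------+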
At this link you can find more examples of SQL functions for PySpark: https://sparkbyexamples.com/spark/spark-sql-functions/