How to calculate Date Difference in pyspark or python?-CodePudding

Here is my input table:

Name            Date
Nancy           2021-08-14
Rictk           2021-08-15
Francky         2021-08-16
Nancy           2021-08-18
Nancy           2022-02-07
Francky         2021-12-06

So here I want to calculate day difference between two unique name records, for Example- Nancy has Three records but we only needs to take top two records and calculate day difference between them (take only first two record for every person). if there is only one record for a person then return as 0.

I want output dataframe like

Name            Date            Day
Nancy           2021-08-14      4
Rictk           2021-08-15      0
Francky         2021-08-16      6
Nancy           2021-08-18      4
Nancy           2022-02-07      4
Francky         2021-08-22      6

Explanation- Nancy has 4 days because here we consider top two records of Nancy and the date was 2021-08-14 and 2021-08-18 so there is 4 days gap. likewise same for remaining all.

So how I can do this in pyspark or python.

CodePudding user response：

Firstly your output and input dataset values do not match , hence you would see a difference in the results

You can achieve this with a combination of LEAD & DATEDIFF Partitioned by - Name & Ordered By - Date and only take the MIN of the datediff and propagate it to your input set

Data Preparation

s = StringIO("""
Name,Date
Nancy,2021-08-14
Rictk,2021-08-15
Francky,2021-08-16
Nancy,2021-08-18
Nancy,2022-02-07
Francky,2021-12-06
""")

df = pd.read_csv(s,delimiter=',')

sparkDF = sql.createDataFrame(df)\
             .withColumn('Date',F.to_date(F.col('date'), 'yyyy-MM-dd'))\
             .orderBy(*[F.col('Name'),F.col('Date')])

sparkDF.show()

 ------- ---------- 
|   Name|      Date|
 ------- ---------- 
|Francky|2021-08-16|
|Francky|2021-12-06|
|  Nancy|2021-08-14|
|  Nancy|2021-08-18|
|  Nancy|2022-02-07|
|  Rictk|2021-08-15|
 ------- ----------

Lead & DateDiff - Date

Generate Lead values for the combination of Name & Date

window = Window.partitionBy('Name').orderBy(F.col('Date'))

sparkDF = sparkDF.withColumn('lead_date',F.lead(F.col('Date'),-1).over(window))

sparkDF = sparkDF.withColumn('lead_date_diff', F.datediff(
                                                           F.lead(F.col('Date'),1).over(window)
                                                          ,F.col('Date')
                                                        )
                                            )
sparkDF.show()

 ------- ---------- ---------- -------------- 
|   Name|      Date| lead_date|lead_date_diff|
 ------- ---------- ---------- -------------- 
|  Rictk|2021-08-15|      null|          null|
|  Nancy|2021-08-14|      null|             4|
|  Nancy|2021-08-18|2021-08-14|           173|
|  Nancy|2022-02-07|2021-08-18|          null|
|Francky|2021-08-16|      null|           112|
|Francky|2021-12-06|2021-08-16|          null|
 ------- ---------- ---------- --------------

Join

Finally aggregate the above resultset to only take the MIN of datediff & merge it with your input set

sparkDFAgg = sparkDF.groupby('Name').agg(F.min(F.col('lead_date_diff')).alias('Day'))\
                    .select(*[F.col('Name').alias('Name_Key'),F.col('Day')])

sparkDF = sparkDF.join(sparkDFAgg
                    ,sparkDF['Name'] == sparkDFAgg['Name_Key']
                    ,'left'
          ).select(sparkDF['Name']
                   ,sparkDF['Date']
                   ,sparkDFAgg['Day']
          ).fillna(0)


sparkDF.show()

 ------- ---------- --- 
|   Name|      Date|Day|
 ------- ---------- --- 
|  Rictk|2021-08-15|  0|
|  Nancy|2021-08-14|  4|
|  Nancy|2021-08-18|  4|
|  Nancy|2022-02-07|  4|
|Francky|2021-08-16|112|
|Francky|2021-12-06|112|
 ------- ---------- ---

Note - You can further reduce your computation by only taking the first 2 records for generating the lead_date_diff column pre-filtering

CodePudding user response：

in order to calculate the gap between 2 dates you need to first convert the time string into a datetime object like so:

from datetime import datetime
date1 = datetime.datetime.strptime('2021-08-18', '%Y-%m-%d')
date2 = datetime.datetime.strptime('2021-08-14', '%Y-%m-%d')

note that you must provide the correct datetime format so that your input string would be parsed correctly.

after you created above objects you can use the - operator to calculated the difference between the two dates and extract the days parameter:

diff = date1 - date2
diff_days = diff.days