Here is my input table:
Name Date
Nancy 2021-08-14
Rictk 2021-08-15
Francky 2021-08-16
Nancy 2021-08-18
Nancy 2022-02-07
Francky 2021-12-06
So here I want to calculate day difference between two unique name records, for Example- Nancy has Three records but we only needs to take top two records and calculate day difference between them (take only first two record for every person). if there is only one record for a person then return as 0.
I want output dataframe like
Name Date Day
Nancy 2021-08-14 4
Rictk 2021-08-15 0
Francky 2021-08-16 6
Nancy 2021-08-18 4
Nancy 2022-02-07 4
Francky 2021-08-22 6
Explanation- Nancy has 4 days because here we consider top two records of Nancy and the date was 2021-08-14 and 2021-08-18 so there is 4 days gap. likewise same for remaining all.
So how I can do this in pyspark or python.
CodePudding user response:
Firstly your output and input dataset values do not match , hence you would see a difference in the results
You can achieve this with a combination of LEAD & DATEDIFF Partitioned by - Name
& Ordered By - Date
and only take the MIN of the datediff and propagate it to your input set
Data Preparation
s = StringIO("""
Name,Date
Nancy,2021-08-14
Rictk,2021-08-15
Francky,2021-08-16
Nancy,2021-08-18
Nancy,2022-02-07
Francky,2021-12-06
""")
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df)\
.withColumn('Date',F.to_date(F.col('date'), 'yyyy-MM-dd'))\
.orderBy(*[F.col('Name'),F.col('Date')])
sparkDF.show()
------- ----------
| Name| Date|
------- ----------
|Francky|2021-08-16|
|Francky|2021-12-06|
| Nancy|2021-08-14|
| Nancy|2021-08-18|
| Nancy|2022-02-07|
| Rictk|2021-08-15|
------- ----------
Lead & DateDiff - Date
Generate Lead values for the combination of Name & Date
window = Window.partitionBy('Name').orderBy(F.col('Date'))
sparkDF = sparkDF.withColumn('lead_date',F.lead(F.col('Date'),-1).over(window))
sparkDF = sparkDF.withColumn('lead_date_diff', F.datediff(
F.lead(F.col('Date'),1).over(window)
,F.col('Date')
)
)
sparkDF.show()
------- ---------- ---------- --------------
| Name| Date| lead_date|lead_date_diff|
------- ---------- ---------- --------------
| Rictk|2021-08-15| null| null|
| Nancy|2021-08-14| null| 4|
| Nancy|2021-08-18|2021-08-14| 173|
| Nancy|2022-02-07|2021-08-18| null|
|Francky|2021-08-16| null| 112|
|Francky|2021-12-06|2021-08-16| null|
------- ---------- ---------- --------------
Join
Finally aggregate the above resultset to only take the MIN of datediff & merge it with your input set
sparkDFAgg = sparkDF.groupby('Name').agg(F.min(F.col('lead_date_diff')).alias('Day'))\
.select(*[F.col('Name').alias('Name_Key'),F.col('Day')])
sparkDF = sparkDF.join(sparkDFAgg
,sparkDF['Name'] == sparkDFAgg['Name_Key']
,'left'
).select(sparkDF['Name']
,sparkDF['Date']
,sparkDFAgg['Day']
).fillna(0)
sparkDF.show()
------- ---------- ---
| Name| Date|Day|
------- ---------- ---
| Rictk|2021-08-15| 0|
| Nancy|2021-08-14| 4|
| Nancy|2021-08-18| 4|
| Nancy|2022-02-07| 4|
|Francky|2021-08-16|112|
|Francky|2021-12-06|112|
------- ---------- ---
Note - You can further reduce your computation by only taking the first 2 records for generating the lead_date_diff
column pre-filtering
CodePudding user response:
in order to calculate the gap between 2 dates you need to first convert the time string into a datetime
object like so:
from datetime import datetime
date1 = datetime.datetime.strptime('2021-08-18', '%Y-%m-%d')
date2 = datetime.datetime.strptime('2021-08-14', '%Y-%m-%d')
- note that you must provide the correct datetime format so that your input string would be parsed correctly.
after you created above objects you can use the -
operator to calculated the difference between the two dates and extract the days parameter:
diff = date1 - date2
diff_days = diff.days