Home > database >  Getting day differences between array of dates in pyspark
Getting day differences between array of dates in pyspark

Time:09-01

I have column that is an array of ordered dates in a Pyspark dataframe:

dates:array
     element:timestamp

[ "2012-02-13T00:00:00.000 0000", "2015-01-29T00:00:00.000 0000", "2018-07-29T00:00:00.000 0000", "2020-07-29T00:00:00.000 0000"]

I would like to create a new column that contains the date differences between each of these. So in the example, the desired output would be:

[1081, 1277, 731]

And the new column would be:

date_diffs:array
  element: integer

Thanks in advance for helping me come up with the code to produce this.

CodePudding user response:

from datetime import datetime

tt = [ "2012-02-13T00:00:00.000 0000", "2015-01-29T00:00:00.000 0000", "2018-07-29T00:00:00.000 0000", "2020-07-29T00:00:00.000 0000"]
todate = lambda x: datetime.strptime(x[:10],'%Y-%m-%d')
dd = [todate(tt[ii])-todate(tt[ii-1]) for ii in range(1, len(tt))]
print(dd)

CodePudding user response:

from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType, ArrayType, IntegerType

datalist = [
    ["2012-02-13T00:00:00.000 0000", "2015-01-29T00:00:00.000 0000", "2018-07-29T00:00:00.000 0000", "2020-07-29T00:00:00.000 0000"],
    ["2012-02-13T00:00:00.000 0000", "2015-01-30T00:00:00.000 0000", "2018-07-31T00:00:00.000 0000", "2020-08-01T00:00:00.000 0000"],
]

df = spark.createDataFrame(datalist, ['t1', 't2', 't3', 't4'])
for c in df.columns:
    df = df.withColumn(c, df[c].cast(TimestampType()))

_udf = F.udf(lambda x1, x2, x3, x4: [(x2-x1).days, (x3-x2).days, (x4-x3).days], ArrayType(IntegerType()))
df = df.withColumn('date_diffs', _udf(df['t1'], df['t2'], df['t3'], df['t4']))

df.show()

result:

 ------------------- ------------------- ------------------- ------------------- ----------------- 
|                 t1|                 t2|                 t3|                 t4|       date_diffs|
 ------------------- ------------------- ------------------- ------------------- ----------------- 
|2012-02-13 09:00:00|2015-01-29 09:00:00|2018-07-29 09:00:00|2020-07-29 09:00:00|[1081, 1277, 731]|
|2012-02-13 09:00:00|2015-01-30 09:00:00|2018-07-31 09:00:00|2020-08-01 09:00:00|[1082, 1278, 732]|
 ------------------- ------------------- ------------------- ------------------- ----------------- 
  • Related