I have a column that is an array of ordered dates in a PySpark DataFrame:
dates:array
element:timestamp
[ "2012-02-13T00:00:00.000 0000", "2015-01-29T00:00:00.000 0000", "2018-07-29T00:00:00.000 0000", "2020-07-29T00:00:00.000 0000"]
I would like to create a new column that contains the differences, in days, between each consecutive pair of these dates. For the example above, the desired output would be:
[1081, 1277, 731]
And the new column would be:
date_diffs:array
element: integer
Thanks in advance for helping me come up with the code to produce this.
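For reference, a DataFrame with this layout can be built like this (just a sketch; the column name dates is assumed):

from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row holding the example array of timestamps in a column named `dates`
df = spark.createDataFrame(
    [([datetime(2012, 2, 13), datetime(2015, 1, 29),
       datetime(2018, 7, 29), datetime(2020, 7, 29)],)],
    ['dates'],
)
df.printSchema()  # dates: array, element: timestamp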
CodePudding user response:
from datetime import datetime

tt = ["2012-02-13T00:00:00.000+0000", "2015-01-29T00:00:00.000+0000", "2018-07-29T00:00:00.000+0000", "2020-07-29T00:00:00.000+0000"]
# Parse only the date part (first 10 characters) of each timestamp string
todate = lambda x: datetime.strptime(x[:10], '%Y-%m-%d')
# Day difference between each consecutive pair of dates
dd = [(todate(tt[ii]) - todate(tt[ii - 1])).days for ii in range(1, len(tt))]
print(dd)  # [1081, 1277, 731]
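To apply the same pairwise logic directly to the array column, the difference computation could be wrapped in a UDF. A minimal sketch, assuming the column is named dates and already holds timestamps:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# The UDF receives the array column as a Python list of datetime objects,
# so the same consecutive-difference logic applies.
date_diffs_udf = F.udf(
    lambda ts: [(ts[i] - ts[i - 1]).days for i in range(1, len(ts))],
    ArrayType(IntegerType()),
)
df = df.withColumn('date_diffs', date_diffs_udf(F.col('dates')))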
CodePudding user response:
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType, ArrayType, IntegerType
datalist = [
    ["2012-02-13T00:00:00.000+0000", "2015-01-29T00:00:00.000+0000", "2018-07-29T00:00:00.000+0000", "2020-07-29T00:00:00.000+0000"],
    ["2012-02-13T00:00:00.000+0000", "2015-01-30T00:00:00.000+0000", "2018-07-31T00:00:00.000+0000", "2020-08-01T00:00:00.000+0000"],
]
df = spark.createDataFrame(datalist, ['t1', 't2', 't3', 't4'])

# Cast each string column to a timestamp
for c in df.columns:
    df = df.withColumn(c, df[c].cast(TimestampType()))

# UDF returning the day differences between consecutive timestamps
_udf = F.udf(lambda x1, x2, x3, x4: [(x2 - x1).days, (x3 - x2).days, (x4 - x3).days], ArrayType(IntegerType()))
df = df.withColumn('date_diffs', _udf(df['t1'], df['t2'], df['t3'], df['t4']))
df.show()
result:
+-------------------+-------------------+-------------------+-------------------+-----------------+
|                 t1|                 t2|                 t3|                 t4|       date_diffs|
+-------------------+-------------------+-------------------+-------------------+-----------------+
|2012-02-13 09:00:00|2015-01-29 09:00:00|2018-07-29 09:00:00|2020-07-29 09:00:00|[1081, 1277, 731]|
|2012-02-13 09:00:00|2015-01-30 09:00:00|2018-07-31 09:00:00|2020-08-01 09:00:00|[1082, 1278, 732]|
+-------------------+-------------------+-------------------+-------------------+-----------------+
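If the dates stay in a single array column as in the question, the same result can also be produced without a UDF, using Spark's higher-order array functions. A sketch, assuming Spark 2.4+ and a column named dates:

from pyspark.sql import functions as F

# Drop the first element, then for each remaining date compute the day
# difference to the element that precedes it in the original array
# (i is the 0-based index within the sliced array, so dates[i] is the predecessor).
df = df.withColumn(
    'date_diffs',
    F.expr('transform(slice(dates, 2, size(dates) - 1), (d, i) -> datediff(d, dates[i]))')
)

datediff returns whole days, so the new column is an array of integers like [1081, 1277, 731].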