Pandas API on Spark - Difference between two date columns-CodePudding

I want the difference between two date columns in the number of days.
In pandas dataframe difference in two "datetime64" type columns returns number of days
but in pyspark.pandas dataframe the difference is returned in the "int" type.

import pandas as pd
import pyspark.pandas as ps

data = {
    "d1": [
        "2019-05-18",
        "2019-06-21",
        "2019-05-08",
        "2019-05-22",
        "2019-11-20",
        "2019-05-29",
    ],
    "d2": [
        "2019-05-21",
        "2019-06-21",
        "2019-05-09",
        "2019-05-23",
        "2019-11-21",
        "2019-05-30",
    ],
}
df = pd.DataFrame(data)
df[["d1", "d2"]] = df[["d1", "d2"]].astype("datetime64")
pdf = ps.from_pandas(df)

df["diff"] = (df["d1"] - df["d2"]).dt.days
pdf["diff"] = pdf["d1"] - pdf["d2"]

Pandas difference Output: [-3, 0, -1, -1, -1, -1]
PySpark difference Output: [-259200, 0, -86400, -86400, -86400, -86400]

Expected Op: PySpark difference should return in days format and be accessible using dt.days

Tried:

Converting int column to DateTime

pdf['diff'].astype('datetime64')

>>>['1969-12-29 00:00:00', '1970-01-01 00:00:00',
    '1969-12-31 00:00:00', '1969-12-31 00:00:00',
    '1969-12-31 00:00:00', '1969-12-31 00:00:00']

The current sol that I am working with is

temp = pdf[["d1", "d2"]].to_pandas()
pdf["diff2"] = ps.Series((temp["d1"] - temp["d2"]).dt.days)

>>> [-3,  0, -1, -1, -1, -1]

This sol works, but for large datasets converting to pandas and doing operation on the two columns adds overhead and increases the runtime. So any concise way to get difference in two datetime columns in pyspark.pandas dataframe??

CodePudding user response：

the diff is a timestamp diff, therefore, the result is in second : 259200 / 3600 / 24 = 3. Just add some math and you'll get your expected result.

You can also use datediff :

from pyspark.sql import functions as F 


df.select(F.datediff(df.d2, df.d1).alias('diff')).collect()
# [Row(diff=32)]

CodePudding user response：

Apply to_timedelta() and use unit as "seconds":

pdf["diff"] = (pdf['d1'] - pdf['d2']).apply(lambda x: ps.to_timedelta(x, unit="seconds").days)

Full example:

import pandas as pd
import pyspark.pandas as ps

data = {
    'd1':['2019-05-18','2019-06-21','2019-05-08','2019-05-22','2019-11-20','2019-05-29'],
    'd2':['2019-05-21','2019-06-21','2019-05-09','2019-05-23','2019-11-21','2019-05-30']
       }
df = pd.DataFrame(data)
df[['d1','d2']] = df[['d1','d2']].astype('datetime64')
pdf = ps.from_pandas(df)

df["diff"] = (df['d1'] - df['d2']).dt.days

pdf["diff"] = (pdf['d1'] - pdf['d2']).apply(lambda x: ps.to_timedelta(x, unit="seconds").days)

print(pdf)

[Out]:
          d1         d2  diff
0 2019-05-18 2019-05-21    -3
1 2019-06-21 2019-06-21     0
2 2019-05-08 2019-05-09    -1
3 2019-05-22 2019-05-23    -1
4 2019-11-20 2019-11-21    -1
5 2019-05-29 2019-05-30    -1