Home > Blockchain >  Column with array with all months in x amount of years from starting date - Pyspark
Column with array with all months in x amount of years from starting date - Pyspark

Time:02-24

Imagine you have a dataframe df as follows:

ID  Years     Date
A    5    2021-02-01
B    3    2021-02-01
C    6    2021-02-01

I want to be able to create an additional date array column with all the dates starting from the initial date 1 month all the way to the x amount of years in the years column. It would look like the following:

ID  Years     Date        Dates
A    5    2021-02-01     [2021-03-01,2021-04-01,...,2026-02-01]
B    3    2021-03-01     [2021-04-01,2021-04-01,...,2024-03-01]
C    6    2021-02-01     [2021-03-01,2021-04-01,...,2027-02-01]

CodePudding user response:

For spark >= 2.4, you can use the sequence and add_months functions to generate the desired sequence of dates.

df = df.withColumn('Dates',
                   F.expr('sequence(add_months(to_date(Date), 1), add_months(to_date(Date), int(Years) * 12), interval 1 month)')
                   )
df.show(truncate=False)

CodePudding user response:

I am no expert in PySpark, and I heard that changing PySpark to a pandas dataframe could not be interesting in some cases. But if it's ok, you can change the format to pandas and try using apply function:

df = df_spark.toPandas()
def getRangeDate(row):
  return list(map(lambda x: x.strftime("%Y-%m-%d"), list(pd.date_range(start = row["Date"], periods = 12*row["Years"] 1, freq = 'MS'))[1:]))
df['Dates'] = df.apply(getRangeDate, axis=1)
df

which, regarding your example input, has the following output:

    ID  Years   Date    Dates
0   A   5   2021-02-01  [2021-03-01, 2021-04-01, 2021-05-01, ..., 2026-02-01]
1   B   3   2021-03-01  [2021-04-01, 2021-05-01, 2021-06-01, ..., 2024-03-01]
2   C   6   2021-02-01  [2021-03-01, 2021-04-01, 2021-05-01, ..., 2027-02-01]
  • Related