I need help populating an empty DataFrame in PySpark with auto-generated dates in a column, in the format yyyy-MM-dd, from 1900-01-01 to 2030-12-31.
CodePudding user response:
You can use spark.range() to generate a run of row ids and then add them to a start date, as in the examples below.

PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, lit

spark = SparkSession.builder.getOrCreate()

(
    spark
    .range(0, 100000)  # more rows than there are days in the target range
    .select(
        # add the row id (interpreted as a number of days) to the start date
        (to_date(lit("1900-01-01")) + col("id").cast("int")).alias("date")
    )
    .where(col("date") <= "2030-12-31")  # trim the surplus rows
).show()
The same in SQL:

WITH all_dates AS (
  SELECT
    TO_DATE('1900-01-01') + CAST(r.id AS INT) AS date
  FROM RANGE(0, 100000) AS r
)
SELECT *
FROM all_dates
WHERE date <= '2030-12-31'
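If you are on Spark 2.4 or later, a sketch of an alternative that avoids picking an oversized row count: the built-in sequence function generates the inclusive date range directly (only standard pyspark.sql.functions are used here; the start/stop literals come from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sequence, to_date, lit

spark = SparkSession.builder.getOrCreate()

(
    spark.range(1)  # a single row to evaluate the sequence expression against
    .select(
        # sequence() yields an array of dates (inclusive, 1-day step by default
        # for date arguments); explode() turns it into one row per date
        explode(
            sequence(to_date(lit("1900-01-01")), to_date(lit("2030-12-31")))
        ).alias("date")
    )
).show()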
CodePudding user response:
Hi, you can use the following sample code. There are 47,847 days from 1900-01-01 through 2030-12-31 inclusive, so the range bound below covers exactly the span the question asks for:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_add, lit

spark = SparkSession.builder.getOrCreate()

# 47847 = number of days from 1900-01-01 through 2030-12-31, inclusive
spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
).show()
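A quick sanity check on the result (a sketch; the expected figures follow from the calendar arithmetic above):

from pyspark.sql.functions import col, date_add, lit, count, min as min_, max as max_

df = spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
)

# should print 47847 rows, min 1900-01-01, max 2030-12-31
df.agg(count("*").alias("n"), min_("my_date"), max_("my_date")).show()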