I need help populating an empty DataFrame in PySpark with auto-generated dates in a column, in the format yyyy-MM-dd, from 1900-01-01 to 2030-12-31.
CodePudding user response:
You can use spark.range() to generate a run of row ids and then add them to a start date, as in the examples below.

PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, lit

spark = SparkSession.builder.getOrCreate()

(
    spark
    .range(0, 100000)  # more rows than there are days in the target range
    .select(
        # add the row id (interpreted as a number of days) to the start date
        (to_date(lit("1900-01-01")) + col("id").cast("int")).alias("date")
    )
    .where(col("date") <= "2030-12-31")  # trim the surplus rows
).show()
The same in SQL:

WITH all_dates AS (
  SELECT
    TO_DATE('1900-01-01') + CAST(r.id AS INT) AS date
  FROM RANGE(0, 100000) AS r
)
SELECT *
FROM all_dates
WHERE date <= '2030-12-31'
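If you are on Spark 2.4 or later, a sketch of an alternative that avoids picking an oversized row count: the built-in sequence function generates the inclusive date range directly (only standard pyspark.sql.functions are used here; the start/stop literals come from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, sequence, to_date, lit

spark = SparkSession.builder.getOrCreate()

(
    spark.range(1)  # a single row to evaluate the sequence expression against
    .select(
        # sequence() yields an array of dates (inclusive, 1-day step by default
        # for date arguments); explode() turns it into one row per date
        explode(
            sequence(to_date(lit("1900-01-01")), to_date(lit("2030-12-31")))
        ).alias("date")
    )
).show()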
CodePudding user response:
Hi, you can use the following sample code. There are 47,847 days from 1900-01-01 through 2030-12-31 inclusive, so the range bound below covers exactly the span the question asks for:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_add, lit

spark = SparkSession.builder.getOrCreate()

# 47847 = number of days from 1900-01-01 through 2030-12-31, inclusive
spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
).show()
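A quick sanity check on the result (a sketch; the expected figures follow from the calendar arithmetic above):

from pyspark.sql.functions import col, date_add, lit, count, min as min_, max as max_

df = spark.range(0, 47847).withColumn(
    "my_date", date_add(lit("1900-01-01"), col("id").cast("int"))
)

# should print 47847 rows, min 1900-01-01, max 2030-12-31
df.agg(count("*").alias("n"), min_("my_date"), max_("my_date")).show()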