PySpark DataFrame keeps converting time column to timestamp


I am reading a table from PostgreSQL. The table contains a column elap_time whose data type is time, but when the table is read into a Spark DataFrame the column is automatically converted to timestamp. How can I avoid this?

config = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
}

conf = spark.sparkContext._jsc.hadoopConfiguration()
for name, value in config.items():
    conf.set(name, value)

source = {
    'url': url,
    'user': user,
    'database': db,
    'password': password,  # 'pass' is a reserved word in Python and cannot be a variable name
    'port': '5432',
    'driver': 'org.postgresql.Driver',
    'dbtable': 'newtable'
}

df = spark.read.format("jdbc").options(**source).load()
df.printSchema()

This is the output of df.printSchema():

 |-- elap_time: timestamp (nullable = true)

This is the actual data in the database:

+-------------------+
|         elap_time |
+-------------------+
|08:02:19           |
|08:02:19           |
+-------------------+

This is the output of the Spark DataFrame:

+-------------------+
|          elap_time|
+-------------------+
|1970-01-01 08:02:19|
|1970-01-01 08:02:19|
+-------------------+

CodePudding user response:

As suggested by user DKNY in the comment section: "Once your dataframe is created, add a new column using withColumn and parse the original time as date_format(to_timestamp(), "HH:mm:ss"). You can replace the original column with the newly created column. I know this is kind of a workaround, but I can't think of an alternative solution as of now."
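
Applied to this question, the column to convert is elap_time (the original snippet used a column named date):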

from pyspark.sql.functions import date_format

# Format the timestamp back to an HH:mm:ss string, replacing the original column.
df = df.withColumn('elap_time', date_format('elap_time', 'HH:mm:ss'))
df.show()
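
Note that date_format returns a string: Spark SQL (at least through 3.x) has no TIME data type, which is also why the JDBC reader widens PostgreSQL time values to timestamp in the first place. Alternatively, the conversion can be avoided at read time with the JDBC source's customSchema option (available since Spark 2.3), which overrides the default type mapping for the listed columns. A minimal sketch, reusing the source options from the question:

df = (spark.read.format("jdbc")
      .options(**source)
      # Surface elap_time as a plain string instead of the default timestamp
      # mapping; columns not listed in customSchema keep their inferred types.
      .option("customSchema", "elap_time STRING")
      .load())

df.printSchema()  # elap_time: string
df.show()         # rows show 08:02:19 without the 1970-01-01 prefix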