I am reading a table from PostgreSQL. The table contains a column elap_time whose data type is time. But when the table is read into a Spark DataFrame, the column is automatically converted to timestamp. How can I avoid this?
config = {
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
}

# Apply the GCS connector settings to the Hadoop configuration
conf = spark.sparkContext._jsc.hadoopConfiguration()
for name, value in config.items():
    conf.set(name, value)

# JDBC connection options for the PostgreSQL table
source = {
    "url": url,
    "user": user,
    "database": db,
    "password": password,  # "pass" is a reserved keyword in Python, so use "password"
    "port": "5432",
    "driver": "org.postgresql.Driver",
    "dbtable": "newtable"
}

df = spark.read.format("jdbc").options(**source).load()
df.printSchema()
This is the output of df.printSchema():
 |-- elap_time: timestamp (nullable = true)
This is the actual data in the database:
-------------------
| elap_time |
-------------------
|08:02:19 |
|08:02:19 |
-------------------
This is the output of the Spark DataFrame:
-------------------
| elap_time|
-------------------
|1970-01-01 08:02:19|
|1970-01-01 08:02:19|
-------------------
CodePudding user response:
As suggested by user DKNY in the comment section: "Once your dataframe is created, add a new column using withColumn and parse the original time as date_format(to_timestamp(), "HH:mm:ss"). You can then replace the original column with the newly created one. I know this is kind of a workaround, but I can't think of any alternative solution as of now."
from pyspark.sql.functions import date_format

# Overwrite the original column with just the HH:mm:ss portion of the timestamp
df = df.withColumn('elap_time', date_format('elap_time', 'HH:mm:ss'))
df.show()
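
Since Spark SQL has no dedicated time type, the JDBC source maps PostgreSQL's time columns to timestamp by default. If you are on Spark 2.3 or later, another option worth trying is the JDBC reader's customSchema option, which overrides the inferred type for individual columns so the value is fetched as a plain string in the first place. This is an untested sketch that reuses the source options from the question:

# Sketch (assumes Spark 2.3+): read elap_time as a string instead of letting
# the JDBC source map PostgreSQL's time type to timestamp.
df = (spark.read.format("jdbc")
      .options(**source)
      .option("customSchema", "elap_time STRING")
      .load())

df.printSchema()  # elap_time should now show up as string (nullable = true)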