Home > database >  Unknown date data type (spark,parquet) [13 character long]
Unknown date data type (spark,parquet) [13 character long]

Time:12-16

I have a parquet file with the date column filled with a data type I am having trouble with

enter image description here

I understand that Hive and Impala tend to rebase their time stamp...However, I cannot seem to convert or find any pointers on how to solve this.

I have tried setting int96RebaseModeInRead and datetimeRebaseModeInRead mode to legacy

I also tried to apply a date schema onto the read operation but to no avail.

enter image description here

This is with schema applied

enter image description here

enter image description here

These are the documentations I've reviewed so far. Maybe there's a simple solution I am not seeing. Let's also assume that there's no way for me to ask the person who created the source file what the heck they did.

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option

https://kontext.tech/article/1062/spark-2x-to-3x-date-timestamp-and-int96-rebase-modes

https://docs.cloudera.com/runtime/7.2.1/developing-spark-applications/topics/spark-timestamp-compatibility-parquet.html

Also, this thread is the only one I was able to find that shows how the timestamp is created but not how to reverse it. Please give me some pointers. parquet int96 timestamp conversion to datetime/date via python

CodePudding user response:

As I understand you try to cast order_date column to dateType. If thats the case following code could help. You can read order_date column as stringType from source file and you should use your own timezone for from_utc_timestamp method.

from pyspark.sql.functions import from_utc_timestamp
from pyspark.sql.types import StringType

d = ['1374710400000']
df = spark.createDataFrame(d, StringType())
df.show()


df = df.withColumn('new_date',from_utc_timestamp(from_unixtime(df.value/1000,"yyyy-MM-dd hh:mm:ss"),'GMT 1')).show()

Output:

 ------------- 
|        value|
 ------------- 
|1374710400000|
 ------------- 

 ------------- ------------------- 
|        value|         new_date|
 ------------- ------------------- 
|1374710400000|2013-07-25 13:00:00|
 ------------- ------------------- 
  • Related