I am fairly new to PySpark/Hive and I have a problem:
I have a dataframe and want to write it as a partitioned table to HDFS. So far, I've done that via:
# create the partitioned external table once
spark.sql('''
CREATE EXTERNAL TABLE database.df(
    ID STRING
)
PARTITIONED BY (
    DATA_DATE_PART STRING
)
STORED AS PARQUET
LOCATION 'hdfs://path/file'
''')
# expose the dataframe to Spark SQL, then insert into the target partition
df.createOrReplaceTempView("df")
spark.sql('''
INSERT INTO database.df PARTITION(DATA_DATE_PART = '{}')
SELECT ID
FROM df
'''.format(date))
But since the dataframe keeps growing, I don't want to define every column by hand, so I thought there must be a better solution and tried:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART').parquet("/path/file")
However, I cannot access a table written like this via spark.sql, nor does it show up in my HUE browser. I can see the files, though, from the HDFS shell: hdfs dfs -ls /path/
So my question is: why is that? I've read that Parquet files can be special when read with SQL, but my first script works fine and the table is visible everywhere.
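For reference, a quick sanity check (assuming the same /path/file location as above) shows the data itself is readable, just not registered as a table:

df_check = spark.read.parquet("/path/file")  # reads the files directly, bypassing the metastore
df_check.show()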
CodePudding user response:
You just need to use the saveAsTable
function for that (doc). By default it stores the data in the default warehouse location, but you can use the path
option to redefine it and make the table "unmanaged" (see this doc for more details). Just use the following code:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART') \
    .format("parquet") \
    .option("path", "/path/file") \
    .saveAsTable("database.df")
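After that the table is registered in the metastore, so it can be queried from Spark SQL and should show up in HUE as well. A minimal check, assuming the same database.df name and DATA_DATE_PART partition column as above:

# list the partitions Spark registered for the table
spark.sql("SHOW PARTITIONS database.df").show()
# query a single date partition
spark.sql("SELECT ID FROM database.df WHERE DATA_DATE_PART = '{}'".format(date)).show()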