I am fairly new to PySpark/Hive and I have a problem:
I have a dataframe and want to write it as a partitioned table to HDFS. So far, I've done that via:
# create the partitioned external table once
spark.sql('''
CREATE EXTERNAL TABLE database.df(
    ID STRING
)
PARTITIONED BY (
    DATA_DATE_PART STRING
)
STORED AS PARQUET
LOCATION 'hdfs://path/file'
''')
# expose the dataframe to Spark SQL, then insert into the target partition
df.createOrReplaceTempView("df")
spark.sql('''
INSERT INTO database.df PARTITION(DATA_DATE_PART = '{}')
SELECT ID
FROM df
'''.format(date))
But since the dataframe keeps growing, I don't want to define every column by hand, so I thought there must be a better solution and tried:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART').parquet("/path/file")
However, I cannot access a table written like this via spark.sql, nor does it show up in my HUE browser. I can see the files, though, from the HDFS shell: hdfs dfs -ls /path/
So my question is: why is that? I've read that Parquet files can be special when read with SQL, but my first script works fine and the table is visible everywhere.
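For reference, a quick sanity check (assuming the same /path/file location as above) shows the data itself is readable, just not registered as a table:

df_check = spark.read.parquet("/path/file")  # reads the files directly, bypassing the metastore
df_check.show()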
CodePudding user response:
You just need to use the saveAsTable
function for that (doc). By default it stores the data in the default warehouse location, but you can use the path
option to redefine it and make the table "unmanaged" (see this doc for more details). Just use the following code:
df.write.mode('overwrite').partitionBy('DATA_DATE_PART') \
    .format("parquet") \
    .option("path", "/path/file") \
    .saveAsTable("database.df")
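After that the table is registered in the metastore, so it can be queried from Spark SQL and should show up in HUE as well. A minimal check, assuming the same database.df name and DATA_DATE_PART partition column as above:

# list the partitions Spark registered for the table
spark.sql("SHOW PARTITIONS database.df").show()
# query a single date partition
spark.sql("SELECT ID FROM database.df WHERE DATA_DATE_PART = '{}'".format(date)).show()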