Pyspark: Parquet tables visible in SQL?


I am fairly new to PySpark/Hive and I have a problem:

I have a dataframe and want to write it as a partitioned table to HDFS. So far, I've done that via:

# Create the Hive table once, then insert the dataframe's rows into a partition
spark.sql('''
    CREATE EXTERNAL TABLE database.df(
        ID STRING
    )
    PARTITIONED BY (
        DATA_DATE_PART STRING
    )
    STORED AS PARQUET
    LOCATION 'hdfs://path/file'
''')
df.createOrReplaceTempView("df")
spark.sql('''
    INSERT INTO database.df PARTITION(DATA_DATE_PART = '{}')
    SELECT ID
    FROM df
'''.format(date))

But since my dataframes keep growing and I don't want to define every column by hand, I figured there must be a better solution, so I tried:

df.write.mode('overwrite').partitionBy('DATA_DATE_PART').parquet("/path/file")

However, I cannot access a table written this way via spark.sql, nor do I see it in my HUE browser. I can see the files via the HDFS shell, though: hdfs dfs -ls /path/. So my question is: why is that? I've read that Parquet files can be special when read with SQL, but my first script works fine and its tables are visible everywhere.
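For what it's worth, I can still read the data back by pointing Spark at the path directly (a minimal sketch, assuming the same /path/file location as above); it just never shows up as a table:

df_back = spark.read.parquet("/path/file")   # reads the Parquet files as a plain dataframe, no metastore table involved
df_back.filter(df_back.DATA_DATE_PART == date).show()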

CodePudding user response:

You just need to use the saveAsTable function for that (doc). By default it stores the data in the default warehouse location, but you can use the path option to redefine it and make the table "unmanaged" (see this doc for more details). Just use the following code:

df.write.mode('overwrite').partitionBy('DATA_DATE_PART') \
  .format("parquet") \
  .option("path", "/path/file") \
  .saveAsTable("database.df")
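
After that, the table is registered in the metastore, so it can be queried just like the DDL-created one. A quick usage sketch (the partition value below is only an example):

# Query the newly registered table through the metastore
spark.sql("SELECT * FROM database.df WHERE DATA_DATE_PART = '2021-03-06'").show()

# Or load it by name without writing SQL
spark.read.table("database.df").printSchema()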