I'm creating a table in AWS Glue using a spark job orchestrated by Airflow, it reads from a json and writes a table, the command I use within the job is the following:
spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")
The odd thing here is that I have other tables created using the same job (with different names) but they are created without problems, e.g. they have the location
s3://bucket_name/databases/my_db/my_perfectly_created_table
there is exactly one table that creates itself with this location:
s3://bucket_name/databases/my_db/my_problematic_table-__PLACEHOLDER__
I don't know where that -__PLACEHOLDER__
is coming from. I already tried deleting the table and recreating it but it always does the same thing on this exact table. The data is in parquet format in the path:
s3://bucket_name/databases/my_db/my_problematic_table
so I know the problem is just creating the table correctly because all I get is a col (array<string>)
when trying to query it in Athena (as there is no data in /my_problematic_table-__PLACEHOLDER__
).
Have any of you guys dealt with this before?
CodePudding user response:
Upon closer inspection in AWS glue, this specific problematic_table
had the following config, specific for CSV files and custom-delimiters:
Input Format org.apache.hadoop.mapred.SequenceFileInputFormat
Output Format org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Serde serialization library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
while my other tables had the config specific for parquet:
Input Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output Format org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
I tried to create the table forcing the config for parquet with the following command:
val path = "s3://bucket_name/databases/my_db/my_problematic_table/"
val my_table = spark.read.format("parquet").load(path)
val ddlSchema = my_table.toDF.schema.toDDL
spark.sql(s"""
|CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table($ddlSchema)
|ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
|STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
|OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
|LOCATION '$path'
|""".stripMargin
)
but it threw the following error:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column
so the problem was the naming of those columns, "1", "2" & "3" within that struct.
Given that this struct did not contain valuable info I ended up dropping it and creating the table again. now it works like a charm and it has the correct (parquet) config in glue.
Hope this helps anyone