The ingestion of a Hive database is very slow. To reduce the ingestion time, I'm ingesting into a temp database, then doing `hdfs dfs -rmdir /path/to/original_db` followed by `hdfs dfs -mv /path/of/temp_db /path/to/original_db`. Now I need to recreate the tables.
I plan to do a `DROP DATABASE temp_db CASCADE`, then recreate the tables created by Spark by adapting the statement generated by `SHOW CREATE TABLE temp_db.one_table`. I'll need to do this for each table, replacing the temp_db location with the original location.
Is there a better way to do it? Maybe I can just rename the DB and set a new location for the tables.
CodePudding user response:
No answer here, so here is my solution.
The approach I outlined above worked fine. With it, even queries that started during the data update keep working; YARN takes care of any hiccups.
Here are the commands I used when importing a new version of `/path/mydb`. I'm using external Hive tables.
- Import the new data into `mydb_landing`. The tables are created with this PySpark command:

```python
df.write.option("path", table_data_path) \
    .format("orc") \
    .mode("overwrite") \
    .saveAsTable(".".join(['mydb_landing', table_name]))
```
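The `saveAsTable` call above needs a live SparkSession and a DataFrame, so here is a dry-run sketch of just the naming logic: how each table's qualified name and data path line up under the landing root. The table names are hypothetical, and the actual write is left as a comment:

```python
landing_db = 'mydb_landing'
landing_root = '/path/mydb_landing'
tables = ['one_table', 'other_table']  # hypothetical table names

for table_name in tables:
    # Each table's data lives directly under the landing root.
    table_data_path = '{}/{}'.format(landing_root, table_name)
    qualified_name = '.'.join([landing_db, table_name])
    # With a SparkSession, this is where the actual write would go:
    # df.write.option("path", table_data_path).format("orc") \
    #     .mode("overwrite").saveAsTable(qualified_name)
    print(qualified_name, '->', table_data_path)
```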
- I already have all the table names, so I collect all the CREATE statements:

```python
create_stmts = [
    spark.sql(
        'show create table {schema}.{table_name}'.format(schema=base_dest_landing, table_name=table_name)
    ).first().createtab_stmt
    for table_name in tables
]
```
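The plain `stmt.replace('_landing', '')` used further down rewrites both the schema qualifier and the `LOCATION` path in one pass, which is exactly what the directory rename requires. If you prefer something more explicit, a targeted rewrite could look like this sketch (the helper name and the sample statement are my own, not produced by Spark):

```python
def retarget_stmt(stmt, landing_db, target_db, landing_root, target_root):
    """Point a SHOW CREATE TABLE statement at the target db and the
    renamed data directory."""
    return (stmt
            .replace('`{}`'.format(landing_db), '`{}`'.format(target_db))  # backticked schema
            .replace('{}.'.format(landing_db), '{}.'.format(target_db))    # plain schema qualifier
            .replace(landing_root, target_root))                           # LOCATION path

stmt = ("CREATE EXTERNAL TABLE `mydb_landing`.`one_table` (`id` INT) "
        "STORED AS ORC LOCATION 'hdfs:///path/mydb_landing/one_table'")
print(retarget_stmt(stmt, 'mydb_landing', 'mydb', '/path/mydb_landing', '/path/mydb'))
```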
- Drop the target database in Hive:

```python
spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(base_dest))
```
- Since it has only external tables, I now delete the data from HDFS myself:

```shell
hdfs dfs -rm -r -f -skipTrash "/path/mydb"
```
- Move `/path/mydb_landing` to `/path/mydb`. This is a very quick operation, basically a rename:

```shell
hdfs dfs -mv '/path/mydb_landing' '/path/mydb'
```

- Now I must recreate the database pointing to the "new" location:

```sql
CREATE DATABASE mydb LOCATION '/path/mydb'
```
- Recreate the tables. This guarantees that I pick up any schema update:

```python
for stmt in create_stmts:
    create_stmt = stmt.replace('_landing', '')
    spark.sql(create_stmt)
```
- Delete the landing database from Hive:

```python
spark.sql('DROP DATABASE IF EXISTS mydb_landing CASCADE')
```
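Since the destructive steps (drop, delete, move) are irreversible, it can help to assemble the whole sequence as data first and only then execute it. This dry-run sketch only builds the command strings; the function and variable names are mine, not part of any API:

```python
def build_swap_plan(db, landing_db, data_root, landing_root, create_stmts):
    """Return ordered (kind, command) pairs that swap the landing copy
    of the database in for the live one."""
    plan = [
        ('sql',   'DROP DATABASE IF EXISTS {} CASCADE'.format(db)),
        ('shell', 'hdfs dfs -rm -r -f -skipTrash "{}"'.format(data_root)),
        ('shell', "hdfs dfs -mv '{}' '{}'".format(landing_root, data_root)),
        ('sql',   "CREATE DATABASE {} LOCATION '{}'".format(db, data_root)),
    ]
    # Recreate every table against the renamed location.
    plan += [('sql', stmt.replace('_landing', '')) for stmt in create_stmts]
    plan.append(('sql', 'DROP DATABASE IF EXISTS {} CASCADE'.format(landing_db)))
    return plan

plan = build_swap_plan(
    'mydb', 'mydb_landing', '/path/mydb', '/path/mydb_landing',
    ["CREATE EXTERNAL TABLE mydb_landing.t (id INT) LOCATION '/path/mydb_landing/t'"])
for kind, cmd in plan:
    print(kind, cmd)
```

Reviewing (or logging) the plan before running it makes the ordering mistakes obvious, e.g. dropping the database before collecting the CREATE statements.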
This Impala blog post has some nice ideas using Hive views.