The ingestion of a Hive database is very slow. To reduce the ingestion time, I'm ingesting into a temp database, then doing `hdfs dfs -rmdir /path/to/original_db` followed by `hdfs dfs -mv /path/of/temp_db /path/to/original_db`. Now I need to recreate the tables.
I plan to do a `DROP DATABASE temp_db CASCADE`, then recreate the tables created by Spark by adapting the statement generated by `SHOW CREATE TABLE temp_db.one_table`. I'll need to do this for each table, replacing the temp_db location with the original location.
Is there a better way to do it? Maybe I can just rename the DB and set a new location for the tables.
CodePudding user response:
No answer here, so here is my solution.
The approach I outlined above worked fine. With it, even queries that started during the data update keep working; YARN takes care of any hiccups.
Here are the commands I used when importing a new version of `/path/mydb`. I'm using external Hive tables.
- Import the new data into `mydb_landing`. The tables are created with this PySpark command:

```python
df.write.option("path", table_data_path) \
    .format("orc") \
    .mode("overwrite") \
    .saveAsTable(".".join(['mydb_landing', table_name]))
```
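The `saveAsTable` call above needs a live SparkSession and a DataFrame, so here is a dry-run sketch of just the naming logic: how each table's qualified name and data path line up under the landing root. The table names are hypothetical, and the actual write is left as a comment:

```python
landing_db = 'mydb_landing'
landing_root = '/path/mydb_landing'
tables = ['one_table', 'other_table']  # hypothetical table names

for table_name in tables:
    # Each table's data lives directly under the landing root.
    table_data_path = '{}/{}'.format(landing_root, table_name)
    qualified_name = '.'.join([landing_db, table_name])
    # With a SparkSession, this is where the actual write would go:
    # df.write.option("path", table_data_path).format("orc") \
    #     .mode("overwrite").saveAsTable(qualified_name)
    print(qualified_name, '->', table_data_path)
```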
- I already have all the table names, so I collect all the CREATE statements:

```python
create_stmts = [
    spark.sql(
        'show create table {schema}.{table_name}'.format(schema=base_dest_landing, table_name=table_name)
    ).first().createtab_stmt
    for table_name in tables
]
```
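The plain `stmt.replace('_landing', '')` used further down rewrites both the schema qualifier and the `LOCATION` path in one pass, which is exactly what the directory rename requires. If you prefer something more explicit, a targeted rewrite could look like this sketch (the helper name and the sample statement are my own, not produced by Spark):

```python
def retarget_stmt(stmt, landing_db, target_db, landing_root, target_root):
    """Point a SHOW CREATE TABLE statement at the target db and the
    renamed data directory."""
    return (stmt
            .replace('`{}`'.format(landing_db), '`{}`'.format(target_db))  # backticked schema
            .replace('{}.'.format(landing_db), '{}.'.format(target_db))    # plain schema qualifier
            .replace(landing_root, target_root))                           # LOCATION path

stmt = ("CREATE EXTERNAL TABLE `mydb_landing`.`one_table` (`id` INT) "
        "STORED AS ORC LOCATION 'hdfs:///path/mydb_landing/one_table'")
print(retarget_stmt(stmt, 'mydb_landing', 'mydb', '/path/mydb_landing', '/path/mydb'))
```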
- Drop the target database in Hive:

```python
spark.sql('DROP DATABASE IF EXISTS {} CASCADE'.format(base_dest))
```
- Since it has only external tables, I now delete the data from HDFS myself:

```shell
hdfs dfs -rm -r -f -skipTrash "/path/mydb"
```
- Move `/path/mydb_landing` to `/path/mydb`. This is a very quick operation, basically a rename:

```shell
hdfs dfs -mv '/path/mydb_landing' '/path/mydb'
```

- Now I must recreate the database pointing to the "new" location:

```sql
CREATE DATABASE mydb LOCATION '/path/mydb'
```
- Recreate the tables. This guarantees that I pick up any schema update:

```python
for stmt in create_stmts:
    create_stmt = stmt.replace('_landing', '')
    spark.sql(create_stmt)
```
- Delete the landing database from Hive:

```python
spark.sql('DROP DATABASE IF EXISTS mydb_landing CASCADE')
```
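Since the destructive steps (drop, delete, move) are irreversible, it can help to assemble the whole sequence as data first and only then execute it. This dry-run sketch only builds the command strings; the function and variable names are mine, not part of any API:

```python
def build_swap_plan(db, landing_db, data_root, landing_root, create_stmts):
    """Return ordered (kind, command) pairs that swap the landing copy
    of the database in for the live one."""
    plan = [
        ('sql',   'DROP DATABASE IF EXISTS {} CASCADE'.format(db)),
        ('shell', 'hdfs dfs -rm -r -f -skipTrash "{}"'.format(data_root)),
        ('shell', "hdfs dfs -mv '{}' '{}'".format(landing_root, data_root)),
        ('sql',   "CREATE DATABASE {} LOCATION '{}'".format(db, data_root)),
    ]
    # Recreate every table against the renamed location.
    plan += [('sql', stmt.replace('_landing', '')) for stmt in create_stmts]
    plan.append(('sql', 'DROP DATABASE IF EXISTS {} CASCADE'.format(landing_db)))
    return plan

plan = build_swap_plan(
    'mydb', 'mydb_landing', '/path/mydb', '/path/mydb_landing',
    ["CREATE EXTERNAL TABLE mydb_landing.t (id INT) LOCATION '/path/mydb_landing/t'"])
for kind, cmd in plan:
    print(kind, cmd)
```

Reviewing (or logging) the plan before running it makes the ordering mistakes obvious, e.g. dropping the database before collecting the CREATE statements.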
This Impala blog post has some nice ideas using Hive views.