I am working on an audit process to delete empty Hive databases. I have a large number of databases that I need to go through and would like to use a shell script (.sh) in Linux that can run hive -e
queries to identify empty databases and list them in some output file or log (wondering if a .txt file is an option?). I would then send this list to our admins to "drop" those empty databases. All of our databases follow the exact same naming convention: "environment" and "area" are always used... only "state" varies. >>> environment_area_<state>
Right now I am using the following queries to get the job done but it's very manual and very slow... I would end up spending a lot of time on the Linux command line.
I connect to Hive first in PuTTY, once connected, I run:
show databases;
use environment_area_<state>;
show tables;
If no tables show in the database, I add it to my list of databases that need to be deleted. I run the "use" and "show tables" queries over and over for every database.
As you can tell, this is a very time consuming approach and creating a shell script would really help.
I have searched online and watched a number of YouTube tutorials, but have not come across a use case that would help me out. Hoping someone much more experienced with shell scripts could help me get beyond #!/bin/bash
followed by the queries I listed above.
CodePudding user response:
Keeping hive answering commands in the background may improve performance significantly:
#!/usr/bin/env bash
tempdir=$(mktemp -d)
# Cleanup at end of execution
trap 'rm -fr -- "$tempdir";exit' EXIT INT
hivein="$tempdir/hivein"
hiveout="$tempdir/hiveout"
mkfifo "$hivein" "$hiveout"
# Prepare file descriptors IO to talk to hive
exec 3<>"$hivein"
exec 4<>"$hiveout"
# Launch hive in the background
hive -S <&3 >&4 &
# Initialise hive
printf '%s\n' 'set hive.cli.print.header=false;' >&3
# Wait for hive response and get databases list
printf '%s\n' "SHOW DATABASES LIKE 'environment_area_*';" >&3
mapfile -u 4 -t databases
empty_databases=()
for db in "${databases[@]}"; do
printf 'USE %s; SHOW TABLES;\n' "$db" >&3
mapfile -u 4 -t tables
tbl_count="${#tables[@]}"
printf 'Database %s contains %d tables.\n' "$db" "$tbl_count"
if [ "$tbl_count" -eq 0 ]; then
# record empty db
empty_databases =("$db")
fi
done
# Close the hive-cli in case closing the file descriptors is not enough
printf '%s\n' '!exit' >&3
printf '%s\n' "${empty_databases[@]}" >empty_databases_list.txt
CodePudding user response:
To start with something you can modify this script. I did not check it. maybe show tables
returns some header or extra newline, then modify script accordingly (wc -l
counts newlines in output).
Script:
#!/bin/bash
for db in $(hive -S -e "show databases;")
do
tbl_count=$(hive -S -e "use $db; show tables;" | wc -l)
echo "Database $db contains $tbl_count tables."
if [ ${tbl_count} -eq 0 ]; then
# Add db name to the file
echo "$db" >> empty_databases_list.txt
# Do something else, for example drop db, etc
fi
done