Home > Blockchain >  Shell Script to find empty Hive databases
Shell Script to find empty Hive databases

Time:09-21

I am working on an audit process to delete empty Hive databases. I have a large number of databases that I need to go through and would like to use a shell script (.sh) in Linux that can run hive -e queries to identify empty databases and list them in some output file or log (wondering if a .txt file is an option?). I would then send this list to our admins to "drop" those empty databases. All of our databases follow the exact same naming convention: "environment" and "area" are always used... only "state" varies. >>> environment_area_<state>

Right now I am using the following queries to get the job done but it's very manual and very slow... I would end up spending a lot of time on the Linux command line.

I connect to Hive first in PuTTY, once connected, I run:

show databases;
use environment_area_<state>;
show tables;

If no tables show in the database, I add it to my list of databases that need to be deleted. I run the "use" and "show tables" queries over and over for every database.

As you can tell, this is a very time consuming approach and creating a shell script would really help.

I have searched online and watched a number of YouTube tutorials, but have not come across a use case that would help me out. Hoping someone much more experienced with shell scripts could help me get beyond #!/bin/bash followed by the queries I listed above.

CodePudding user response:

Keeping hive answering commands in the background may improve performance significantly:

#!/usr/bin/env bash

tempdir=$(mktemp -d)
# Cleanup at end of execution
trap 'rm -fr -- "$tempdir";exit' EXIT INT

hivein="$tempdir/hivein"
hiveout="$tempdir/hiveout"
mkfifo "$hivein" "$hiveout"

# Prepare file descriptors IO to talk to hive
exec 3<>"$hivein"
exec 4<>"$hiveout"

# Launch hive in the background
hive -S <&3 >&4 &

# Initialise hive
printf '%s\n' 'set hive.cli.print.header=false;' >&3

# Wait for hive response and get databases list
printf '%s\n' "SHOW DATABASES LIKE 'environment_area_*';" >&3
mapfile -u 4 -t databases

empty_databases=()

for db in "${databases[@]}"; do
  printf 'USE %s; SHOW TABLES;\n' "$db" >&3
  mapfile -u 4 -t tables
  tbl_count="${#tables[@]}"
  printf 'Database %s contains %d tables.\n' "$db" "$tbl_count"

  if [ "$tbl_count" -eq 0 ]; then
    # record empty db
    empty_databases =("$db")
  fi
done

# Close the hive-cli in case closing the file descriptors is not enough
printf '%s\n' '!exit' >&3

printf '%s\n' "${empty_databases[@]}" >empty_databases_list.txt

CodePudding user response:

To start with something you can modify this script. I did not check it. maybe show tables returns some header or extra newline, then modify script accordingly (wc -l counts newlines in output).

Script:

#!/bin/bash

for db in $(hive -S -e "show databases;") 
do
   tbl_count=$(hive -S -e "use $db; show tables;" | wc -l)
   echo "Database $db contains $tbl_count tables."
   
   if [ ${tbl_count} -eq 0 ]; then
     # Add db name to the file
     echo "$db" >> empty_databases_list.txt 
     # Do something else, for example drop db, etc
   fi
done
  • Related