I have a directory structure that looks like this on my HDFS system:
/some/path
├─ report01
│  ├─ file01.csv
│  ├─ file02.csv
│  ├─ file03.csv
│  └─ lot_of_other_non_csv_files
├─ report02
│  ├─ file01.csv
│  ├─ file02.csv
│  ├─ file03.csv
│  ├─ file04.csv
│  └─ lot_of_other_non_csv_files
└─ report03
   ├─ file01.csv
   ├─ file02.csv
   └─ lot_of_other_non_csv_files
I would like to copy all CSV files to my local system while keeping the directory structure.
I tried hdfs dfs -copyToLocal /some/path/report*, but that copies a lot of unnecessary (and quite large) files that I don't want.
I also tried hdfs dfs -copyToLocal /some/path/report*/file*.csv, but this does not preserve the directory structure, and HDFS complains that the files already exist when it tries to copy the files from the report02 folder.
Is there a way to get only files matching a specific pattern while still keeping the original directory structure?
CodePudding user response:
Since there doesn't seem to be any solution directly implemented in Hadoop, I ended up writing my own bash script:
#!/bin/bash

# regex patterns of the files to get (joined with "|" and passed to egrep)
TO_GET=("\.csv$" "\.png$")
# regex patterns of files/directories to avoid
TO_AVOID=("_temporary")

# function to join an array with a specified separator
# usage: join_arr ";" "${array[@]}"
join_arr() {
    local IFS="$1"
    shift
    echo "$*"
}
if (($# != 2))
then
    echo "There should be two parameters (path of the directory to get and destination)."
else
    # ensure that the provided path ends with a slash
    [[ "$1" != */ ]] && path="$1/" || path="$1"
    echo "Path to copy: $path"

    # ensure that the provided destination ends with a slash and append the result directory name
    [[ "$2" != */ ]] && dest="$2/" || dest="$2"
    dest="$dest$(basename "$path")/"
    echo "Destination: $dest"

    # get the names of all files matching the patterns, relative to $path
    echo -n "Exploring path to find matching files... "
    readarray -t files < <(hdfs dfs -ls -R "$path" | egrep -v "$(join_arr "|" "${TO_AVOID[@]}")" | egrep "$(join_arr "|" "${TO_GET[@]}")" | awk '{print $NF}' | cut -c $((${#path} + 1))-)
    echo "Done!"

    # check that at least one file was found
    ((${#files[@]} == 0)) && echo "No files matching the pattern."

    # get the files one by one
    for file in "${files[@]}"
    do
        path_and_file="$path$file"
        dest_and_file="$dest$file"
        # make sure the directory exists on the local file system
        mkdir -p "$(dirname "$dest_and_file")"
        # copy each file in a separate process so the transfers run in parallel
        (hdfs dfs -copyToLocal -f "$path_and_file" "$dest_and_file" && echo "$file") &
    done
    # wait for all copies to finish
    wait
fi
You can call the script like this:
$ script.sh "/some/hdfs/path/folder_to_get" "/some/local/path"
The script will create a directory folder_to_get in /some/local/path containing all the CSV and PNG files, respecting the directory structure.
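For illustration, with the directory structure from the question, a run such as script.sh "/some/path" "/some/local/dest" (the destination here is just an example) should end up with a local tree along these lines, since only the CSV files match the default patterns:
/some/local/dest/path
├─ report01
│  ├─ file01.csv
│  ├─ file02.csv
│  └─ file03.csv
├─ report02
│  ├─ file01.csv
│  ├─ file02.csv
│  ├─ file03.csv
│  └─ file04.csv
└─ report03
   ├─ file01.csv
   └─ file02.csv
The top-level local directory is named after the last component of the HDFS path, and only the matching files are copied.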
Note: If you want to get files other than CSV and PNG, just modify the TO_GET variable at the top of the script. You can also modify the TO_AVOID variable to exclude directories whose content you don't want, even if they contain CSV or PNG files.
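For example, to also grab Parquet files and to skip anything whose path contains "backup" (both of these extra patterns are purely illustrative, not part of the original script), the two arrays could be changed to:
TO_GET=("\.csv$" "\.png$" "\.parquet$")
TO_AVOID=("_temporary" "backup")
Each array is joined with "|" and passed to egrep, so any extended regular expression works as an entry.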