Get specific files while keeping directory structure from HDFS

I have a directory structure looking like that on my HDFS system:

/some/path
  ├─ report01
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    ├─ file03.csv
  │    └─ lot_of_other_non_csv_files
  ├─ report02
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    ├─ file03.csv
  │    ├─ file04.csv
  │    └─ lot_of_other_non_csv_files
  └─ report03
       ├─ file01.csv
       ├─ file02.csv
       └─ lot_of_other_non_csv_files

I would like to copy to my local system all CSV files while keeping the directory structure.

I tried hdfs dfs -copyToLocal /some/path/report*, but that also copies a lot of unnecessary (and quite large) non-CSV files that I don't want.

I also tried hdfs dfs -copyToLocal /some/path/report*/file*.csv, but that does not preserve the directory structure: every file is copied into the same destination folder, so HDFS complains that the files already exist as soon as it reaches the files from report02.

Is there a way to get only files matching a specific pattern while still keeping the original directory structure?

CodePudding user response:

Since there doesn't seem to be anything implemented directly in Hadoop for this, I ended up writing my own bash script:

#!/bin/bash

# regex patterns (for egrep) of files to get
TO_GET=("\.csv$" "\.png$")
# regex patterns of files/directories to avoid
TO_AVOID=("_temporary")

# function to join an array by a specified separator:
# usage: join_arr ";" "${array[@]}"
join_arr() {
  local IFS="$1"
  shift
  echo "$*"
}

if (($# != 2))
then
    echo "There should be two parameters (path of the directory to get and destination)."
else
    # ensure that the provided path ends with a slash
    [[ "$1" != */ ]] && path="$1/" || path="$1"
    echo "Path to copy: $path"
    # ensure that the provided destination ends with a slash and append result directory name
    [[ "$2" != */ ]] && dest="$2/" || dest="$2"
    dest="$dest$(basename $path)/"
    echo "Destination: $dest"
    # get name of all files matching the patterns
    echo -n "Exploring path to find matching files... "
    readarray -t files < <(hdfs dfs -ls -R "$path" | egrep -v "$(join_arr "|" "${TO_AVOID[@]}")" | egrep "$(join_arr "|" "${TO_GET[@]}")" | awk '{print $NF}' | cut -c $((${#path} + 1))-)
    echo "Done!"
    # check if at least one file found
    [ -z "$files" ]  && echo "No files matching the patern."
    # get files one by one
    for file in "${files[@]}"
    do
        path_and_file="$path$file"
        dest_and_file="$dest$file"
        # make sure the directory exist on the local file system
        mkdir -p "$(dirname "$dest_and_file")"
        # get file in a separate process to be able to execute the queries in parallel
        (hdfs dfs -copyToLocal -f "$path_and_file" "$dest_and_file" && echo "$file") &
    done
    # wait for all queries to be finished
    wait
fi
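
The join_arr helper simply turns each pattern array into an egrep alternation, which is how the TO_GET and TO_AVOID patterns are combined into a single filter. For instance, with the patterns defined above it produces:

$ join_arr "|" "\.csv$" "\.png$"
\.csv$|\.png$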

You can call the script like this:

$ script.sh "/some/hdfs/path/folder_to_get" "/some/local/path"

The script will create a directory folder_to_get in /some/local/path with all CSV and PNG files, respecting the directory structure.
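
For instance, with the example layout from the question, running script.sh "/some/path" "/some/local/path" should roughly produce the following local tree (only the CSV files are copied, and the top-level directory name is the basename of the HDFS path):

/some/local/path/path
  ├─ report01
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    └─ file03.csv
  ├─ report02
  │    ├─ file01.csv
  │    ├─ file02.csv
  │    ├─ file03.csv
  │    └─ file04.csv
  └─ report03
       ├─ file01.csv
       └─ file02.csv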

Note: If you want to get files other than CSV and PNG, just modify the TO_GET variable at the top of the script. You can also modify the TO_AVOID variable to filter out directories that you don't want to scan even if they contain CSV or PNG files; see the example below.
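
As an illustration (the extra entries below are hypothetical; each entry is an extended regular expression passed to egrep), a variant that also fetches Parquet files and skips staging directories could look like this:

# regex patterns (for egrep) of files to get
TO_GET=("\.csv$" "\.png$" "\.parquet$")
# regex patterns of files/directories to avoid (".staging" is a hypothetical extra entry)
TO_AVOID=("_temporary" "\.staging")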
