Find same file with same name but in a different directory structure

Time:06-09

I have a master directory A with high-resolution images (around 100 GB) in various subdirectories. I also have a selection of those images (same file names) at lower resolution in another directory B, which has different subdirectories (a few thousand files).

I would like to get a copy of directory B's structure, but with each file replaced by its high-resolution version. File size could serve as a proxy for resolution, since each file has only one match in directory A.

I have some experience with bash scripting, but I usually start from tutorials and I can't find any for this. Any pointers are appreciated!

CodePudding user response:

It is somewhat unclear from the question why you need to consider file sizes or resolutions in the script. I’m going to assume that (1) file names are unique across the entire (sub)directory structure under both A and B and (2) A always contains an equal- or higher-resolution version of each image, some of which have thumbnails (matched by file name) under B. An outline could look as follows:

replace_files_by_name() {
  local -r dir_A="$1"  # full size ("source")
  local -r dir_B="$2"  # thumbnails ("index")
  local -r dir_C="$3"  # full size copy by index ("destination")
  local path

  # Create an index of file names and paths under $dir_A
  local -A path_index  # maps file names to paths under $dir_A
  while IFS= read -r path; do
    path_index["${path##*/}"]="$path"
  done < <(find "$dir_A" -type f)

  # Make a recursive copy of $dir_B called $dir_C.
  echo cp -a --reflink "$dir_B" "$dir_C"
  cp -a --reflink "$dir_B" "$dir_C"

  # Replace each file under $dir_C with its counterpart from $dir_A.
  find "$dir_C" -type f | while IFS= read -r path; do
    echo cp -a --reflink "${path_index["${path##*/}"]}" "$path"
    cp -a --reflink "${path_index["${path##*/}"]}" "$path"
  done
}
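
Assumption (1) can be checked before running the function. The following is only a minimal sketch; find_duplicate_names is a hypothetical helper name that is not part of the outline above, and it assumes file names contain no newline characters:

```shell
# Hypothetical helper: print every file name that occurs more than once
# anywhere under the directory given as $1.
find_duplicate_names() {
  # Strip everything up to the last slash, then keep names seen 2+ times.
  find "$1" -type f | sed 's|.*/||' | sort | uniq -d
}

# Usage (prints nothing if all names are unique):
#   find_duplicate_names A
```

If this prints anything for A, the index built by the outline would silently keep only one of the matching paths, and something like file size would be needed to pick the right one.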

Side note 0: A bare --reflink means --reflink=always, so cp fails on a file system without copy-on-write (CoW) support. There you will have to drop --reflink (or use --reflink=auto, which falls back to a regular copy), at a considerable performance and space cost. This is a good reason to use a CoW-capable file system such as Btrfs or ZFS.

Side note 1: My outline skips all error checking and needs to be adjusted accordingly. (For example, what should happen when a file from C (B) is not found under A?)
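
One possible answer to that question is to leave the unmatched file in place and report it. The sketch below is an assumption about what you might want, not the only reasonable policy. replace_or_warn is a hypothetical name; it needs bash 4 or later for associative arrays, and it uses --reflink=auto instead of the bare --reflink so that it also runs on file systems without CoW support:

```shell
# Hypothetical variant of the replacement step (illustrative names).
# Replaces each file under $2 with the same-named file found under $1.
# Files without a counterpart are reported on stderr and left untouched.
replace_or_warn() {
  local -r dir_A="$1" dir_C="$2"
  local path name missing=0
  local -A path_index  # maps file names to paths under $dir_A

  while IFS= read -r path; do
    path_index["${path##*/}"]="$path"
  done < <(find "$dir_A" -type f)

  while IFS= read -r path; do
    name="${path##*/}"
    if [[ -n "${path_index["$name"]:-}" ]]; then
      cp -a --reflink=auto "${path_index["$name"]}" "$path"
    else
      printf 'no match under %s for %s\n' "$dir_A" "$path" >&2
      missing=$((missing + 1))
    fi
  done < <(find "$dir_C" -type f)

  # Return a non-zero status if at least one file was skipped.
  if ((missing > 0)); then return 1; fi
}
```

Called as replace_or_warn A C after the recursive copy of B to C, it returns status 1 when any thumbnail had no full-size counterpart, so a caller running under set -e stops instead of continuing with a partly replaced tree.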

Now let’s test the solution:

set -eu
mkdir -p ~/tmp/test
cd ~/tmp/test

# Create directories A and B and 5 different subdirectories in each.
mkdir -p A/{1..5}/ B/{a..e}/

# Place a file in each subdirectory.
# A and B contain different subdirectory names but same file names.
files=('one' 'two' 'three' 'four' 'five')
for dir in A B; do
  subdirs=("${dir}/"*)
  ((${#subdirs[@]} == ${#files[@]}))
  for ((i = 0; i < ${#files[@]}; ++i)); do
    touch "${subdirs[i]}/${files[i]}"
  done
done

##############################
replace_files_by_name A B C ##
##############################

rm -Rf ~/tmp/test  # cleanup

The test above will output (and also perform) the following:

cp -a --reflink B C
cp -a --reflink A/1/one C/a/one
cp -a --reflink A/2/two C/b/two
cp -a --reflink A/3/three C/c/three
cp -a --reflink A/4/four C/d/four
cp -a --reflink A/5/five C/e/five