bash: dividing a large directory into sub-folders


I have > 400,000 files in a shared folder

mol0.pdb
mol1.pdb
mol2.pdb
...
mol999.pdb
...
mol422222.pdb 

I need to divide this directory into 4 equal parts (by number of files, accepting that the last part may be slightly smaller than the rest), create an individual folder for each part (named after the initial folder with a part_N suffix), and copy each part into it. To do this I am trying to put together a simple bash workflow:

#!/bin/bash
home="$PWD"
project='ALL_pdb' # name of the folder with all pdb files
#############
input="${home}/${project}"
output="${home}/${project}_parts" # name of the folder with the divided files
# format of the input files
format='pdb'
# 1 - divide all files in the input into 4 equal parts

# 2 - then iterate over all files and copy each one to its subfolder
for lig in "${input}"/*."${format}"; do
    lig_name=$(basename "$lig" ".${format}")
    # mkdir "${output}_part_$i"
    # cp "$lig" "${output}_part_$i"
    # etc
done

What would be the best way to automate the division of the files and their transfer to the individual folders?

CodePudding user response:

The following code evenly dispatches the files into directories numbered from 1 to <number of parts>. The first directories may contain one more file than the last ones: for example, 10 files split into 4 parts gives part sizes of 3, 3, 2 and 2.

#!/bin/bash
project=ALL_pdb

files=("$project"/*.pdb)        # load all filepaths into a bash array
files_count=${#files[@]}

parts_count=4
parts_sizes=()

leftover=$(( files_count % parts_count ))
min_size=$(( files_count / parts_count ))
max_size=$(( min_size + 1 ))

# parts 1..leftover get max_size files, the remaining parts get min_size
p=0
while (( p < leftover ));    do parts_sizes[++p]=$max_size; done
while (( p < parts_count )); do parts_sizes[++p]=$min_size; done

for (( i = 0, p = 1; parts_sizes[p]; i += parts_sizes[p++] ))
do
    d="${project}_part_$p"
    mkdir -p "$d"

    printf '%s\0' "${files[@]:i:parts_sizes[p]}" |
    xargs -0 mv -t "$d/"         # GNU; use "cp -t" instead of "mv -t" to copy
#   xargs -0 -J {} cp {} "$d/"   # BSD alternative
done
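
After the script has run, a quick sanity check (assuming the ALL_pdb_part_N directory names produced above) is to count the files that ended up in each part:

for d in ALL_pdb_part_*; do
    # count the pdb files directly inside each part directory
    printf '%s: %s files\n' "$d" "$(find "$d" -maxdepth 1 -name '*.pdb' | wc -l)"
done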
Explanations

Because you're handling a lot of files, you are confronted with two problems:

  • Forking a cp command per file would be extremely slow.
  • Using a single cp command per target directory would fail with an "Argument list too long" error.

=> Working around that will require the use of xargs.
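
As a minimal sketch of that pattern (assuming GNU xargs and cp, and a hypothetical target directory dest/): printf streams the file names NUL-delimited, and xargs splits them into as few cp invocations as the argument-length limit allows.

# NUL-delimited names piped to xargs, which batches them onto "cp -t dest/"
printf '%s\0' "$project"/*.pdb | xargs -0 cp -t dest/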

Now, for deciding which file will go into which directory, the simplest would be to:

  • load all the filepaths into a bash array using a glob,
  • then do a few calculations for determining the size of each array slice,
  • then use parameter expansions on the array variable to get the filepaths belonging to each target directory (see the short example after this list).
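
To illustrate that last point, "${array[@]:offset:length}" expands to a slice of the array; a tiny self-contained example (illustrative values only):

files=(mol0.pdb mol1.pdb mol2.pdb mol3.pdb mol4.pdb)
# slice of 2 elements starting at index 1: mol1.pdb mol2.pdb
printf '%s\n' "${files[@]:1:2}"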