I have more than 400,000 files in a shared folder:
mol0.pdb
mol1.pdb
mol2.pdb
...
mol999.pdb
...
mol422222.pdb
I need to divide all of these files into 4 equal parts (by number of files; the last part may be slightly smaller than the rest), create a separate folder for each part (named after the initial folder with a part_N suffix), and copy each part into its folder. To do this, I am trying to write a simple bash workflow:
#!/bin/bash
home="$PWD"
project='ALL_pdb' # name of the folder with all pdb files
#############
input="${home}"/"${project}"
output="${home}"/"${project}"_parts # name of the folder with the divided files
# format of the inputs
format='pdb'
# 1- divide all files in the input into 4 equal parts
# 2- then iterate over all the files and copy each one to its subfolder
for lig in ${input}/*.${format}; do
lig_name=$(basename "$lig" .${format})
# mkdir $output_part_$i
# cp lig $output_part_$i
# etc
done
What would be the best way to automate the division of the files and their transfer to the individual folders?
Answer:
The following code will evenly dispatch the files into directories numbered from 1 to <number of parts>. The first directories might contain 1 more file than the last ones.
#!/bin/bash
project=ALL_pdb
files=("$project"/*.pdb)
files_count=${#files[@]}
parts_count=4
parts_sizes=()
leftover=$(( files_count % parts_count ))
min_size=$(( files_count / parts_count ))
max_size=$(( min_size + 1 ))
p=0
# the first $leftover parts get one extra file each
while (( p < leftover )); do parts_sizes[++p]=$max_size; done
while (( p < parts_count )); do parts_sizes[++p]=$min_size; done
# i is the offset of the current slice in the files array
for (( i = 0, p = 1; parts_sizes[p]; i += parts_sizes[p++] ))
do
    d="${project}_part_$p"
    mkdir -p "$d"
    printf '%s\0' "${files[@]:i:parts_sizes[p]}" |
    xargs -0 cp -t "$d/" # for GNU
    # xargs -0 -J {} cp {} "$d/" # for BSD (use instead of the line above)
done
Explanations
Because you're handling a lot of files, you are confronted with two problems:
- Forking a cp command per file will be extremely slow.
- Using a single cp command per target directory will fail with an "Argument list too long" error.
=> Working around that requires the use of xargs.
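To make the batching visible, here is a minimal sketch. The show_batches function and the dest/ directory are hypothetical names used only for illustration, and echo prints each cp command instead of running it. The -n 2 option caps every batch at two arguments so the effect shows up with a handful of names; without it, xargs packs as many arguments into each invocation as the system limit allows. printf is a bash builtin, so feeding it the whole list does not hit the "Argument list too long" limit.

```shell
#!/bin/bash
# Hypothetical demo: print the commands xargs would build from NUL-delimited input.
# -n 2 limits each batch to two file names, purely for demonstration.
show_batches() {
    printf '%s\0' "$@" | xargs -0 -n 2 echo cp -t dest/
}

show_batches mol0.pdb mol1.pdb mol2.pdb mol3.pdb mol4.pdb
```

With five names and batches of two, this prints three cp commands, the last one carrying the single remaining file.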
Now, for deciding which file will go into which directory, the simplest would be to:
- load all the filepaths into a bash array using a glob,
- then do a few calculations for determining the size of each array slice,
- then use parameter expansions on the array variable for getting the filepaths corresponding to each target directory.
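The slice calculation described above can be tried in isolation. This is only a sketch: slice_bounds is a hypothetical helper, not part of the script, and it prints the part number, the array offset, and the slice length for each part. The offset and length are the two values you would plug into "${files[@]:offset:length}".

```shell
#!/bin/bash
# Hypothetical helper: for TOTAL items split into PARTS parts,
# print "part offset length" on one line per part.
# The first (TOTAL % PARTS) parts receive one extra item.
slice_bounds() {
    local total=$1 parts=$2
    local min=$(( total / parts )) extra=$(( total % parts ))
    local offset=0 len p
    for (( p = 1; p <= parts; p++ )); do
        len=$(( min + (p <= extra ? 1 : 0) ))
        printf '%d %d %d\n' "$p" "$offset" "$len"
        offset=$(( offset + len ))
    done
}

slice_bounds 10 4
```

For 10 items in 4 parts this gives slices of 3, 3, 2 and 2 items starting at offsets 0, 3, 6 and 8.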