I'm trying to get rid of old asset versions. The file naming is strictly as follows:
<timestamp>_<constant><version><assetID>.zip<.extra>
for example, 20220101235959_FOOBAR0101234567.zip.done.
- <timestamp> is the datetime when the file was added to the folder.
- <constant> does not change within the folder being handled.
- <version> is a two-digit number starting from 00 and describes the version of the asset with the given <assetID>.
- <extra> is optional, so the extension can be .zip, .zip.done, or .zip.somethingelse.
An asset may exist with all three extensions, and the same ID and version may also appear under several different timestamps. In other words, multiple files can share an asset ID and version number while differing only in timestamp and extension.
The goal is to find the latest version of each asset ID and remove the older versions. The version number is what matters, not the timestamp.
Assets are all located in one folder without subfoldering.
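For reference, a single name splits apart with plain bash parameter expansion like this (just a sketch over the example name above, not part of the actual cleanup script):
# Sketch: decomposing one example name into its parts
f="20201229104919_FOOBAR0300040682.zip.done"
name=${f%%.zip*}          # 20201229104919_FOOBAR0300040682
timestamp=${name%%_*}     # 20201229104919
rest=${name#*_FOOBAR}     # 0300040682
version=${rest:0:2}       # 03
assetId=${rest:2}         # 00040682
extension=.zip${f#*.zip}  # .zip.done
echo "$timestamp $version $assetId $extension"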
Current solution
So far, this is how the goal can be reached:
#!/bin/bash
location="/home/user/FOOBAR"

echo "Deleting older files..."

# Associative array used to report which asset IDs had old versions removed
declare -A assetsRemoved

# The main loop which finds all the files in the folder
# (process substitution instead of a pipe so assetsRemoved survives the loop)
while read -r line; do
    # <timestamp>_FOOBAR<version><assetId><file-extensions>
    # 20201229104919_FOOBAR0300040682.zip.done

    # Separate assetId
    rest=${line#*'.zip'}
    # .done
    pos=$(( ${#line} - ${#rest} - 4 ))
    # 20201229104919_FOOBAR0300040682<^>.zip.done
    assetId=${line:pos-8:8}
    # 20201229104919_FOOBAR03<00040682>.zip.done

    # Find all files with the same assetId
    assets="$(find "$location" -maxdepth 1 -type f -name "*$assetId.zip*" -a -name "*FOOBAR*")"

    # Init loop variables
    max=-1
    mostRecent=""
    cleanedOld=0

    # Loop over all files with the same assetId
    for file in $assets
    do
        # Separate basename without extension
        basenameNoExt="${file%%.*}"
        # <20201229104919_FOOBAR0300040682>.zip.done

        # Separate the version number, 2 digits (force base 10 so "08"/"09" are not read as octal)
        iter=${basenameNoExt:${#basenameNoExt}-10:2}
        # 20201229104919_FOOBAR<03>00040682.zip.done

        if (( 10#$iter > max ))
        then
            max=$((10#$iter))
            if [[ -n $mostRecent ]]
            then
                rm "$mostRecent"*
                cleanedOld=1
            fi
            mostRecent=$basenameNoExt
        elif (( 10#$iter < max ))
        then
            [ -f "$file" ] && rm "$basenameNoExt"*
            cleanedOld=1
        fi
        # $iter == $max -> same asset with a different file extension, leave it be
    done

    if [[ $max -gt 0 && $cleanedOld -gt 0 ]]
    then
        assetsRemoved[$assetId]=$max
    fi
done < <(find "$location" -maxdepth 1 -type f -name "*.zip*" -a -name "*FOOBAR*")

for a in "${!assetsRemoved[@]}"; do
    echo "Cleaned asset $a from versions lower than ${assetsRemoved[$a]}"
done
The problem
This solution has a serious problem: it is slow. The outer find first gathers all the files; each iteration then determines the max version and deletes the older ones, so later iterations of the outer loop run the inner find-and-remove against assets that have already been handled or deleted.
The question
Is there a way to execute commands for each result of find before all the results have been gathered? Or is there some other, more efficient way to loop over the results? There are over 100k files to handle, and I assume that the wildcard rm scans them all again when looking for the files to delete, which would mean on the order of 100,000^2 passes through the files. Is there any way to prevent this?
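By "executing per result" I mean something along the lines of find's -exec, which runs the given command for each match as it is found; for example:
# Runs the command as soon as each match is found, instead of collecting all results first
find "$location" -maxdepth 1 -type f -name "*FOOBAR*.zip*" \
    -exec sh -c 'echo "would handle: $1"' _ {} \;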
Example
Consider a folder with the following files:
20191229104919_FOOBAR0001234567.zip
20191229104919_FOOBAR0001234567.zip.done
20191229104919_FOOBAR0001234567.zip.somethingelse
20191229104919_FOOBAR0087654321.zip
20191129104919_FOOBAR0087654321.zip.done
20191129104919_FOOBAR0087654321.zip.somethingelse
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20201229104919_FOOBAR0101234567.zip
20201229104919_FOOBAR0101234567.zip.done
20201229104919_FOOBAR0101234567.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse
The remaining files after cleaning would be:
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse
Notice:
The newest version is what matters. Files with the same asset ID and version but different timestamps and extensions must all be preserved.
Answer 1
Instead of the double loop, how about a 2-pass solution:
#!/bin/bash
location="/home/user/FOOBAR"

declare -A latestver    # associates the latest version number with each assetID

# pass 1: extract the latest version number for each assetID
for f in "$location"/*FOOBAR*.zip*; do
    tmp=${f%.zip*}         # remove the suffix ".zip*"
    ver=${tmp: -10:2}      # extract the version number
    id=${tmp: -8:8}        # extract the assetID
    # update the latest version number associated with the assetID
    if [[ -z ${latestver[$id]} ]] || (( 10#$ver > 10#${latestver[$id]} )); then
        latestver[$id]=$ver
    fi
done

# pass 2: if the file does not carry the latest version number of its assetID, remove it
for f in "$location"/*FOOBAR*.zip*; do
    tmp=${f%.zip*}         # remove the suffix ".zip*"
    id=${tmp: -8:8}        # extract the assetID
    ver=${latestver[$id]}  # expected latest version number
    if [[ $f != *FOOBAR$ver$id.zip* ]]; then
        # filename does not match, meaning the version number is older
        echo rm -- "$f"    # then remove the file
    fi
done
If the output looks good, drop "echo".
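To see what the two expansions pull out of a name, here is a standalone illustration (not part of the script above):
f="/home/user/FOOBAR/20201229104919_FOOBAR0300040682.zip.done"
tmp=${f%.zip*}                 # /home/user/FOOBAR/20201229104919_FOOBAR0300040682
echo "version: ${tmp: -10:2}"  # 03       (the space before -10 is required)
echo "assetID: ${tmp: -8:8}"   # 00040682 (the last 8 characters)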
Answer 2
Here's a pure bash-4 solution. It'll first determine the latest version for each asset (using a bash regex and filling a hash in a loop) and then delete the older assets using an extended glob:
#!/bin/bash
shopt -s extglob nullglob

location=assets
constant=FOOBAR

declare -A latest=()

# pass 1: record the latest version number per assetID
for f in "$location"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_"$constant"[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].zip*(.+([[:alnum:]]))
do
    [[ $f =~ _"$constant"([0-9]{2})([0-9]{8})(\.[[:alnum:]]+)+$ ]]
    version=${BASH_REMATCH[1]}
    asset=${BASH_REMATCH[2]}
    if [[ -z ${latest[$asset]+1} ]] || (( 10#$version > 10#${latest[$asset]} ))
    then
        latest[$asset]=$version
    fi
done

# pass 2: for each assetID, glob every file whose version is NOT the latest one and delete it
for asset in "${!latest[@]}"
do
    files_to_delete=( "$location"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_"$constant"!("${latest[$asset]}")"$asset".zip*(.+([[:alnum:]])) )
    (( ${#files_to_delete[@]} > 0 )) || continue
    printf '%q ' removing "${files_to_delete[@]}"; echo
    #rm -f -- "${files_to_delete[@]}"
done
remarks:
- In my code I opted for robust globs, but in fact you can use simpler/shorter ones as long as they can differentiate the targeted files from the unrelated ones. For example you might be able to replace the first glob with "$location"/+([0-9])_"$constant"+([0-9])+(.+([[:alnum:]])), or "$location"/*"$constant"*, or even "$location"/*; it all depends on the files/directories that are present in "$location/".
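As a quick sanity check of what the regex captures (an illustration only; the example path is made up):
constant=FOOBAR
f="assets/20201229104919_FOOBAR0300040682.zip.done"
if [[ $f =~ _"$constant"([0-9]{2})([0-9]{8})(\.[[:alnum:]]+)+$ ]]; then
    echo "version: ${BASH_REMATCH[1]}"   # 03
    echo "asset:   ${BASH_REMATCH[2]}"   # 00040682
fi
# In the deletion glob, !("${latest[$asset]}") matches any version except the
# latest one in that position, so only the older versions end up in the array.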