Find and execute line by line before final result in bash

Time:09-06

I'm trying to get rid of old asset versions. The file naming strictly follows this pattern:
<timestamp>_<constant><version><assetID>.zip<.extra>
for example, 202201012359_FOOBAR0101234567.zip.done.
<timestamp> is the date and time when the file was added to the folder.
<constant> does not change within the folder being handled.
<version> is a two-digit number starting from 00 that gives the version of the asset identified by <assetID>.
<extra> is optional, so the extension can be .zip, .zip.done, or .zip.somethingelse.
An asset may have all three extensions, and each can exist multiple times with different timestamps. In other words, an asset may have several files with the same ID and version number that differ only in timestamp.
The goal is to find the latest version of each asset ID and remove the older versions. The version number is what matters, not the timestamp.
All assets are located in one folder with no subfolders.
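
As a minimal sketch of the naming scheme above, the following bash function splits a file name into its parts with a regular expression. The function name parse_asset is hypothetical, and the 8-digit asset ID width is an assumption taken from the example file name rather than something the scheme states explicitly.

```shell
#!/bin/bash
# Hypothetical helper: split "<timestamp>_<constant><version><assetID>.zip<.extra>".
# Assumes a 2-digit version and an 8-digit asset ID, as in the example name.
# On success it sets the global variables timestamp, version, assetId and extra.
parse_asset() {
    local name=${1##*/}            # strip any leading directory
    local constant=${2:-FOOBAR}    # the per-folder constant
    local re='^([0-9]+)_'"$constant"'([0-9]{2})([0-9]{8})\.zip(\..+)?$'
    [[ $name =~ $re ]] || return 1 # not an asset file name
    timestamp=${BASH_REMATCH[1]}
    version=${BASH_REMATCH[2]}
    assetId=${BASH_REMATCH[3]}
    extra=${BASH_REMATCH[4]}       # empty when there is no extra extension
}

parse_asset "202201012359_FOOBAR0101234567.zip.done" &&
    echo "timestamp=$timestamp version=$version assetId=$assetId extra=$extra"
```

For the example name this prints timestamp=202201012359 version=01 assetId=01234567 extra=.done.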

Current solution

So far, this is how the goal can be reached:

#!/bin/bash
location="/home/user/FOOBAR"
echo "Deleting older files..."

# Declare variable to print the outcome of removed asset ID's
declare -A assetsRemoved

# The main loop which finds all the files in the folder
find $location -maxdepth 1 -type f -name "*.zip*" -a -name "*FOOBAR*" | while read line; do
    # <timestamp>_FOOBAR<iterator><assetId><file-extensions>
    # 20201229104919_FOOBAR0300040682.zip.done

    # Separate assetId
    rest=${line#*'.zip'}
    # .done
    pos=$(( ${#line} - ${#rest} - 4 ))
    # 20201229104919_FOOBAR0300040682<^>.zip.done
    assetId=${line:pos-8:8}
    # 20201229104919_FOOBAR03<00040682>.zip.done

    # Find all files with same assetId
    assets="$(find $location -maxdepth 1 -type f -name "*$assetId.zip*" -a -name "*FOOBAR*")"
    # Init loop variables
    max=-1
    mostRecent=""
    cleanedOld=0

    # Loop all files with same assetId
    for file in $assets
    do
        # Separate basename without extension
        basenameNoExt="${file%%.*}"
        # <20201229104919_FOOBAR0300040682>.zip.done
        # Separate iterator, 2 numbers
        iter=${basenameNoExt:${#basenameNoExt}-10:2}
        # 20201229104919_FOOBAR<03>00040682.zip.done
        if [[ $iter -gt $max ]]
        then
            max=$iter
            if [[ -n $mostRecent ]]
            then
                rm $mostRecent*
                cleanedOld=1
            fi
            mostRecent=$basenameNoExt
        elif [[ $iter -lt $max ]]
        then
            [ -f $file ] && rm $basenameNoExt* 
            cleanedOld=1
        fi
        # $iter == $max -> same asset with different file extension, leave to be
    done

    if [[ $max -gt 0  && cleanedOld -gt 0 ]] 
    then
        assetsRemoved[$assetId]=$max
    fi

done

for a in "${!assetsRemoved[@]}"; do
    echo "Cleaned asset $a from versions lower than ${assetsRemoved[$a]}"
done

The problem

This solution has a serious problem: it is slow. It first finds all the files, then takes one, works out the maximum version, and deletes the older versions along the way. The next iteration of the outermost find loop then runs the same find-and-remove commands for assets that may already have been handled or deleted.

The question

Is there a way to execute commands for each result of find before all the results have been gathered? Or is there some other, more efficient way to loop over the results? There are over 100k files to handle, and I assume that each wildcard rm scans through all of them when searching for the files to delete. That would mean on the order of 100,000^2 file comparisons. Is there any way to avoid this?
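
On the first part of the question: a find ... | while read pipeline already processes names as they arrive instead of waiting for find to finish, so the cost comes from the nested find and the wildcard rm. The pipe has one side effect worth noting: bash runs the while loop in a subshell, so anything stored in assetsRemoved is gone by the time the final report loop runs. Below is a minimal sketch of a loop that keeps its variables, assuming a find that supports -print0 (GNU and BSD find do). The scratch directory and the names seen and count are made up for the demonstration.

```shell
#!/bin/bash
# Scratch directory with a few sample names (demonstration only).
dir=$(mktemp -d)
touch "$dir"/20191229104919_FOOBAR0001234567.zip \
      "$dir"/20201229104919_FOOBAR0101234567.zip.done \
      "$dir"/20191229104919_FOOBAR0087654321.zip

declare -A seen=()   # assetID -> number of files seen for it
count=0

# Process substitution instead of a pipe: the loop runs in the current shell,
# so "seen" and "count" are still set after "done".
while IFS= read -r -d '' path; do
    base=${path%.zip*}       # drop ".zip" and anything after it
    id=${base: -8}           # last 8 characters: the asset ID
    seen[$id]=$(( ${seen[$id]:-0} + 1 ))
    count=$(( count + 1 ))
done < <(find "$dir" -maxdepth 1 -type f -name '*FOOBAR*.zip*' -print0)

echo "files: $count, distinct assets: ${#seen[@]}"
rm -r -- "$dir"
```

Each file is visited once and the per-asset bookkeeping is a hash lookup, which avoids rescanning the directory for every file.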

Example

Consider a folder with the following files:

20191229104919_FOOBAR0001234567.zip
20191229104919_FOOBAR0001234567.zip.done
20191229104919_FOOBAR0001234567.zip.somethingelse
20191229104919_FOOBAR0087654321.zip
20191129104919_FOOBAR0087654321.zip.done
20191129104919_FOOBAR0087654321.zip.somethingelse
20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20201229104919_FOOBAR0101234567.zip
20201229104919_FOOBAR0101234567.zip.done
20201229104919_FOOBAR0101234567.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse

The remaining files after cleaning would be:

20191129100000_FOOBAR0187654321.zip
20191229100000_FOOBAR0187654321.zip.done
20191229100000_FOOBAR0187654321.zip.somethingelse
20211229104919_FOOBAR0201234567.zip
20211229104919_FOOBAR0201234567.zip.done
20211229104919_FOOBAR0201234567.zip.somethingelse
20221229104919_FOOBAR0201234567.zip
20221229104919_FOOBAR0201234567.zip.done
20221229104919_FOOBAR0201234567.zip.somethingelse

Notice:
The newest version is what matters. Files with the same asset version and ID but different timestamps or extensions must be preserved.

CodePudding user response:

Instead of the double loop, how about a 2-pass solution:

#!/bin/bash

location="/home/user/FOOBAR"
declare -A latestver            # associates the latest version number with assetID

# pass 1: extract the latest version number for the assetID
for f in "$location"/*FOOBAR*.zip*; do
    tmp=${f%.zip*}              # remove the suffix ".zip*"
    ver=${tmp: -10:2}           # extract the version number
    id=${tmp: -8:8}             # extract the assetID
    (( 10#$ver >= 10#${latestver[$id]:-0} )) && latestver[$id]="$ver"
                                # update the latest version number associated with the assetID
                                # (">=" and the ":-0" default make sure version 00 is recorded too)
done

# pass 2: if the associated version number of the assetID does not match, remove the file
for f in "$location"/*FOOBAR*.zip*; do
    tmp=${f%.zip*}              # remove the suffix ".zip*"
    id=${tmp: -8:8}             # extract the assetID
    ver=${latestver[$id]}       # expected latest version number
    if [[ $f != *FOOBAR$ver$id.zip* ]]; then
                                # filename does not match, meaning the version number is older
        echo rm -- "$f"         # then remove the file
    fi
done

If the output looks good, drop "echo".
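
To check the two-pass idea against part of the example listing from the question before touching real data, a sketch along these lines may help. The function name prune_old_versions, its "delete" argument, and the scratch directory are assumptions for illustration. The logic follows the two passes above, except that it compares the extracted version directly and treats a missing hash entry as "no version seen yet".

```shell
#!/bin/bash
shopt -s nullglob   # an empty directory yields no loop iterations

# Hypothetical wrapper around the two-pass approach: prints the files it would
# remove, or removes them when the second argument is "delete".
prune_old_versions() {
    local dir=$1 mode=${2:-dry-run} f tmp ver id
    local -A latest=()
    for f in "$dir"/*FOOBAR*.zip*; do          # pass 1: highest version per ID
        tmp=${f%.zip*}; ver=${tmp: -10:2}; id=${tmp: -8}
        if [[ -z ${latest[$id]+set} ]] || (( 10#$ver > 10#${latest[$id]} )); then
            latest[$id]=$ver
        fi
    done
    for f in "$dir"/*FOOBAR*.zip*; do          # pass 2: drop everything older
        tmp=${f%.zip*}; ver=${tmp: -10:2}; id=${tmp: -8}
        [[ $ver == "${latest[$id]}" ]] && continue
        if [[ $mode == delete ]]; then rm -- "$f"; else echo "would remove $f"; fi
    done
}

# Scratch copy of part of the example listing.
dir=$(mktemp -d)
for n in 20191229104919_FOOBAR0001234567.zip 20201229104919_FOOBAR0101234567.zip.done \
         20211229104919_FOOBAR0201234567.zip 20221229104919_FOOBAR0201234567.zip.done \
         20191229104919_FOOBAR0087654321.zip 20191129100000_FOOBAR0187654321.zip; do
    touch "$dir/$n"
done

prune_old_versions "$dir" delete
remaining=$(ls "$dir")
rm -r -- "$dir"
echo "$remaining"
```

Only the version 02 files of asset 01234567 and the version 01 file of asset 87654321 should be left, whatever their timestamps or extensions.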

CodePudding user response:

Here's a pure bash-4 solution. It'll first determine the latest version for each asset (using a bash regex and filling a hash in a loop) and then delete the older assets using an extended glob:

#!/bin/bash
shopt -s extglob nullglob

location=assets
constant=FOOBAR

declare -A latest=()

for f in "$location"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_"$constant"[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].zip*(.+([[:alnum:]]))
do
    [[ $f =~ _"$constant"([0-9]{2})([0-9]{8})(\.[[:alnum:]]+)+$ ]]

    version=${BASH_REMATCH[1]}
    asset=${BASH_REMATCH[2]}

    if [[ -z ${latest[$asset]+1} ]] || (( 10#$version > 10#${latest[$asset]} ))
    then
        latest[$asset]=$version
    fi
done

for asset in "${!latest[@]}"
do
     files_to_delete=( "$location"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_"$constant"!("${latest[$asset]}")"$asset".zip*(.+([[:alnum:]])) )

     (( ${#files_to_delete[@]} > 0 )) || continue

     printf '%q ' removing "${files_to_delete[@]}"; echo
     #rm -f -- "${files_to_delete[@]}"
done

remarks:

  • In my code I opted for robust globs, but you can use simpler or shorter ones as long as they can tell the targeted files apart from unrelated ones. For example, you might be able to replace the first glob with "$location"/+([0-9])_"$constant"+([0-9])+(.+([[:alnum:]])) or "$location"/*"$constant"* or even "$location"/*; it all depends on the files and directories that are present in "$location/".
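
The !(pattern) extended glob used in the second loop matches anything except the given pattern, which is what selects the older versions. Here is a small sketch in a scratch directory, assuming bash with extglob enabled before the line that uses it is parsed. The sample file names and the variable older are made up for the demonstration.

```shell
#!/bin/bash
shopt -s extglob nullglob

dir=$(mktemp -d)
touch "$dir"/20191229104919_FOOBAR0001234567.zip \
      "$dir"/20201229104919_FOOBAR0101234567.zip.done \
      "$dir"/20211229104919_FOOBAR0201234567.zip \
      "$dir"/20211229104919_FOOBAR0287654321.zip

latest=02          # latest version found for this asset
asset=01234567

# +([0-9])            one or more digits (the timestamp)
# !("$latest")        any version string other than the latest one
# *(.+([[:alnum:]]))  zero or more ".extra" suffixes
older=( "$dir"/+([0-9])_FOOBAR!("$latest")"$asset".zip*(.+([[:alnum:]])) )

printf '%s\n' "${older[@]##*/}"
rm -r -- "$dir"
```

This lists the version 00 and version 01 files of asset 01234567 and leaves out both the version 02 file and the file that belongs to the other asset.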