How can I sort an array based on a non-integer substring in Bash?

I wrote a cleanup script to delete certain files. The files are stored in subfolders. I use find to collect those files into an array, and because find is recursive, the entries include the subfolder. An array entry (the path to a file) could look like this:

./2021_11_08_17_28_45_1733556/2021_11_12_04_15_51_1733556_0.jfr

As you can see, the file names are timestamps. find orders its results by folder name only (./2021_11_08_17_28_45_1733556), but I need to sort all files, which can be in different folders, by the timestamp in the file name alone. The folder names can be ignored completely. That way I can delete the oldest files first. Below is my script in its current, not properly working state; I need to add some sorting to fix my problem.
Any ideas?

#!/bin/bash

# handle -h (help)
if [[ "$1" == "-h" || "$1" == "" ]]; then 
  echo -e '-p [path to the target folder] \n-f [number of files that should remain in the folder] \n-d [false to disable the dry run]'
  exit 0
fi

# handle parameters
while getopts p:f:d: flag 
do
    case "${flag}" in
        p) pathToFolder=${OPTARG};;
        f) maxFiles=${OPTARG};;
        d) dryRun=${OPTARG};;
        *) echo -e '-p [path to the target folder] \n-f [number of files that should remain in the folder] \n-d [false to disable the dry run]'
    esac
done

if [[ -z $dryRun ]]; then
    dryRun=true
fi


# fill the array with the .jfr files; it should be sorted so that the oldest files get deleted first
fillarray() { 
    files=($(find -name "*.jfr" -type f))
    totalFiles=${#files[@]}
}

# Print the size of a file in KiB, so callers can capture it with $(getfilesize FILE)
getfilesize() {
    du -k "$1" | cut -f1
}

count=0

checkfiles() {
    
    # Check if File matches the maxFiles parameter
    if [[ ${#files[@]} -gt $maxFiles ]]; then 
         # Check if dryRun is enabled
        if [[ $dryRun == "false" ]]; then
            echo "msg=\"Removal result\", result=true, file=$(realpath $1) filesize=$(getfilesize $1), reason=\"outside max file boundary\""
            ((count++))
            rm "$1"
        else
            ((count++))
            echo msg="\"Removal result\", result=true, file=$(realpath $1 ) filesize=$(getfilesize $1), reason=\"outside max file boundary\""
        fi
    # Remove the file from the files array
    files=(${files[@]/$1}) 
    else 
        echo msg="\"Removal result\", result=false, file=$( realpath $1), reason=\"within max file boundary\""
    fi
}

# Scan for empty files
scanfornullfiles() { 
    for file in "${files[@]}"
    do
        filesize=$(! getfilesize $file)
        if [[ $filesize == 0 ]]; then
            files=(${files[@]/$file})
            echo msg="\"Removal result\", result=false, file=$(realpath $file), reason=\"empty file\""
        fi
    done
}

echo msg="jfrcleanup.sh started", maxFiles=$maxFiles, dryRun=$dryRun, directory=$pathToFolder
{
    cd $pathToFolder > /dev/null 2>&1
} || {
    echo msg="no permission in directory"
    echo msg="jfrcleanup.sh stopped"
    exit 0
}

fillarray #> /dev/null 2>&1


scanfornullfiles
for file in "${files[@]}"
    do
        checkfiles $file
    done
echo msg="\"jfrcleanup.sh finished\", totalFileCount=$totalFiles filesRemoved=$count"

CodePudding user response：

Assuming the file paths do not contain newline characters, would you please try the following Schwartzian transform method:

#!/bin/bash

pat="/([0-9]{4}(_[0-9]{2}){5})[^/]*\.jfr$"
while IFS= read -r -d "" path; do
    if [[ $path =~ $pat ]]; then
        printf "%s\t%s\n" "${BASH_REMATCH[1]}" "$path"
    fi
done < <(find . -type f -name "*.jfr" -print0) | sort -k1,1 | head -n 1 | cut -f2- | tr "\n" "\0" | xargs -0 echo rm

The string pat is a regex pattern to extract the timestamp from the filename such as 2021_11_12_04_15_51.
Then the timestamp is prepended to the filename delimited by a tab character.
The output lines are sorted by the timestamp in ascending order (oldest first).
head -n 1 picks the oldest line. If you want to change the number of files to remove, change the number given to the -n option.
cut -f2- drops the timestamp to retrieve the filename.
tr "\n" "\0" converts the newline separators to NUL bytes, so that xargs -0 handles filenames containing spaces or tab characters correctly.
xargs -0 echo rm just outputs the command lines as a dry run. If the output looks good, drop echo.
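
The decorate, sort, undecorate steps described above can also be tried without touching the file system. The following is only a sketch: the function name sort_by_file_timestamp and the sample paths are made up for illustration, and it assumes, like the answer, that paths contain no newline characters.

```shell
#!/bin/bash

# sort_by_file_timestamp: hypothetical helper name, not part of the answer.
# Reads one path per line on stdin and prints the paths ordered by the
# timestamp embedded in the file name (oldest first).
# Assumption: no newline characters in the paths.
sort_by_file_timestamp() {
    local pat='/([0-9]{4}(_[0-9]{2}){5})[^/]*\.jfr$'
    local path
    while IFS= read -r path; do
        if [[ $path =~ $pat ]]; then
            # decorate: timestamp, tab, original path
            printf '%s\t%s\n' "${BASH_REMATCH[1]}" "$path"
        fi
    done | sort -k1,1 | cut -f2-    # sort on the key, then undecorate
}

# Demo with made-up paths; the folder order differs from the file order.
printf '%s\n' \
    './2021_11_08_17_28_45_1733556/2021_11_12_04_15_51_1733556_0.jfr' \
    './2021_11_09_10_00_00_1733556/2021_11_10_01_02_03_1733556_0.jfr' |
    sort_by_file_timestamp
```

In the demo, the path from the second folder is printed first, because its file name carries the older timestamp. Lines that do not match the pattern are silently skipped.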

CodePudding user response：

If you have GNU find, and pathnames don't contain new-line ('\n') and tab ('\t') characters, the output of this command will be ordered by basenames:

find path/to/dir -type f -printf '%f\t%p\n' | sort | cut -f2-
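
To get that ordered output back into a Bash array, something along these lines might work. It is a sketch under the same assumptions (GNU find, no newline or tab characters in path names); the function name list_by_basename and the added -name '*.jfr' filter are illustrative additions, and mapfile requires Bash 4 or newer.

```shell
#!/bin/bash

# list_by_basename: hypothetical helper. Prints the .jfr paths below
# directory $1, ordered by file name instead of folder name.
# Assumptions: GNU find (-printf), no newline or tab characters in paths.
list_by_basename() {
    find "$1" -type f -name '*.jfr' -printf '%f\t%p\n' | sort | cut -f2-
}

# Load the ordered paths into an array, one element per line.
mapfile -t files < <(list_by_basename .)
printf 'found %d .jfr file(s)\n' "${#files[@]}"
```

Unlike files=($(...)), mapfile -t does not split on spaces, so a path containing a space stays a single array element.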

CodePudding user response：

TL;DR: since you're already using find, and if your version supports the -printf option, you could use something like:

find . -type f -name '*.jfr' -printf '%f/%h/%f\n' | sort -k1 -n | cut -d '/' -f2-

Otherwise, use a while read loop with a different -printf format:

#!/usr/bin/env bash

while IFS='/' read -rd '' time file; do
  printf '%s\n' "$file"
done < <(find . -type f -name '*.jfr' -printf '%T@/%p\0' | sort -zn)

Note that -printf for find and the -z flag for sort are GNU extensions.

To save the file names, you could change

printf '%s\n' "$file"

to something like the following, which appends to an array named files:

files+=("$file")

Then "${files[@]}" has the file names as elements.

The last snippet, with the while read loop, does not depend on the file names; it sorts by the modification time that GNU find reports (%T@).
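
Putting the loop and the array together could look roughly like this. It is a sketch that assumes GNU find and GNU sort; the function name files_by_mtime is made up, and the result only matches the order of the file-name timestamps if the modification times are in the same order.

```shell
#!/usr/bin/env bash

# files_by_mtime: hypothetical helper. Fills the global array "files" with
# the .jfr paths below directory $1, oldest modification time first.
# Assumptions: GNU find (-printf) and GNU sort (-z).
files_by_mtime() {
    files=()
    local mtime file
    # Each NUL-terminated record is "<epoch seconds>/<path>"; read puts the
    # part before the first slash into mtime and the rest into file.
    while IFS='/' read -rd '' mtime file; do
        files+=("$file")
    done < <(find "$1" -type f -name '*.jfr' -printf '%T@/%p\0' | sort -zn)
}

files_by_mtime .
printf 'oldest first: %d file(s)\n' "${#files[@]}"
```

Because the records are NUL-terminated, this variant also copes with newlines in path names.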

CodePudding user response：

I solved the problem! I sort the array with the following so the oldest files will be deleted first:

files=($(printf '%s\n' "${files[@]}" | sort -t/ -k3))
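
This works because -t/ splits each path such as ./folder/file into the fields ".", "folder" and "file", so -k3 compares the file names. A slightly more defensive variant might look like the sketch below. The sample paths, the maxFiles value and the toDelete array are made up for illustration, it assumes every path has exactly one folder level as in the question, and mapfile requires Bash 4 or newer.

```shell
#!/bin/bash

# Made-up sample data in the order find might return it (by folder name).
files=(
    './2021_11_08_17_28_45_1733556/2021_11_12_04_15_51_1733556_0.jfr'
    './2021_11_08_17_28_45_1733556/2021_11_13_09_00_00_1733556_0.jfr'
    './2021_11_09_10_00_00_1733556/2021_11_10_01_02_03_1733556_0.jfr'
)
maxFiles=2   # same meaning as the -f option of the script in the question

# Sort on field 3 (the file name). mapfile -t keeps each line as one array
# element, so paths with spaces survive; newlines in paths are assumed absent.
mapfile -t files < <(printf '%s\n' "${files[@]}" | sort -t/ -k3)

# Everything before the newest $maxFiles entries is a deletion candidate.
excess=$(( ${#files[@]} - maxFiles ))
(( excess < 0 )) && excess=0
toDelete=("${files[@]:0:excess}")

printf 'would remove: %s\n' "${toDelete[@]}"
```

With the sample data, the only candidate is the 2021_11_10 file from the second folder, even though find listed it last.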
