Calculate combined filesize of thousands of files

We have a software package that performs tasks on batches of files, assigning each batch a job number. Batches can have any number of files in them. The files are then stored in a directory structure similar to this:

/asc/array1/.storage/10/10297/10297-Low-res.m4a
...
/asc/array1/.storage/3/3814/3814-preview.jpg

The filename is generated automatically. The directory under .storage is the thousands digits of the file number (10297 goes under 10, 3814 goes under 3).
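
To make that layout concrete, here is a minimal sketch of how the directory for a given file number could be derived. The helper name storage_dir is hypothetical, and the rule (file number integer-divided by 1000) is only inferred from the two example paths, so treat it as an assumption rather than how the software actually does it.

```shell
# Hypothetical helper (not part of the original software): builds the storage
# directory for a file number, assuming the parent directory is the file
# number integer-divided by 1000. That matches the two example paths in the
# question but is otherwise an assumption.
storage_dir() {
    printf '/asc/array1/.storage/%d/%d\n' "$(( $1 / 1000 ))" "$1"
}
```

Calling `storage_dir 10297` prints `/asc/array1/.storage/10/10297`, and `storage_dir 3814` prints `/asc/array1/.storage/3/3814`.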

There is also a database which associates the job number and the file number with the client in question. Running a SQL query, I can list out the job number, client and the full path to the files. Example:

213     sample-data     /asc/array1/.storage/10/10297/10297-Low-res.m4a
...
214     client-abc      /asc/array1/.storage/3/3814/3814-preview.jpg

My task is to calculate the total storage being used per client. So, I wrote a quick and dirty bash script to iterate over every single row and du the file, adding it to an associative array. I then plan to echo this out or produce a CSV file for ingest into PowerBI or some other tool. Is this the best way to handle this? Here is a copy of the script as it stands:

#!/bin/bash

declare -A clientArr

# 1 == Job Num
# 2 == Client
# 3 == Path
while read line; do
    client=$(echo "$line" | awk '{ print $2 }')
    path=$(echo "$line" | awk '{ print $3 }')

    if [ -f "$path" ]; then
        size=$(du -s "$path" | awk '{ print $1 }')
        clientArr[$client]=$(( ${clientArr[$client]:-0} + size ))
    fi
done < /tmp/pm_report.txt

for key in "${!clientArr[@]}"; do
    echo "$key,${clientArr[$key]}"
done

CodePudding user response：

Assuming:

you have GNU coreutils du
the filenames do not contain whitespace

This has no shell loops, calls du once, and iterates over the pm_report file twice.

file=/tmp/pm_report.txt

awk '{printf "%s\0", $3}' "$file" \
| du -s --files0-from=- 2>/dev/null \
| awk '
    NR == FNR {du[$2] = $1; next}
    {client_du[$2] += du[$3]}
    END {
      OFS = "\t"
      for (client in client_du) print client, client_du[client]
    }
  ' - "$file"
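
Since the question mentions CSV output, here is a rough variation on the same pipeline wrapped in a function. It is a sketch under the same assumptions (GNU du, no whitespace in filenames). The function name client_usage_csv is made up for illustration. It uses du -b so the totals are apparent sizes in bytes rather than disk blocks, swaps the awk printf for tr to produce the NUL-separated list, and skips any path that du could not read.

```shell
# Hypothetical wrapper: print "client,total_bytes" for a report file whose
# columns are: job number, client, path (whitespace separated).
# Assumes GNU du (-b and --files0-from are GNU extensions).
client_usage_csv() {
    report=$1
    # Column 3 holds the path; turn newlines into NULs for --files0-from.
    awk 'NF >= 3 { print $3 }' "$report" \
    | tr '\n' '\0' \
    | du -sb --files0-from=- 2>/dev/null \
    | awk -v OFS=',' '
        # First input (stdin): du output, "size<TAB>path".
        NR == FNR { size[$2] = $1; next }
        # Second input: the report. Add up sizes per client, ignoring
        # paths that du did not report (for example, missing files).
        ($3 in size) { total[$2] += size[$3] }
        END { for (client in total) print client, total[client] }
      ' - "$report" \
    | sort
}
```

Something like `client_usage_csv /tmp/pm_report.txt > usage.csv` would then give a file that can be loaded into a reporting tool. Whether bytes or disk blocks is the right measure depends on what "storage used" should mean for billing or reporting.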

CodePudding user response：

Using file foo:

$ cat foo
213     sample-data     foo          # this file
214     client-abc      bar          # some file I had in the dir
215     some            nonexistent  # didn't have this one

and the awk:

$ gawk '                             # using GNU awk
@load "filefuncs"                    # for this default extension
!stat($3,statdata) {                 # "returns zero upon success"
    a[$2] += statdata["size"]        # get the size and update array
}
END {                                # in the end
    for(i in a)                      # iterate all
        print i,a[i]                 # and output
}' foo foo                           # running twice for testing array grouping

Output:

client-abc 70
sample-data 18