Calculate total size of all un-deletable files (including files in subdirectories)


I have a somewhat unusual problem: there's shared storage on a server with hundreds of users, and I want to identify all the files that would not be deletable by my user (if I tried to run an auto-delete script).

For all these identified files, I need the long-listing format (i.e. full path, owner, last modified, size). My current understanding is that these would be all the files whose immediate parent directories are non-writable by me; please correct me if this approach is wrong.
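
For context, my reasoning is that on Unix-like systems deleting a file requires write (and execute) permission on the directory that contains it, not on the file itself. A throwaway example:

mkdir perm-demo && touch perm-demo/victim   # "perm-demo" is just an illustrative name
chmod a-w perm-demo                         # make the directory non-writable
rm -f perm-demo/victim                      # fails with "Permission denied", even though the file itself is writable
chmod u+w perm-demo && rm -r perm-demo      # clean up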

After identifying these files, I also need the total size they occupy. This step can be done with a simple Python script that adds up all the sizes, but I'm still wondering whether there's a better way to do it.

Here's my current approach:

find /downloads/ -type d ! -writable -print0 |
xargs -0 -I{.} find {.} -maxdepth 1 -type f -print0 |
xargs -0 -I{.} ls -l {.} > myfile.txt
  • Find all directories (recursively) that are non-writable by my user
  • Find all the immediate children of each of these directories and keep only the regular files.
  • Get a long listing for each of these identified files and store the results in a .txt file
  • Read this .txt file with Python and add up all the individual sizes (in bytes) to get the total size, then divide by (1024*1024*1024) to get the size in GB. (Python script below; a shorter find/awk variant is sketched after it.)
with open("myfile.txt", "r") as f:
    total_size = 0
    for line in f:
        try:
            data = line.split()
            if len(data) > 5:
                total_size  = int(data[4]) # "4" is the index of the size column
        except:
            pass
print("Total Size:", str(round(total_size/(1024*1024*1024), 2)), "GB")

The problem I'm facing with this approach is that the final size I'm getting is the total size of the entire volume.

My answer = 1400.02GB (as reported by following the steps above)
The total size of the entire shared space = 1400.02GB
Actual occupied space on the shared server = 800GB
Expected answer should be <800GB

So the problem is that my answer equals the total size of the entire server space, including space that isn't even occupied.


This mismatch between my answer (1400.02GB) and the expected answer (<800GB) also makes me question whether the right files have been identified.

Could someone please suggest a better way to do this, or point out the problem in my approach so that I can fix it? Thanks a lot.

CodePudding user response:

Here are two reasons why simply adding up the byte counts reported by ls to calculate disk usage has to be considered naive:

  1. You're double-counting hardlinked files, even though a hardlink is just a directory entry taking basically no extra disk space.
  2. If you want to find out the actual disk usage ("get the total size they're occupying") you have to consider the block size. E.g. a directory with 1000 1-byte files can take much more space than a directory with one 1MB file (a quick demonstration follows this list).
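
To see point 2 in action (a rough sketch, assuming GNU coreutils; the exact figure depends on your filesystem's block size):

printf 'x' > tiny             # a 1-byte file ("tiny" is just a demo name)
ls -l tiny                    # size column: 1 byte
du -h tiny                    # typically 4.0K - the file still occupies a whole block
du -h --apparent-size tiny    # 1 byte, i.e. what ls reports
rm tiny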

So the right tool for the job is du. But for du to do its job correctly and not double-count hardlinks, you have to give it all the files to check at once. Luckily the GNU version of du has the --files0-from option for that, e.g. you can use

find . -type d ! -writable -print0 |
du -ch -d 1 --files0-from=- |
tail -n 1
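
(With GNU du, -c appends a cumulative "total" line at the end of the output, and tail -n 1 keeps only that line; drop the tail if you also want the per-directory breakdown, which -d 1 limits to one level below each listed directory.)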

This also avoids xargs, which can be very slow with the -I option, since -I means you run a separate find invocation for each argument.
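
A toy illustration of that difference, with echo standing in for the real command:

printf 'a\nb\nc\n' | xargs -I{} echo {}   # three separate echo invocations, one per argument
printf 'a\nb\nc\n' | xargs echo           # a single invocation: echo a b c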

CodePudding user response:

I'm not sure why your command gave such a wrong number, but here is another one you could try.

Note that "total size they're occupying" can differ a lot from the sum of sizes printed by ls due to things like file system blocks, hardlinks, and sparse files. So we use du instead of ls to get the sizes.

The following script is basically LMC's comment, but incorporates ls into the same search and uses -exec instead of xargs -I, which should drastically speed things up, since the second find runs only a handful of times instead of once per directory.

#! /bin/bash

# Append a long listing of the given files to the collection file.
longlist() {
  ls -l "$@" >> "$longlistfile"
}
# For each given directory: long-list its immediate regular files via longlist,
# and also print them NUL-separated so du can read them from stdin.
filesonly() {
  find "$@" -maxdepth 1 -type f -exec bash -c 'longlist "$@"' . {} + -print0
}
export -f longlist filesonly
export longlistfile=myfile.txt

rm -f "$longlistfile"
find . -type d \! -writable -exec bash -c 'filesonly "$@"' . {} + |
du -ch --files0-from=- | tail -n1
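
If you run this script from the top of the shared area (or replace . with /downloads/), myfile.txt ends up containing the long listing of all the identified files, and the last line printed on the terminal is du's grand total.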