I have a somewhat weird problem: there's a shared storage volume on a server with hundreds of users, and I want to identify all the files that would not be deletable by my user (if I tried to run an auto-delete script).
For all these identified files, I need the long-listing format (i.e. full path, owner, last modified time, size). My current understanding is that these would be all the files whose immediate parent directories are non-writable by me; please correct me if this approach is wrong.
After identifying these files, I also need to get the total size they're occupying. This step could be done with a simple Python script that adds up all the sizes, but I'm still looking for a better way to do it.
Here's my current approach:
find /downloads/ -type d ! -writable -print0 |
xargs -0 -I{.} find {.} -maxdepth 1 -type f -print0 |
xargs -0 -I{.} ls -l {.} > myfile.txt
- Find all directories (recursively) that are non-writable by my user.
- Find all the immediate children of each of these directories, keeping only the regular files.
- Get a long listing for each of these identified files and store the results in a .txt file.
- Read this .txt file with Python and add all the individual sizes (in bytes) to get the total size, then divide by (1024*1024*1024) to get the size in GB. (Python script below:)
with open("myfile.txt", "r") as f:
    total_size = 0
    for line in f:
        try:
            data = line.split()
            if len(data) > 5:
                total_size += int(data[4])  # index 4 is the size column in "ls -l" output
        except ValueError:
            # skip lines that don't look like file entries
            pass

print("Total Size:", str(round(total_size / (1024 * 1024 * 1024), 2)), "GB")
The problem I'm facing with this approach is that the final size I'm getting is the total size of the entire volume.
My answer = 1400.02GB (as reported by following the steps above)
The total size of the entire shared space = 1400.02GB
Actual occupied space on the shared server = 800GB
Expected answer should be <800GB
So the problem is that my answer equals the total size of the entire server space, including space that isn't even occupied.
This mismatch between my answer (1400.02GB) and the expected answer (<800GB) ALSO makes me question the correctness of the files that have been identified.
Could someone please suggest a better way to do this, or point out the problem in my approach so that I can fix it? Thanks a lot.
CodePudding user response:
Here are two reasons why just adding up the bytes reported by ls to calculate disk usage has to be considered naive:
- You're double-counting hardlinked files, even though a hardlink is just a directory entry taking basically no extra disk space (see the small demo after this list).
- If you want to find out the actual disk usage ("get the total size they're occupying") you have to consider the block size. E.g. a directory with 1000 1-byte files can take much more space on disk than a directory with a single 1 MB file.
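To make the hardlink point concrete, here is a minimal sketch you could run in any scratch directory you are allowed to write to (the path /tmp/hardlink-demo is just an example): summing the ls sizes counts the data twice, while du reports it once.
mkdir -p /tmp/hardlink-demo && cd /tmp/hardlink-demo
dd if=/dev/zero of=big bs=1M count=10 2>/dev/null    # one 10 MiB file
ln big big-link                                      # hardlink: same data blocks, no extra space
ls -l | awk '{ sum += $5 } END { print sum, "bytes according to ls" }'   # ~20 MiB: counted twice
du -sh .                                             # actual disk usage: ~10 MiB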
So the right tool for the job is du. But in order for du to do its job correctly and not double-count hardlinks, you have to give it all the files to check at once. Luckily the GNU version of du has the --files0-from option for that, e.g. you can use
find . -type d ! -writable -print0 |
du -ch -d 1 --files0-from=- |
tail -n 1
This also avoids the xargs steps, which can be very slow due to the -I option, since it means you're doing a separate find invocation for each argument.
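If you want the total as a plain number of bytes rather than a human-readable figure (for example to convert it to GB yourself), a small variation of the above should also work with GNU du; this is only a sketch, reusing the /downloads/ path from the question:
total_bytes=$(find /downloads/ -type d ! -writable -print0 |
              du -c -B1 --files0-from=- |
              tail -n 1 |
              cut -f1)             # du prints "<size><TAB><name>"; keep only the size
echo "Total: $total_bytes bytes"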
CodePudding user response:
I'm not sure why your command gave such a wrong number, but here is another one you could try.
Note that "total size they're occupying" can differ a lot from the sum of sizes printed by ls due to things like file system blocks, hardlinks, and sparse files. So we use du instead of ls to get the sizes.
The following command is basically LMC's comment, but incorporates ls into the same search and uses -exec instead of xargs -I, which should drastically speed up the command, as you run the second find only a handful of times instead of once for each directory.
#! /bin/bash

# append a long listing of the given files to $longlistfile
longlist() {
    ls -l "$@" >> "$longlistfile"
}

# list the regular files directly inside the given directories, log them
# via longlist, and also print them NUL-delimited for du to read
filesonly() {
    find "$@" -maxdepth 1 -type f -exec bash -c 'longlist "$@"' . {} + -print0
}

export -f longlist filesonly
export longlistfile=myfile.txt
rm -f "$longlistfile"

find . -type d \! -writable -exec bash -c 'filesonly "$@"' . {} + |
du -ch --files0-from=- | tail -n1
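If you want to convince yourself of the speed argument, you can count how often the inner command actually gets started. This is just a toy illustration (reusing the /downloads/ path from the question), not part of the script above:
# one process per directory (roughly what xargs -I does)
find /downloads/ -type d ! -writable -exec sh -c 'echo invocation' sh {} \; | wc -l
# batched arguments with "{} +" (typically only a handful of processes)
find /downloads/ -type d ! -writable -exec sh -c 'echo invocation' sh {} + | wc -l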