Finding files that are *not* hard links or under hard links directory via a shell script

Time:06-14

I would like to find all files that are not hard links and are not under a hard-linked directory. I found this awesome SO answer, but the command below does not handle the case of files under a hard-linked directory:

find /1 -type f -links 1 -print

For example:

/1/2/3/test.txt
/1/A/3/test.txt

2 is a hard link to A, so we expect to find only one test.txt file.

One more example, from Android:

$ adb shell ls -li /data/data/com.android.nfc |grep files
4243 drwxrwx--x 2 nfc  nfc  3488 2022-06-13 11:08 files
$ adb shell ls -li /data/user/0/com.android.nfc |grep files
4243 drwxrwx--x 2 nfc  nfc  3488 2022-06-13 11:08 files
$ adb shell ls -li /data/data/com.android.nfc/files/service_state.xml
5877 -rw------- 1 nfc nfc 100 2022-06-13 11:08 /data/data/com.android.nfc/files/service_state.xml
$ adb shell ls -li /data/user/0/com.android.nfc/files/service_state.xml
5877 -rw------- 1 nfc nfc 100 2022-06-13 11:08 /data/user/0/com.android.nfc/files/service_state.xml

CodePudding user response:

Systems that support unrestricted hard links to directories are rare, but a similar situation can be created using bind mounts. (See What is a bind mount?.)

Try this Shellcheck-clean code to list files under the current directory that do not have multiple paths (caused by bind mounts or links to directories):

#! /bin/bash -p

shopt -s lastpipe

declare -A devino_of_file
declare -A count_of_devino
find . -type f -printf '%D.%i-%p\0' \
    |   while IFS= read -r -d '' devino_path; do
            devino=${devino_path%%-*}
            path=${devino_path#*-}
            devino_of_file[$path]=$devino
            count_of_devino[$devino]=$(( ${count_of_devino[$devino]-0} + 1 ))
        done

for path in "${!devino_of_file[@]}"; do
    devino=${devino_of_file[$path]}
    (( ${count_of_devino[$devino]} == 1 )) && printf '%s\n' "$path"
done
  • shopt -s lastpipe ensures that variables set in the while loop in the pipeline persist after the pipeline completes. It requires Bash 4.2 (released in 2011) or later.
  • The code uses "devino" values. The devino value for a path consists of the device number and inode number for the path, separated by a . character. A devino string should uniquely identify a file on a system, independent of any path to it.
  • The devino_of_file associative array maps paths to the corresponding devino values.
  • The count_of_devino associative array maps devino strings to counts of the number of paths found to them.
  • See BashFAQ/001 (How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?) for an explanation of while IFS= read -r -d '' ....
  • When all files in the directory tree have been processed, all paths whose devino value has a count of 1 (meaning that no other path has been found to the same file) are printed.
  • The code that populates the associative arrays can handle arbitrary paths (including ones that contain spaces or newlines) but the output will be useless if any of the paths contain newlines (because of the '%s\n' format string).
  • Alternative paths caused by symlinks are automatically avoided because find doesn't follow symlinks by default. The code should still work if the -follow option to find is used though. (It's easier to test with symlinks than with directory hardlinks or bind mounts.)
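
To see the devino idea from the notes above on individual paths, a small helper can print the same device.inode string that find's '%D.%i' produces. This is only a sketch: devino_of is a hypothetical name, and it assumes GNU stat (the -c option with the %d and %i format directives).

```bash
#!/bin/bash
# Sketch only. devino_of is a hypothetical helper, not part of the script above.
# Assumes GNU stat: %d is the device number in decimal, %i is the inode number.
devino_of() {
    stat -c '%d.%i' -- "$1"
}
```

As the last note suggests, a directory symlink is an easy stand-in for a bind mount when testing: after mkdir -p demo/2/3, date > demo/2/3/test.txt and ln -s 2 demo/A, both devino_of demo/2/3/test.txt and devino_of demo/A/3/test.txt should print the same string, because both paths lead to one file.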

Note that Bash code runs very slowly. It is interpreted in a very laborious way. The code above is likely to be too slow if the directory tree being processed has large numbers of files. For example, it processes files at a rate of around 10 thousand per second on my test VM.
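
If speed matters, one option is to let awk do the counting instead of a Bash loop. The following is a sketch under stated assumptions, not a benchmarked replacement: it assumes GNU find (for -printf) and that no path contains a newline, and list_single_path_files is a hypothetical name. Like the Bash version, it prints only files whose devino was seen exactly once.

```bash
#!/bin/bash
# Sketch only: list regular files whose (device, inode) pair is reached by
# exactly one path under the given directory (default: current directory).
# Assumptions: GNU find (-printf) and no newlines in any path.
# list_single_path_files is a hypothetical name, not from the code above.
list_single_path_files() {
    find "${1:-.}" -type f -printf '%D.%i %p\n' |
        awk '
            {
                devino = $1
                count[devino]++
                # Remember the path: everything after "devino ".
                path_of[devino] = substr($0, length(devino) + 2)
            }
            END {
                # Print only paths whose devino was seen exactly once.
                for (d in count)
                    if (count[d] == 1)
                        print path_of[d]
            }'
}
```

awk does not guarantee any order for the for (d in count) loop, so pipe the result through sort if a stable order is needed, e.g. list_single_path_files /1 | sort.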

CodePudding user response:

Forgive the humor in the comment, but *I don't think you understand your question.*

What I mean by that is that when you create a file, it's a link.

$: date > file1
$: ls -l file1 # note the 2nd field - the "number of hard links"
-rw-r--r--. 1 P2759474 518 29 Jun 13 17:34 file1

You think of file1 as the file, but it's ...complicated, lol.

The date command above creates output. The redirection tells "the system" that you want that data in "a file", so it allocates space on the disk, writes the data to that space, and creates an inode that defines the "file".

A "hard link" is basically just a link to that data. It's the same "file" with another name if you make another link. Editing either edits both (all, if you make several), because they are the same file.

$: date >file1
$: ln file1 file2
$: diff file?
$: cat file1
Mon Jun 13 17:30:22 GMT 2022
$: date >file2
$: diff file?
$: cat file1
Mon Jun 13 17:31:06 GMT 2022

Now, a symlink is another file of another kind with a different inode, containing the name of the file it "links" to symbolically, but a hard link is the file. ls -i will show you the inode index number, in fact.

$: date >file1
$: ln file1 file2
$: diff file?
$: cat file2
Mon Jun 13 17:34:41 GMT 2022
$: ls -li file? # note the 1st and 3rd fields
24415801 -rw-r--r--. 2 paul 518 29 Jun 13 17:34 file1
24415801 -rw-r--r--. 2 paul 518 29 Jun 13 17:34 file2
$: rm file2
$: ls -li file? # note the 1st and 3rd fields
24415801 -rw-r--r--. 1 P2759474 518 29 Jun 13 17:34 file1

Let's make a different file with that name and compare again.

$: date >file2
$: cat file? # not linked now
Mon Jun 13 17:34:41 GMT 2022
Mon Jun 13 17:41:23 GMT 2022
$: diff file? # now they differ
1c1
< Mon Jun 13 17:34:41 GMT 2022
---
> Mon Jun 13 17:41:23 GMT 2022
$: ls -li file? # and have different inodes, one link each
24415801 -rw-r--r--. 1 P2759474 518 29 Jun 13 17:34 file1
24419687 -rw-r--r--. 1 P2759474 518 29 Jun 13 17:41 file2

If I had copied the original data, the diff would have been empty, but it would still be a different inode, so a different file, and I could have edited them independently.
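
Rather than comparing inode numbers by eye, the shell's -ef test can answer "are these two names the same file?" directly: it is true when both paths resolve to the same device and inode. A minimal sketch, where same_file is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch only. same_file is a hypothetical helper name.
# [ A -ef B ] is true when A and B refer to the same device and inode.
same_file() {
    if [ "$1" -ef "$2" ]; then
        echo "same file"
    else
        echo "different files"
    fi
}
```

With the files above, same_file file1 file2 reports "same file" while file2 is a hard link to file1, and "different files" once file2 has been removed and recreated, even if the contents happen to match. Note that -ef follows symlinks, so a symlink and its target also compare as the same file.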

And a symlink -

$: ln -s file1 file3
$: diff file1 file3
$: ls -li file?
24415801 -rw-r--r--. 1 P2759474 518 29 Jun 13 17:34 file1
24419687 -rw-r--r--. 1 P2759474 518 29 Jun 13 17:41 file2
24419696 lrwxrwxrwx. 1 P2759474 518  5 Jun 13 17:44 file3 -> file1

Opening a symlink will usually open the file it targets, but that can depend on the tool you are using, so be aware of the differences.

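You cannot create a hard link to a file on a separate filesystem, because a hard link is just a directory entry pointing at an inode, and inode numbers only mean something within one filesystem. You can use a symlink instead.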

What you might be looking for is

for f in *; do [[ -f "$f" ]] && echo "$f"; done

or something like that.

Hope that helps.

CodePudding user response:

From comments on the previous edit of this answer, it seems that the duplication is being caused because some files appear in two different places in the filesystem due to bind mounts.

That being the case, the original code you used produces technically correct output. However, it lists some relevant files more than once (because they have multiple names):

find /1 -type f -links 1 -print

A mounted filesystem is uniquely identified by its device number. A file is uniquely identified within that filesystem by its inode number. So a file can be uniquely identified on a particular host by the (device#,inode#) tuple. (GNU) find can provide these tuples along with filenames, as @pjh's answer shows:

find /1 -type f -links 1 -printf '%D.%i %p\0'

A simple (GNU) awk script can filter the output so that only one path is listed for each unique (device#,inode#):

find /1 -type f -links 1 -printf '%D.%i %p\0' |
gawk -v RS='\0' '!id[$1]++ && sub(/^[0-9.]+ /,"")'

This uses the common awk idiom !x[y]++, which evaluates to true only the first time the key y is seen (the element is created with an empty value that counts as 0, and it is incremented thereafter; !0 is true). The (device#,inode#) prefix is deleted by sub(). awk implicitly prints a record when the "pattern" evaluates to true, i.e. when a (device#,inode#) tuple is first seen and the prefix is successfully stripped. The (GNU) find output is delimited by nulls rather than newlines, so the (GNU) awk script sets the input record separator RS to null as well.
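
The idiom is easier to see in isolation on a few made-up lines. In this small sketch the keys and paths are invented for illustration, and first_per_key is a hypothetical name. It uses newline-separated records so it runs with any awk, not just gawk.

```bash
#!/bin/bash
# Sketch only. first_per_key is a hypothetical name; the sample data is made up.
# Keep only the first line seen for each distinct first field.
first_per_key() {
    awk '!seen[$1]++'
}

printf '%s\n' '10.1 /a' '10.1 /b' '10.2 /c' | first_per_key
# prints:
# 10.1 /a
# 10.2 /c
```

The second line is dropped because its key 10.1 has already been counted once, so seen[$1]++ yields 1 and the negation is false.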
