getAllFiles (but not symlinks)

Time:09-17

I have a directory traversal function in Haskell, but I want it to ignore symlinks. I figured out how to filter out the files alone, albeit with a slightly inelegant secondary filterM. But after some diagnosis I realize that I'm failing to filter symlinked directories.

I'd like to be able to write something like this:

-- Lazily return (normal) files from rootdir
getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = do
  nodes <- pathWalkLazy root
  -- get file paths from each node
  let files = [dir </> file | (dir, _, files) <- nodes,
                              file <- files,
                              not . pathIsSymbolicLink dir]
  normalFiles <- filterM (liftM not . pathIsSymbolicLink) files
  return normalFiles

However, all the variations I have tried get some version of the "Couldn't match expected type ‘Bool’ with actual type ‘IO Bool’" message (without the filter clause in the comprehension it works, but fails to filter those linked dirs).

Various hints at ways I might completely restructure the function are in partial form at online resources, but I'm pretty sure that every such variation will run into some similar issue. The list comprehension would certainly be the most straightforward way... if I could just somehow exclude those dirs that are links.


Followup: Unfortunately, the solution kindly provided by ChrisB behaves (almost?!) identically to my existing version. I defined three functions, and run them within a test program:

-- XXX: debugging
files <- getAllFilesRaw rootdir
putStrLn ("getAllFilesRaw:        " ++ show (length files))
files' <- getAllFilesNoSymFiles rootdir
putStrLn ("getAllFilesNoSymFiles: " ++ show (length files'))
files'' <- getAllFilesNoSymDirs rootdir
putStrLn ("getAllFilesNoSymDirs:  " ++ show (length files''))

The first is my version with the normalFiles filter removed. The second is my original version (minus the type error in the listcomp). The final one is ChrisB's suggestion.

Running that, then also the system find utility:

% find $CONDA_PREFIX -type f | wc -l
449667
% find -L $CONDA_PREFIX -type f | wc -l
501153
% haskell/find-dups $CONDA_PREFIX
getAllFilesRaw   :     501153
getAllFilesNoSymFiles: 464553
getAllFilesNoSymDirs:  464420

Moreover, this question came up because, for my own self-education, I've implemented the same application in a bunch of languages: Python, Golang, Rust, Julia, TypeScript, Bash, and (except for this glitch) Haskell; others are planned. The programs actually do something more with the files, but that's not the point of this question.

The point of this is that ALL other languages report the same number as the system find tool. Moreover, the specific issue is things like this:

% ls -l /home/dmertz/miniconda3/pkgs/ncurses-6.2-he6710b0_1/lib/terminfo
lrwxrwxrwx 1 dmertz dmertz 17 Apr 29  2020 /home/dmertz/miniconda3/pkgs/ncurses-6.2-he6710b0_1/lib/terminfo -> ../share/terminfo

There are about 16k such examples (on my system currently), but looking at some of them in the other versions of the tool, I see specifically that all the other languages exclude the contents of that symlinked directory.

Answer:

EDIT:

  • Instead of just fixing a Bool / IO Bool issue, we now want to match find's behavior.
  • After looking at the documentation, this seems quite hard to implement with reasonable performance using the PathWalk library, so I just hand-rolled it. (Using do-notation, as requested in the comments.) In my quick and dirty tests the results match those of find:
import System.FilePath
import System.Directory
getAllFiles' :: FilePath -> IO [FilePath]
getAllFiles' path = do
    isSymlink <- pathIsSymbolicLink path
    if isSymlink 
        -- if this is a symlink, return the empty list. 
        -- even if this was the original root. (matches find's behavior)
        then return [] 
        else do
            isFile <- doesFileExist path
            if isFile 
                then return [path] -- if this is a file, return it
                else do
                    -- if it's not a file, we assume it to be a directory
                    dirContents <- listDirectory path
                    -- run this function recursively on all the children
                    -- and accumulate the results
                    fmap concat $ mapM (getAllFiles' . (path </>)) dirContents
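
The "quick and dirty tests" mentioned above can be made reproducible with a small throwaway tree. The following is only a sketch under some assumptions: a POSIX-like system where creating symlinks needs no special privileges, a `directory` version that provides `createFileLink` and `createDirectoryLink`, and hypothetical names (`walkNoSymlinks`, `countInSampleTree`, the `getAllFiles-symlink-demo` directory) that are not part of the original answer. The walk is restated so the block compiles on its own.

```haskell
import Control.Exception (bracket)
import System.Directory
  ( createDirectoryIfMissing, createDirectoryLink, createFileLink
  , doesFileExist, getTemporaryDirectory, listDirectory
  , pathIsSymbolicLink, removePathForcibly )
import System.FilePath ((</>))

-- Same logic as getAllFiles' above, under a hypothetical name.
walkNoSymlinks :: FilePath -> IO [FilePath]
walkNoSymlinks path = do
  isSymlink <- pathIsSymbolicLink path
  if isSymlink
    then return []                 -- never report or enter a symlink
    else do
      isFile <- doesFileExist path
      if isFile
        then return [path]
        else do
          entries <- listDirectory path
          concat <$> mapM (walkNoSymlinks . (path </>)) entries

-- Build a tiny tree with one file symlink and one directory symlink,
-- count what the walk reports, then remove the tree again.
countInSampleTree :: IO Int
countInSampleTree = do
  tmp <- getTemporaryDirectory
  let root = tmp </> "getAllFiles-symlink-demo"   -- hypothetical scratch dir
  bracket
    (do removePathForcibly root
        createDirectoryIfMissing True (root </> "real")
        return root)
    removePathForcibly
    (\r -> do
        writeFile (r </> "a.txt") "a"
        writeFile (r </> "real" </> "b.txt") "b"
        -- Relative targets are resolved from the link's own directory.
        createFileLink "a.txt" (r </> "a-link")
        createDirectoryLink "real" (r </> "link")
        length <$> walkNoSymlinks r)
```

On such a tree only `a.txt` and `real/b.txt` are regular files reached without crossing a symlink, so the count should agree with what `find <root> -type f | wc -l` reports for the same layout.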

Original Answer solving the IO Bool / Bool issue

import Control.Monad (filterM)
import System.Directory (pathIsSymbolicLink)
import System.Directory.PathWalk (pathWalkLazy)
import System.FilePath ((</>))

getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = pathWalkLazy root
    -- remove dirs that are symlinks
    >>= filterM (\(dir, _, _) -> fmap not $ pathIsSymbolicLink dir)
    -- flatten to list of files
    >>= return . concat . map (\(dir, _, files) -> map (\f -> dir </> f) files)
    -- remove files that are symlinks
    >>= filterM (fmap not . pathIsSymbolicLink)
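
The counts in the follow-up hint at why this version still over-reports. `getAllFilesRaw` matches `find -L`, so the walk appears to descend into symlinked directories, and `pathIsSymbolicLink` only inspects the final component of the path it is given. A file such as `lib/terminfo/x/y` is therefore kept even though `lib/terminfo` is a link, because neither `lib/terminfo/x` nor the file itself is one. A minimal sketch of a filter that checks every component between the root and the path follows; `prefixesBelow`, `crossesSymlink`, and `dropSymlinked` are hypothetical helper names, and the approach costs one extra status check per path component, which is part of why a hand-rolled walk that prunes early is more attractive.

```haskell
import Control.Monad (filterM)
import System.Directory (pathIsSymbolicLink)
import System.FilePath ((</>), makeRelative, splitDirectories)

-- The root followed by every intermediate path down to the target.
-- For root "r" and path "r/a/b" this yields "r", "r/a", "r/a/b".
prefixesBelow :: FilePath -> FilePath -> [FilePath]
prefixesBelow root path =
  scanl (</>) root (filter (/= ".") (splitDirectories (makeRelative root path)))

-- True if the root, the path itself, or any directory in between is a
-- symlink. Assumes every prefix exists; pathIsSymbolicLink throws otherwise.
crossesSymlink :: FilePath -> FilePath -> IO Bool
crossesSymlink root path = or <$> mapM pathIsSymbolicLink (prefixesBelow root path)

-- Keep only the paths that are reachable from root without crossing a symlink.
dropSymlinked :: FilePath -> [FilePath] -> IO [FilePath]
dropSymlinked root = filterM (fmap not . crossesSymlink root)
```

Applied to the flattened file list (for example `dropSymlinked root files`), this could stand in for the final `filterM` step, at the price of re-checking shared parent directories for every file.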