I have a directory traversal function in Haskell, but I want it to ignore symlinks. I figured out how to filter out the symlinked files alone, albeit with a slightly inelegant secondary filterM. But after some diagnosis I realized that I'm failing to filter out symlinked directories.
I'd like to be able to write something like this:
-- Lazily return (normal) files from rootdir
getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = do
    nodes <- pathWalkLazy root
    -- get file paths from each node
    let files = [dir </> file | (dir, _, files) <- nodes,
                                file <- files,
                                not . pathIsSymbolicLink dir]
    normalFiles <- filterM (liftM not . pathIsSymbolicLink) files
    return normalFiles
However, all the variations I have tried get some version of the "Couldn't match expected type ‘Bool’ with actual type ‘IO Bool’" message (without the filter clause in the comprehension it works, but fails to filter those linked dirs).
Various hints at ways I might completely restructure the function are in partial form at online resources, but I'm pretty sure that every such variation will run into some similar issue. The list comprehension would certainly be the most straightforward way... if I could just somehow exclude those dirs that are links.
Followup: Unfortunately, the solution kindly provided by ChrisB behaves (almost?!) identically to my existing version. I defined three functions and ran them within a test program:
-- XXX: debugging
files <- getAllFilesRaw rootdir
putStrLn ("getAllFilesRaw: " ++ show (length files))
files' <- getAllFilesNoSymFiles rootdir
putStrLn ("getAllFilesNoSymFiles: " ++ show (length files'))
files'' <- getAllFilesNoSymDirs rootdir
putStrLn ("getAllFilesNoSymDirs: " ++ show (length files''))
The first is my version with the normalFiles filter removed. The second is my original version (minus the type error in the listcomp). The final one is ChrisB's suggestion.
Running that, then also the system find utility:
% find $CONDA_PREFIX -type f | wc -l
449667
% find -L $CONDA_PREFIX -type f | wc -l
501153
% haskell/find-dups $CONDA_PREFIX
getAllFilesRaw : 501153
getAllFilesNoSymFiles: 464553
getAllFilesNoSymDirs: 464420
Moreover, this question came up because, for my own self-education, I've implemented the same application in a bunch of languages: Python, Golang, Rust, Julia, TypeScript, Bash, and (apart from this glitch) Haskell; others are planned. The programs actually do something more with the files, but that's not the point of this question.
The point is that ALL the other languages report the same number as the system find tool. Moreover, the specific issue is things like this:
% ls -l /home/dmertz/miniconda3/pkgs/ncurses-6.2-he6710b0_1/lib/terminfo
lrwxrwxrwx 1 dmertz dmertz 17 Apr 29 2020 /home/dmertz/miniconda3/pkgs/ncurses-6.2-he6710b0_1/lib/terminfo -> ../share/terminfo
There are about 16k examples here (on my system currently), but spot-checking some of them against the other versions of the tool, I see specifically that all the other languages exclude the contents of that symlinked directory.
CodePudding user response:
EDIT:
- Instead of just fixing a Bool / IO Bool issue, we now want to match find's behavior.
- After looking at the documentation, this seems quite hard to implement with reasonable performance using the PathWalk library, so I just hand-rolled it. (Using do-notation, as requested in the comments.) In my quick and dirty tests the results match those of find:
import System.FilePath
import System.Directory

getAllFiles' :: FilePath -> IO [FilePath]
getAllFiles' path = do
    isSymlink <- pathIsSymbolicLink path
    if isSymlink
        -- if this is a symlink, return the empty list.
        -- even if this was the original root. (matches find's behavior)
        then return []
        else do
            isFile <- doesFileExist path
            if isFile
                then return [path] -- if this is a file, return it
                else do
                    -- if it's not a file, we assume it to be a directory
                    dirContents <- listDirectory path
                    -- run this function recursively on all the children
                    -- and accumulate the results
                    fmap concat $ mapM (getAllFiles' . (path </>)) dirContents
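If you want to sanity-check this without a huge conda tree, one option is to build a small throwaway directory that mirrors the ncurses example from the question (lib/terminfo -> ../share/terminfo) and run the traversal on it. This is only a sketch: regularFiles is a compact restatement of getAllFiles' above, while demoTree and the symlink-walk-demo directory name are made up for illustration. It assumes a POSIX system and a version of the directory package recent enough to provide createDirectoryLink.

```haskell
import System.Directory
  ( createDirectoryIfMissing, createDirectoryLink, doesFileExist
  , getTemporaryDirectory, listDirectory, pathIsSymbolicLink
  , removePathForcibly )
import System.FilePath ((</>))

-- Same logic as getAllFiles' above: symlinks yield nothing,
-- regular files yield themselves, directories are recursed into.
regularFiles :: FilePath -> IO [FilePath]
regularFiles path = do
    isSymlink <- pathIsSymbolicLink path
    if isSymlink
        then return []
        else do
            isFile <- doesFileExist path
            if isFile
                then return [path]
                else do
                    children <- listDirectory path
                    concat <$> mapM (regularFiles . (path </>)) children

-- Hypothetical test fixture: two real files plus one symlinked directory.
demoTree :: IO FilePath
demoTree = do
    tmp <- getTemporaryDirectory
    let root = tmp </> "symlink-walk-demo"
    removePathForcibly root                      -- start from a clean slate
    createDirectoryIfMissing True (root </> "share" </> "terminfo")
    createDirectoryIfMissing True (root </> "lib")
    writeFile (root </> "share" </> "terminfo" </> "xterm") "x"
    writeFile (root </> "lib" </> "libfoo.so") "y"
    -- lib/terminfo -> ../share/terminfo, as in the ncurses package
    createDirectoryLink (".." </> "share" </> "terminfo")
                        (root </> "lib" </> "terminfo")
    return root
```

Running demoTree >>= regularFiles in GHCi should list the two real files and nothing under lib/terminfo, which is the same thing find reports for that tree without -L.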
Original Answer solving the IO Bool / Bool issue
getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = pathWalkLazy root
    -- remove dirs that are symlinks
    >>= filterM (\(dir, _, _) -> fmap not $ pathIsSymbolicLink dir)
    -- flatten to list of files
    >>= return . concat . map (\(dir, _, files) -> map (\f -> dir </> f) files)
    -- remove files that are symlinks
    >>= filterM (fmap not . pathIsSymbolicLink)
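Judging by the counts in the question, a likely reason this version still overcounts is that the walk has already descended into the symlinked directory by the time the filter runs. The first filterM drops the node whose dir is the link itself, but a subdirectory beneath it (say lib/terminfo/x) is not a symlink in its own right, so its files survive. One possible workaround, sketched below, is to test every path component between the root and dir. The names pathsDownTo and underSymlink are hypothetical helpers, not part of the pathwalk package, and this does nothing about the wasted traversal, so the hand-rolled version is probably still the better choice.

```haskell
import System.Directory (pathIsSymbolicLink)
import System.FilePath ((</>), makeRelative, splitDirectories)

-- Every path from root down to dir, inclusive.
-- On POSIX: pathsDownTo "/a/b" "/a/b/c/d" == ["/a/b", "/a/b/c", "/a/b/c/d"]
pathsDownTo :: FilePath -> FilePath -> [FilePath]
pathsDownTo root dir =
    scanl (</>) root (filter (/= ".") (splitDirectories (makeRelative root dir)))

-- True if dir, or any directory between root and dir, is a symlink.
-- Assumes dir lies under root, as the nodes produced by the walk do.
underSymlink :: FilePath -> FilePath -> IO Bool
underSymlink root dir = or <$> mapM pathIsSymbolicLink (pathsDownTo root dir)
```

In the pipeline above, the first filter would then become filterM (\(dir, _, _) -> not <$> underSymlink root dir).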