I've played with Haskell, but only just. Working with true immutability is confusing to me.
Specifically, I have the following function (right now it's largely debugging stuff I've thrown in):
type BySize = Map Int [Finfo]
-- ... other stuff ...
-- Walk directories and return map
walkDir :: String -> BySize -> IO ([BySize])
walkDir rootdir bySize = do
    let !bySizeHist = [bySize]
    pathWalk rootdir (\root dirs files -> do
        forM_ files $ (\file -> do
            let !latest = head bySizeHist
            finfo <- do processPath (joinPath [root, file]) latest
            let !new = addBySize (f_size finfo) finfo latest
            let latest_size = Map.keys latest
            let new_size = Map.keys new
            let error = if latest == new
                        then
                            "Error, identical maps!"
                        else
                            "Update of map is fine" ++ (show latest_size) ++ (show new_size)
            putStrLn error
            let !bySizeHist = [new] ++ bySizeHist
            putStrLn (fname finfo) ))
    return bySizeHist
Basically, my goal is to get a Map that has file sizes for keys and a list of Finfo (file info) data structures as values. I tried a lot of different variations; this is merely the latest one that does not work.
I know that Maps are immutable, so I was hoping to generate a list of versions, and then utilize the latest one downstream. But I think maybe I should be using the State monad instead. I don't actually care about the history of Map versions, I was merely trying that in a fumbling approach.
The function addBySize works by itself. That is, given a size and a new Finfo object, it correctly returns a new Map based on the old one, but with either a new key added or the list that the existing key maps to expanded with the new Finfo object.
The problem is that the attempt to "rebind" bySizeHist fails (I think because of falling out of scope within the loop). So whereas I'd like to keep echoing an expanding list of keys during each pass through the loop, instead I get something like:
% haskell/find-dups haskell
Update of map is fine[][6]
/home/dmertz/git/LanguagePractice/haskell/that
Update of map is fine[][3235]
/home/dmertz/git/LanguagePractice/haskell/sha1sum.hi
Update of map is fine[][8160]
/home/dmertz/git/LanguagePractice/haskell/sha1sum.o
Update of map is fine[][241]
/home/dmertz/git/LanguagePractice/haskell/sha1sum.hs
Update of map is fine[][6]
I.e. latest is never really the latest version of the Map. I do add new on each pass through the loop, but always to the empty BySize Map.
The solution proposed below is amazingly helpful. However, I wish to exclude symbolic links.
I modified getAllFiles somewhat, to try to exclude symbolic links. But my approach fails to exclude directories that are symbolic links. I tried some variations that do not work. This is the version I have that only partially works:
-- Lazily return (normal) files from rootdir
getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = do
    nodes <- pathWalkLazy root
    -- get file paths from each node
    let files = [dir </> f | (dir, _, files) <- nodes, f <- files ]
    normalFiles <- filterM (liftM not . pathIsSymbolicLink) files
    return normalFiles
Answer:
I'll let someone else provide a direct answer to your question, but the right way to do this is probably not to do this. The program you want to write is:
getBySize :: FilePath -> IO BySize
getBySize root = do
    -- first, get all the files
    files <- getAllFiles root
    -- convert them all to finfos
    finfos <- mapM getFinfo files
    -- get a list of size/finfo pairs
    let pairs = [(f_size finfo, finfo) | finfo <- finfos]
    -- convert it to a map, allowing duplicate keys
    return $ fromListWithDuplicates pairs
This is a reasonable, functional way of accomplishing your goal. You grab all the filenames at once and apply some functional transformations (to Finfos, to pairs, to a Map). No need to fuss with mutability or state.
Writing fromListWithDuplicates is a little complicated, but it's standard. It gets rewritten so often that it, or something like it, should probably be part of Data.Map:
fromListWithDuplicates :: Ord k => [(k, v)] -> Map k [v]
fromListWithDuplicates pairs = Map.fromListWith (++) [(k, [v]) | (k, v) <- pairs]
The idea is that it takes the list of key-value pairs, converts all the values to singleton lists, and then uses fromListWith to produce a map by concatenating those singletons together in case of duplicates.
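To make the grouping concrete, here is a small sketch on made-up data. The names samplePairs and grouped and the sample values are hypothetical, chosen only for illustration.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

fromListWithDuplicates :: Ord k => [(k, v)] -> Map k [v]
fromListWithDuplicates pairs = Map.fromListWith (++) [(k, [v]) | (k, v) <- pairs]

-- Hypothetical sample data: (size, file name) pairs, with size 6 appearing twice.
samplePairs :: [(Int, String)]
samplePairs = [(6, "that"), (3235, "sha1sum.hi"), (6, "other")]

-- Both entries of size 6 end up in one list under the key 6.
grouped :: Map Int [String]
grouped = fromListWithDuplicates samplePairs
-- Map.toList grouped == [(6, ["other", "that"]), (3235, ["sha1sum.hi"])]
```

Note that fromListWith passes the newer value as the first argument of the combining function, so within each list the later pairs come first. If the original order matters, reverse the lists afterwards.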
You probably already have a getFinfo function, whatever your Finfo is. I used the following for testing:
data Finfo = Finfo { f_path :: FilePath, f_size :: Int }

getFinfo :: FilePath -> IO Finfo
getFinfo path = do
    sz <- getFileSize path
    return $ Finfo path (fromIntegral sz)
The only remaining function is getAllFiles, which gets a list of all files (as full path names, already joined with the parent directory). One way to write it is with pathWalkLazy from System.Directory.PathWalk:
getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = do
    nodes <- pathWalkLazy root
    -- get file paths from each node
    let files = [dir </> file | (dir, _, files) <- nodes, file <- files]
    return files
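The question's follow-up asks how to skip symbolic links, including directories that are symbolic links. Filtering the file list afterwards cannot undo a descent that the walk has already made, so one option is a hand-written walk that checks each entry before recursing. This is a minimal sketch, not a drop-in replacement: getRegularFiles is a hypothetical name, it uses only functions from System.Directory, it is not lazy the way pathWalkLazy is, and it does no error handling (an unreadable directory will raise an exception).

```haskell
import Control.Monad (filterM)
import System.Directory
import System.FilePath ((</>))

-- Hypothetical alternative to getAllFiles that never follows symbolic links.
getRegularFiles :: FilePath -> IO [FilePath]
getRegularFiles root = do
    names <- listDirectory root
    let paths = [root </> name | name <- names]
    -- drop every symbolic link, whether it points to a file or a directory
    real <- filterM (fmap not . pathIsSymbolicLink) paths
    files <- filterM doesFileExist real
    dirs <- filterM doesDirectoryExist real
    -- recurse only into directories that are not links
    nested <- mapM getRegularFiles dirs
    return (files ++ concat nested)
```

Because the link check happens before the recursion, a symlinked directory is never entered, which is the case the filter over the finished file list misses.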
A full sample program. It takes a single argument, the directory to process.
import System.Directory
import System.Directory.PathWalk
import System.Environment
import System.FilePath
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type BySize = Map Int [Finfo]

getBySize :: FilePath -> IO BySize
getBySize root = do
    -- first, get all the files
    files <- getAllFiles root
    -- convert them all to finfos
    finfos <- mapM getFinfo files
    -- get a list of size/finfo pairs
    let pairs = [(f_size finfo, finfo) | finfo <- finfos]
    -- convert it to a map, allowing duplicate keys
    return $ fromListWithDuplicates pairs

-- this is a little complicated, but standard
fromListWithDuplicates :: Ord k => [(k, v)] -> Map k [v]
fromListWithDuplicates pairs = Map.fromListWith (++) [(k, [v]) | (k, v) <- pairs]

getAllFiles :: FilePath -> IO [FilePath]
getAllFiles root = do
    nodes <- pathWalkLazy root
    -- get file paths from each node
    let files = [dir </> file | (dir, _, files) <- nodes, file <- files]
    return files

data Finfo = Finfo { f_path :: FilePath, f_size :: Int }
    deriving (Show)

getFinfo :: FilePath -> IO Finfo
getFinfo path = do
    sz <- getFileSize path
    return $ Finfo path (fromIntegral sz)

main :: IO ()
main = do
    [root] <- getArgs
    bs <- getBySize root
    print bs
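For comparison with the loop in the question: a let inside a do block introduces a fresh name that shadows the old one for the rest of that block. It does not update the outer binding, so each iteration sees the original value. If you do want the loop shape, the map has to be passed from one step to the next explicitly, and foldM from Control.Monad does exactly that. The following is a sketch under assumptions: Finfo is a stand-in for the real type, addBySize is a guess at the question's function based on its description, and step and collect are hypothetical names.

```haskell
import Control.Monad (foldM)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Stand-in for the question's Finfo; the real type may have more fields.
data Finfo = Finfo { f_path :: FilePath, f_size :: Int }
    deriving (Show, Eq)

type BySize = Map Int [Finfo]

-- Assumed behaviour of the question's addBySize:
-- add a new key, or prepend to the list already stored under that key.
addBySize :: Int -> Finfo -> BySize -> BySize
addBySize size finfo = Map.insertWith (++) size [finfo]

-- One loop step: receives the map built so far and returns the updated map.
step :: BySize -> Finfo -> IO BySize
step acc finfo = do
    let new = addBySize (f_size finfo) finfo acc
    putStrLn ("keys so far: " ++ show (Map.keys new))
    return new

-- foldM feeds the result of each step into the next one, starting from empty.
collect :: [Finfo] -> IO BySize
collect = foldM step Map.empty
```

Calling collect on a list of Finfo values prints a growing list of keys, which is the behaviour the question was after. The State monad would work as well, but foldM is usually enough when a single accumulator is all that needs threading.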