I want to select the n-th last line from a large text file (~10GB) in a Haskell program. I found a way to get the n-th last line from an in-memory string:
myText :: String
myText = unlines ["a", "b", "c", "d", "e", "f", "g"]  -- sample text with 7 lines

myLen = 7
n = 3 -- one-based from the end

myLines = lines myText
idx = myLen - n
theLine = head (drop idx myLines)

main :: IO ()
main = do
  putStrLn theLine
The documentation for the readFile function says it "reads the content lazily", so once readFile gets to the n-th last line, will it have stored all the preceding lines in memory (and then explode because I don't have that much memory)?
So, is readFile the right approach here? And how do I get the IO String output from readFile "in a lazy way" into a list of lines, so that I can then select the n-th last line?
CodePudding user response:
The question has several parts:
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
Not necessarily. If you only iterate over the contents once and produce a result, the garbage collector can deallocate each part of the contents as soon as you have moved past it.
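As an illustration, the following one-pass consumer runs in roughly constant space even on a huge file: `length` walks the lazily read string and each cons cell becomes garbage as soon as it has been passed. (The path "big.txt" is made up for the example.)

```haskell
-- Count lines without retaining the whole file in memory: the lazy
-- String produced by readFile is consumed and collected incrementally
-- as 'length' walks it, so only a small window is live at any time.
countLines :: String -> Int
countLines = length . lines

main :: IO ()
main = do
  contents <- readFile "big.txt"  -- hypothetical path
  print (countLines contents)
```

Holding a second reference to `contents` (say, printing its total length afterwards) would defeat this, because the whole string would then have to stay reachable.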
So, is readFile the right approach here?
My opinionated answer is that if it's for a serious tool, readFile isn't the right approach, because "lazy IO" is a can of worms.
If it's for a quick and dirty script, then go ahead; but otherwise, and especially if performance matters, it is probably best to use lower-level calls that read strict ByteStrings, and for your problem to read directly from the end of the file and process that.
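A minimal sketch of that backward-reading idea, under the assumption that the last n lines fit into one fixed-size tail chunk (the chunk size and function name are made up for illustration; a robust version would grow the chunk, and handle a partial first line, until it contains enough newlines):

```haskell
import qualified Data.ByteString.Char8 as BS
import System.IO

-- Read only the last chunk of the file strictly and pick the
-- n-th last line out of it, instead of streaming the whole file.
nthLastLine :: Int -> FilePath -> IO (Maybe BS.ByteString)
nthLastLine n path = withFile path ReadMode $ \h -> do
  size <- hFileSize h
  let chunkSize = 65536 :: Integer  -- assumed big enough for the last n lines
  hSeek h AbsoluteSeek (max 0 (size - chunkSize))
  chunk <- BS.hGetContents h        -- strict read of just the tail
  let ls  = BS.lines chunk
      len = length ls
  -- If the chunk holds fewer than n lines (or its first "line" may be
  -- partial), a real implementation would reread a larger tail.
  pure $ if len >= n then Just (ls !! (len - n)) else Nothing
```

For a ~10GB file this touches only the final 64KB on disk, which is the main point of seeking to the end rather than streaming from the start.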
CodePudding user response:
The following program will require only about as much memory as the longest n lines in the file being read:
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile "input.txt"  -- substitute your file's path
The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of it as operating in "one pass" over the file, looking at successive windows of n consecutive lines until it reaches the last one. Because it keeps no references to the file contents from before the window it is currently looking at, everything up to the current window can be garbage collected (and generally is, fairly soon).
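To see takeLast's behavior concretely without a file, the same definitions can be exercised on a plain list (repeated here so the snippet stands alone):

```haskell
-- Same definitions as in the answer above, repeated for a
-- self-contained demonstration on a pure list.
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs
dropUnary _ [] = []

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

main :: IO ()
main = do
  print (takeLast 3 [1 .. 10 :: Int])                  -- prints [8,9,10]
  putStrLn (head (takeLast 3 ["a","b","c","d","e"]))   -- prints "c", the 3rd last
```

Note that `drop n as` and `as` share the same spine, so the two arguments to dropUnary advance in lock-step n elements apart; this offset is exactly what keeps only the last n lines live at any moment.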