I want to select the n-th last line from a large text file (~10GB) in a Haskell program. I found a way to get the n-th last line from an in-memory string:
myText :: String
myText = unlines ["a", "b", "c", "d", "e", "f", "g"]  -- sample text with 7 lines

myLen = 7
n = 3 -- one-based from the end

myLines = lines myText
idx = myLen - n
theLine = head (drop idx myLines)

main :: IO ()
main = do
  putStrLn theLine
The documentation for the readFile function says it "reads the content lazily", so once readFile gets to the n-th last line, will it have stored all the preceding lines in memory (and then explode because I don't have that much memory)?
So, is readFile the right approach here? And how do I get the IO String output from readFile "in a lazy way" into a list of lines, so that I can then select the n-th last line?
CodePudding user response:
The question has several parts:
The documentation about the readFile function says it "reads the content lazily", so once readFile got to the n-th last line will it have stored all the lines before in memory (and then explodes because I don't have that much memory)?
Not necessarily. If you only iterate over the contents once and produce a result, the garbage collector can deallocate each part of the contents as soon as you have moved past it.
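As an illustration, the following one-pass consumer runs in roughly constant space even on a huge file: `length` walks the lazily read string and each cons cell becomes garbage as soon as it has been passed. (The path "big.txt" is made up for the example.)

```haskell
-- Count lines without retaining the whole file in memory: the lazy
-- String produced by readFile is consumed and collected incrementally
-- as 'length' walks it, so only a small window is live at any time.
countLines :: String -> Int
countLines = length . lines

main :: IO ()
main = do
  contents <- readFile "big.txt"  -- hypothetical path
  print (countLines contents)
```

Holding a second reference to `contents` (say, printing its total length afterwards) would defeat this, because the whole string would then have to stay reachable.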
So, is readFile the right approach here?
My opinionated answer is that if it's for a serious tool, readFile isn't the right approach, because "lazy IO" is a can of worms.
If it's for a quick and dirty script, then go ahead; but otherwise, and especially if performance matters, it is probably best to use lower-level calls that read strict ByteStrings, and for your problem to read directly from the end of the file and process that.
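A minimal sketch of that backward-reading idea, under the assumption that the last n lines fit into one fixed-size tail chunk (the chunk size and function name are made up for illustration; a robust version would grow the chunk, and handle a partial first line, until it contains enough newlines):

```haskell
import qualified Data.ByteString.Char8 as BS
import System.IO

-- Read only the last chunk of the file strictly and pick the
-- n-th last line out of it, instead of streaming the whole file.
nthLastLine :: Int -> FilePath -> IO (Maybe BS.ByteString)
nthLastLine n path = withFile path ReadMode $ \h -> do
  size <- hFileSize h
  let chunkSize = 65536 :: Integer  -- assumed big enough for the last n lines
  hSeek h AbsoluteSeek (max 0 (size - chunkSize))
  chunk <- BS.hGetContents h        -- strict read of just the tail
  let ls  = BS.lines chunk
      len = length ls
  -- If the chunk holds fewer than n lines (or its first "line" may be
  -- partial), a real implementation would reread a larger tail.
  pure $ if len >= n then Just (ls !! (len - n)) else Nothing
```

For a ~10GB file this touches only the final 64KB on disk, which is the main point of seeking to the end rather than streaming from the start.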
CodePudding user response:
The following program will require only about as much memory as the longest n lines in the file being read:
-- like drop, but takes its number encoded as a lazy
-- unary number via the length of the first list
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

main :: IO ()
main = putStrLn . head . takeLast 3 . lines =<< readFile "input.txt"  -- substitute your file's path
The Prelude's lines function is already suitably lazy, but some care was taken in writing takeLast here. You can think of it as operating in "one pass" over the file, looking at successive windows of n consecutive lines until it reaches the last one. Because it keeps no references to the file contents from before the window it is currently looking at, everything up to the current window can be garbage collected (and generally is, fairly soon).
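To see takeLast's behavior concretely without a file, the same definitions can be exercised on a plain list (repeated here so the snippet stands alone):

```haskell
-- Same definitions as in the answer above, repeated for a
-- self-contained demonstration on a pure list.
dropUnary :: [a] -> [b] -> [b]
dropUnary [] bs = bs
dropUnary (_:as) (_:bs) = dropUnary as bs
dropUnary _ [] = []

takeLast :: Int -> [a] -> [a]
takeLast n as = dropUnary (drop n as) as

main :: IO ()
main = do
  print (takeLast 3 [1 .. 10 :: Int])                  -- prints [8,9,10]
  putStrLn (head (takeLast 3 ["a","b","c","d","e"]))   -- prints "c", the 3rd last
```

Note that `drop n as` and `as` share the same spine, so the two arguments to dropUnary advance in lock-step n elements apart; this offset is exactly what keeps only the last n lines live at any moment.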