Tail-recursive string splitting in Haskell-CodePudding

I'm considering the problem of splitting a string s at a character c.

This is expressed as

break (c ==) s

where the Haskell library definition of break (c ==) close enough to

br []      = ([],[])
br s@(h:t) = if (c == h)
             then ([],s)
             else let (h',t') = br t in (h:h',t')

(And let's suppose that I immediately wanted access to the second item of the return value, so that any lazy evaluation has been forced through.) The recursive call to br t appears to store h on the call stack, but a general sense of the algorithm indicates that this shouldn't be necessary. Here is one way of doing it in constant stack space, in a pseudocode language with mutability, where & denotes passage by reference, and lists are implemented as LISPy pairs:

br(c,s) = 
  allocate res_head,res_rest
  iter(c,s,&res_head,&res_rest)
  return (res_head,res_rest)
iter(c,s,&res_head,&res_rest) =
  case s of
    []   -> set res_head = res_rest = []         -- and terminate
    c:ss -> set res_head = [], res_rest = s      -- and terminate
    x:ss -> allocate new_pair
            set res_head = new_pair, new_pair.head = x
            iter(c,ss,&new_pair.tail,&res_rest)  -- tail call / jump

Whether or not GHC is smart enough to find this optimization, I'd like to formulate the computation in Haskell in a manner that is patently tail-recursive. How might one do this?

CodePudding user response：

The standard accumulator introduction trick would produce something like this:

breakAt :: Char -> String -> (String, String)
breakAt needle = breakAtAcc []
  where breakAtAcc :: String -> String -> (String, String)
        breakAtAcc seen [] = (reverse seen, [])
        breakAtAcc seen cs@(c:cs')
          | c == needle
            = (reverse seen, cs)
          | otherwise
            = breakAtAcc (c : seen) cs'

We process the characters that make up the pre-split part of the return value in the wrong order for building up a list, so they need to be reversed at the end. However even ignoring that (using a version without the `reverse), this is probably worse.

In Haskell you're worrying about the wrong thing if you're concerned about the stack overflow errors you would see from deep recursion in many other languages (often prevented by tail call optimisation, hence tail recursion being desirable). Haskell does not have this kind of stack overflow. Haskell does have a stack, which can overflow, but it's not the normal call stack from imperative languages.

For example, if I start GHCi with ghci RTS -K65k to explicitly set the maximum stack size to 65 KB (about the smallest value I could get it to start up with), then tripping the standard foldr ( ) stack overflow doesn't take much:

λ foldr ( ) 0 [1..3000]
*** Exception: stack overflow

A mere 3,000 recursive steps kills it. But I can run your br on much larger lists without problem:

λ let (pre, post) = br 'b' (replicate 100000000 'a'    "b") in (length pre, length post)
(100000000,1)
it :: (Int, Int)

That's 100 million non-tail recursive steps. If each of those took a stack frame and they were fitting in out 65 KB stack, we'd be getting about 1500 stack frames for every byte. Clearly this kind of recursion does not actually cause the stack consumption problems it does in other languages! That's because it's not the recursion depth itself that's causing the stack overflow in foldr ( ) 0 [1..3000].

The advantage br has over a tail-recursive version like breakAt is that it's productive. If you're only interested in the first n characters of the prefix, then at most n characters of the input string will be examined (if you're interested in the post-split string, then obviously it will need to examine enough of the string to find the split). You can observe this by running br and breakAt on a long input string and taking small bit of prefix, something like this:

λ let (pre, post) = br 'b' (replicate 100000000 'a'    "b") in take 5 pre
"aaaaa"
it :: [Char]

If you try the same thing with breakAt (even if you take out the call to reverse), it'll at first only print " and then spend a long time thinking before eventually coming up with the rest of "aaaaa". That's because it has to find the split point before it returns anything except another recursive call; the first character of the prefix is not available until the split point has been reached. And that's the essence of tail recursion; there's no way to fix it.

You can see it even more definitively by using undefined:

λ let (pre, post) = br 'b' ("12345"    undefined) in take 5 pre
"12345"
it :: [Char]

λ let (pre, post) = breakAtRev 'b' ("12345"    undefined) in take 5 pre
"*** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:74:14 in base:GHC.Err
  undefined, called at <interactive>:18:46 in interactive:Ghci8

br can return the first 5 characters without examining whether or not there is a 6th. breakAt (with or without reverse) forces more of the input, and so hits the undefined.

This is a common pattern in Haskell. Changing an algorithm to make it tail recursive frequently makes performance worse. You do want tail recursion if the final return value is a small type like an Int, Double, etc that can't be consumed in a "gradual" way; but you need to make sure any accumulator parameter you're using is strictly evaluated in that case! That's why for summing a list foldl' is better than foldr; there's no way to consume the sum "gradually" so we want tail recursion like foldl, but it has to be the strict variant foldl' or we still get stack overflows even though it's tail recursive! But when you're returning something like a list or a tree, it's much better if you can arrange for consuming the result gradually to cause the input to be read gradually. Tail recursion fundamentally does not allow this.