How does `Solo` prevent space leaks?-CodePudding

I ran into the documentation on Solo, the one element tuple, and was a bit confused about how it says it can prevent space leaks, which makes me suspect I'm not understanding something about how the Haskell memory model and/or garbage collector works.

To quote the docs, they say:

The most important feature of Solo is that it is possible to force its "outside" (usually by pattern matching) without forcing its "inside", because it is defined as a datatype rather than a newtype. One situation where this can be useful is when writing a function to extract a value from a data structure. Suppose you write an implementation of arrays and offer only this function to index into them:

index :: Array a -> Int -> a

Now imagine that someone wants to extract a value from an array and store it in a lazy-valued finite map/dictionary:

insert "hello" (arr index 12) m

This can actually lead to a space leak. The value is not actually extracted from the array until that value (now buried in a map) is forced. That means the entire array may be kept live by just that value! Often, the solution is to use a strict map, or to force the value before storing it, but for some purposes that's undesirable.

This is what I'm having trouble understanding. Presumably a is boxed, and hence the array arr is an array of pointers (if it wasn't boxed, a would already be evaluated and this argument would be moot).

So I'm guessing there's this pointer in this array arr to an unevaluated thunk that is of type a. We then place it in the map, so the map now contains a pointer to an unevaluated thunk of type a. Now I don't understand why this array arr needs to stay alive at this point. Nothing we've created in the map points to arr. The map has got it's own pointer to the unevaluated thunk of type a, which it can evaluate at it's own leisure. The only thing keeping arr alive might be if the unevaluated thunk depends on the array arr, but if that's the case I'm not sure how wrapping the value in a Solo data type helps?

I'm sure I'm missing something. And I suspect understanding what I'm missing will expose what of my above thinking is wrong. And that's a good thing if I can work out where I'm wrong. So any ideas/explanations?

CodePudding user response：

There are essentially two kinds of "space leaks" in Haskell. One is wasting space on thunks when it would have been more space efficient to just produce the value early. The other is wasting space on big data structures when it would have been more efficient to produce them later (or not at all).

The authors are considering an expression like this:

index arr 12

Imaging that arr is a large data structure and the result is a single element contained within it; all index does is select the element. If the expression index arr 12 is left as a thunk, the thunk will necessarily contain a reference to arr, and thus the garbage collector will not be able to reclaim the memory of arr while the thunk is live.

The usual obvious thing to do would be to arrange for index arr 12 to be executed earlier than actually required (as the authors suggest, putting it in a strict Map rather than a lazy one, but the context of "putting it in a map" isn't actually necessary). If you force the expression index arr 12 when you decide that's what you're getting (as a strict map would do when you insert something into it), rather than when you actually use it for anything, then the function index has run to completion at that point of decision and the reference to arr no longer needs to be kept around until you use the result.

But remember that forcing something evaluates it to the outermost data constructor. index doesn't involve any data constructors, since it just returns a value that was already inside arr. So the outermost data constructor reached by evaluating index arr 12 is going to be something from whatever type the element is. But what if the elements of arr (or at least the one at index 12) were themselves stored as unevaluated thunks? If those elements are actually large, it's entirely possible that fully generating one of those elements isn't much better than storing a big array of thunks¹. By forcing index arr 12 early we may have avoided one kind of space leak (keeping a big thunk too long) but caused another (producing a big value too early). And without pinning down the type involved, we can't know which is worse!

The problem is that evaluating to the outermost data constructor has forced "too much". We want the evaluation to proceed far enough to no longer depend on arr (i.e. to know which of the elements it contained we are returning), but we don't want to actually enter a thunk representing the element.

The way you could use Solo here is simply to wrap a data constructor around the returned element, so that when you force the thunk to the outermost constructor you get to Solo and no further. The authors state that it a common solution to the issue of space leaks from indexing thunks holding onto entire arrays is "to include an indexing function that can produce its result in an arbitrary Applicative context: indexA :: Applicative f => Array a -> Int -> f a", and that you could use Solo as the applicative to put an extra data constructor in the right place without needing to use an applicative functor that actually has any interesting effects.

As I understand it though, wrapping in Solo only solves the second potential space leak. indexA arr 12 :: Solo a doesn't magically stop depending on arr if you leave it as a thunk. However it enables you to use early evaluation to solve the arr space leak without having to accept a potential leak from the element itself.

¹ Or simply that producing it in full is costly enough, in time or space, that we don't want to pay for it yet. And it might not be absolutely determined that we're going to use it at all yet; if the downstream consumer turns out not to need it when we would rather not produce it even if the element is much smaller than the original array (all we would need is that it's smaller than the thunk representing itself).

CodePudding user response：

First off, the documentation you quoted has an error and it's actually rather relevant.

insert "hello" (arr index 12) m

should be

insert "hello" (index arr 12) m

This does, in fact, hold a pointer to arr. Until index arr 12 is evaluated, it's a thunk holding pointers to each of index, arr, and 12. The pointers to index and 12 aren't a big deal, but arr could be huge.

Now, as for the way Solo helps... it wouldn't, in general. That's a really weird claim. Like, they propose a function

indexA :: Applicative f => Array a -> Int -> f a

and then to use it as so:

case arr indexA 12 of
    Solo a -> insert "hello" a m

But that wouldn't actually help with anything, unless indexA has a really unexpected implementation. Solo's implementation of pure is non-strict, as expected from the description of the data type. So the expected implementation of indexA just wraps the result of the lookup with pure:

indexA arr i = pure $ index arr i

In order for the explanation provided to make any sense, the implementation would need to be more like this:

indexA arr i = pure $! index arr i

I suppose if a library provides that function it only makes sense if it's got a stricter implementation, but I'm wary of assuming that's the implementation of a function like that or that Solo is actually useful to solve problems like this documentation proposes.

Now, there is something actually useful about the strictness properties of Solo, specifically with regards to the Monad instance. Let's contrast with the Monad instance of Identity:

ghci> do { x <- pure () ; y <- undefined ; pure x } :: Identity ()
Identity ()

ghci> do { x <- pure () ; y <- undefined ; pure x } :: Solo ()
*** Exception: Prelude.undefined
CallStack (from HasCallStack):
  undefined, called at <interactive>:8:26 in interactive:Ghci4

The fact that Solo is lifted where Identity is not makes the Monad instance for Solo more strict. (>>=) forces the evaluation of the outer Solo constructor in its first argument, which means it actually notices if it gets a bottom value as its first argument when it's otherwise unused. Because Identity constructors don't exist at run time, evaluating them just defers all computation for later, making the (>>=) implementation less strict

CodePudding user response：

So I'm guessing there's this pointer in this array arr to an unevaluated thunk that is of type a. We then place it in the map, so the map now contains a pointer to an unevaluated thunk of type a. Now I don't understand why this array arr needs to stay alive at this point.

The point is that insert "hello" (index arr 12) m doesn't just place the existing unevaluated thunk into the map. It creates a new thunk to represent index arr 12, and stores that in the map. And that thunk does require arr to still be alive.