I ran into the documentation on Solo
, the one element tuple, and was a bit confused about how it says it can prevent space leaks, which makes me suspect I'm not understanding something about how the Haskell memory model and/or garbage collector works.
To quote the docs, they say:
The most important feature of Solo is that it is possible to force its "outside" (usually by pattern matching) without forcing its "inside", because it is defined as a datatype rather than a newtype. One situation where this can be useful is when writing a function to extract a value from a data structure. Suppose you write an implementation of arrays and offer only this function to index into them:
index :: Array a -> Int -> a
Now imagine that someone wants to extract a value from an array and store it in a lazy-valued finite map/dictionary:
insert "hello" (arr index 12) m
This can actually lead to a space leak. The value is not actually extracted from the array until that value (now buried in a map) is forced. That means the entire array may be kept live by just that value! Often, the solution is to use a strict map, or to force the value before storing it, but for some purposes that's undesirable.
This is what I'm having trouble understanding. Presumably a
is boxed, and hence the array arr
is an array of pointers (if it wasn't boxed, a
would already be evaluated and this argument would be moot).
So I'm guessing there's this pointer in this array arr
to an unevaluated thunk that is of type a
. We then place it in the map, so the map now contains a pointer to an unevaluated thunk of type a
. Now I don't understand why this array arr
needs to stay alive at this point. Nothing we've created in the map points to arr
. The map has got it's own pointer to the unevaluated thunk of type a
, which it can evaluate at it's own leisure. The only thing keeping arr
alive might be if the unevaluated thunk depends on the array arr
, but if that's the case I'm not sure how wrapping the value in a Solo
data type helps?
I'm sure I'm missing something. And I suspect understanding what I'm missing will expose what of my above thinking is wrong. And that's a good thing if I can work out where I'm wrong. So any ideas/explanations?
CodePudding user response:
There are essentially two kinds of "space leaks" in Haskell. One is wasting space on thunks when it would have been more space efficient to just produce the value early. The other is wasting space on big data structures when it would have been more efficient to produce them later (or not at all).
The authors are considering an expression like this:
index arr 12
Imaging that arr
is a large data structure and the result is a single element contained within it; all index
does is select the element. If the expression index arr 12
is left as a thunk, the thunk will necessarily contain a reference to arr
, and thus the garbage collector will not be able to reclaim the memory of arr
while the thunk is live.
The usual obvious thing to do would be to arrange for index arr 12
to be executed earlier than actually required (as the authors suggest, putting it in a strict Map
rather than a lazy one, but the context of "putting it in a map" isn't actually necessary). If you force the expression index arr 12
when you decide that's what you're getting (as a strict map would do when you insert something into it), rather than when you actually use it for anything, then the function index
has run to completion at that point of decision and the reference to arr
no longer needs to be kept around until you use the result.
But remember that forcing something evaluates it to the outermost data constructor. index
doesn't involve any data constructors, since it just returns a value that was already inside arr
. So the outermost data constructor reached by evaluating index arr 12
is going to be something from whatever type the element is. But what if the elements of arr
(or at least the one at index 12) were themselves stored as unevaluated thunks? If those elements are actually large, it's entirely possible that fully generating one of those elements isn't much better than storing a big array of thunks1. By forcing index arr 12
early we may have avoided one kind of space leak (keeping a big thunk too long) but caused another (producing a big value too early). And without pinning down the type involved, we can't know which is worse!
The problem is that evaluating to the outermost data constructor has forced "too much". We want the evaluation to proceed far enough to no longer depend on arr
(i.e. to know which of the elements it contained we are returning), but we don't want to actually enter a thunk representing the element.
The way you could use Solo
here is simply to wrap a data constructor around the returned element, so that when you force the thunk to the outermost constructor you get to Solo
and no further. The authors state that it a common solution to the issue of space leaks from indexing thunks holding onto entire arrays is "to include an indexing function that can produce its result in an arbitrary Applicative context: indexA :: Applicative f => Array a -> Int -> f a
", and that you could use Solo
as the applicative to put an extra data constructor in the right place without needing to use an applicative functor that actually has any interesting effects.
As I understand it though, wrapping in Solo
only solves the second potential space leak. indexA arr 12 :: Solo a
doesn't magically stop depending on arr
if you leave it as a thunk. However it enables you to use early evaluation to solve the arr
space leak without having to accept a potential leak from the element itself.
1 Or simply that producing it in full is costly enough, in time or space, that we don't want to pay for it yet. And it might not be absolutely determined that we're going to use it at all yet; if the downstream consumer turns out not to need it when we would rather not produce it even if the element is much smaller than the original array (all we would need is that it's smaller than the thunk representing itself).
CodePudding user response:
First off, the documentation you quoted has an error and it's actually rather relevant.
insert "hello" (arr index 12) m
should be
insert "hello" (index arr 12) m
This does, in fact, hold a pointer to arr
. Until index arr 12
is evaluated, it's a thunk holding pointers to each of index
, arr
, and 12
. The pointers to index
and 12
aren't a big deal, but arr
could be huge.
Now, as for the way Solo
helps... it wouldn't, in general. That's a really weird claim. Like, they propose a function
indexA :: Applicative f => Array a -> Int -> f a
and then to use it as so:
case arr indexA 12 of
Solo a -> insert "hello" a m
But that wouldn't actually help with anything, unless indexA
has a really unexpected implementation. Solo
's implementation of pure
is non-strict, as expected from the description of the data type. So the expected implementation of indexA
just wraps the result of the lookup with pure
:
indexA arr i = pure $ index arr i
In order for the explanation provided to make any sense, the implementation would need to be more like this:
indexA arr i = pure $! index arr i
I suppose if a library provides that function it only makes sense if it's got a stricter implementation, but I'm wary of assuming that's the implementation of a function like that or that Solo
is actually useful to solve problems like this documentation proposes.
Now, there is something actually useful about the strictness properties of Solo
, specifically with regards to the Monad
instance. Let's contrast with the Monad
instance of Identity
:
ghci> do { x <- pure () ; y <- undefined ; pure x } :: Identity ()
Identity ()
ghci> do { x <- pure () ; y <- undefined ; pure x } :: Solo ()
*** Exception: Prelude.undefined
CallStack (from HasCallStack):
undefined, called at <interactive>:8:26 in interactive:Ghci4
The fact that Solo
is lifted where Identity
is not makes the Monad
instance for Solo
more strict. (>>=)
forces the evaluation of the outer Solo
constructor in its first argument, which means it actually notices if it gets a bottom value as its first argument when it's otherwise unused. Because Identity
constructors don't exist at run time, evaluating them just defers all computation for later, making the (>>=)
implementation less strict
CodePudding user response:
So I'm guessing there's this pointer in this array arr to an unevaluated thunk that is of type a. We then place it in the map, so the map now contains a pointer to an unevaluated thunk of type a. Now I don't understand why this array arr needs to stay alive at this point.
The point is that insert "hello" (index arr 12) m
doesn't just place the existing unevaluated thunk into the map. It creates a new thunk to represent index arr 12
, and stores that in the map. And that thunk does require arr
to still be alive.