Say I have a code snippet like so:
type 'a mylist = | Nil | Cons of 'a * 'a mylist
let rec last l =
match l with
| Nil -> None
| Cons(x, Nil) -> Some(x)
| Cons(x, xs) -> last xs
Obviously, this pattern matches over the constructors of mylist
- Nil
and Cons
.
My question is how you'd go about implementing something like this. You can't just feed values to Cons
until you find some that match l
, so i'm assuming the OCaml compiler implements something like a "reverse constructor" that takes a value of type mylist
and attempts to turn it into the constituent parts?
For example, in psudo-ocaml:
type 'a mylist = | Nil | Cons of 'a * 'a mylist
Cons : ('a * 'a mylist) -> 'a mylist
UnCons : 'a mylist -> ('a mylist * 'a) option
Is this correct, or does it take another approach? What about for, say, haskell or sml?
CodePudding user response:
OCaml uses an uniform representation of values under the hood. Each value is either a tagged integer (for instance 1
) or a pointer to a block (for instance [|1;2|]
. A block is itself an array containing a tag. When a variant type is defined
type t =
| A
| B
| C of int
| D of int
constructors without arguments are represented as tagged integers whereas constructors with arguments are mapped to blocks.
A good way to learn more about this representation is to use the internal Obj
module which contain functions that manipulate on the memory representation of OCaml values.
For instance, we can peek at the underlying representation of the constructors. First, by checking that
Obj.(is_int @@ repr A)
returns true. Furthermore, we can also check that A
is represented as a tagged 0
in memory with:
assert (Obj.repr A = Obj.repr 0)
Similarly, C 1
is a block with tag 0 containing an integer in its first field
assert Obj.(
tag @@ repr (C 1) = 0
&& field (repr (C 1)) 0 = repr 1
)
With this uniform representation, pattern matching can be compiled into either a decision tree or a backtracking automata. Looking at your example:
type 'a mylist = | Nil | Cons of 'a * 'a mylist
let rec last l =
match l with
| Nil -> None
| Cons(x, Nil) -> Some(x)
| Cons(x, xs) -> last xs
The pattern match can be resumed to:
- check if the argument is
Nil
(aka0
), then returnNone
- if not check if the second field of the argument is
Nil
(aka0
), then returnSome (first_field of the argument)
- else call
last (second_field of the argument)
.
You can then compare that idea with the actual OCaml implementation by asking the OCaml compiler to emit its lambda intermediary representation with ocamlc -dlambda
:
(letrec
(last/271
(function l/272
(if l/272
(if (field_imm 1 l/272) (apply last/271 (field_imm 1 l/272))
(makeblock 0 (field_imm 0 l/272)))
0)))
Looking closely at this IR, we can observe that the pattern matching has been compiled to two if
, the first one on the list l/272
itself, and the second one on field_imm 1 l/272
, in other words on the second field of l/2
Note that in this specific case, there was not many possible optimizations so the compiled IR is quite close to the original pattern match.
However, for more complex match
, there are quite few layer of optimization to ensure that the generated code is small and efficient by moving around clause, merging or splitting them, and trying to keep as much contextual information as possible when a clause fails.
Currently in 2022, the OCaml compiler implementation mostly follows the algorithm described by https://dl.acm.org/doi/10.1145/507669.507641 for compiling pattern matching to efficient backtracking automata.
CodePudding user response:
The Spineless Tagless G-machine (STG machine) is a paper describing the general approach that GHC takes to evaluating expressions. It is quite old, and various improvements have been made since then, but I believe the general idea is still right (and even if it is completely outdated it's still interesting in describing one way you could do this stuff). The approach to handling data constructors and case expressions is only one of the topics it covers, though, and it is tough to just provide excerpts from that section without having to summarize all the previous sections as well. But, broadly speaking, every value in the language is represented as a pointer to executable code, and to use that value is to jump to the code it points to. The code pointed to by a Cons will be different than the code pointed to by a Nil, so you get the discrimination that way.
A related thing you might consider: how would you implement cons lists using only functions, with no type
facility in the language to explicitly define new types? It turns out this can be done for any data type in a fairly mechanical way, by representing a value as a function, where the function expects functions as arguments, one argument per constructor the type has. The nth argument represents "what I want to do if the value is of the nth constructor", and it requires one argument per field that constructor has, to let you get at the values in its fields.
That's a bit hard to follow in prose, so let's look at an example using cons and nil. I'll use Haskell notation, which I'm more familiar with, but the same ideas should be expressible in OCaml. The one part I'm not happy with pedagogically is the newtype at the beginning. It looks like "cheating" by defining a new data type, but it's necessary only at the type level, as a level of indirection to avoid an infinite type. At the term level, it's just a wrapper around a function.
{-# LANGUAGE RankNTypes #-}
import Prelude hiding (head, tail, last)
newtype List a = List (forall r. r -> (a -> List a -> r) -> r)
nil :: List a
nil = List (\n c -> n)
cons :: a -> List a -> List a
cons h t = List (\n c -> c h t)
head :: List a -> Maybe a
head (List list) = list Nothing (\h t -> Just h)
tail :: List a -> Maybe (List a)
tail (List list) = list Nothing (\h t -> Just t)
nil
and cons
are the constructors of this type, and head
and tail
are example simple consumers. Observe that nil
is represented by a function that just returns its first argument, while cons
is represented by a function that passes its head and tail to its second argument. head
and tail
each must pass two arguments to their lists, to describe how they want to handle empty and non-empty lists respectively.
How would we implement your last
? Recursively, of course:
last :: List a -> Maybe a
last (List list) = list Nothing (\h t -> Just (go h t))
where go :: a -> List a -> a
go curr (List rest) = rest curr go
The point of this exercise is to show that there's nothing fundamentally special about data types with multiple constructors or fields - it can all just be seen as syntax sugar over closures and functions. No real language implementation is going to do things as crudely as my example, but the STG machine's approach of treating each value as a function is in some ways just a (much) more sophisticated implementation of this idea.