How does a language like OCaml implement pattern matching?-CodePudding

Say I have a code snippet like so:

type 'a mylist = | Nil | Cons of 'a * 'a mylist

let rec last l =
    match l with
    | Nil -> None
    | Cons(x, Nil) -> Some(x)
    | Cons(x, xs) -> last xs

Obviously, this pattern matches over the constructors of mylist - Nil and Cons. My question is how you'd go about implementing something like this. You can't just feed values to Cons until you find some that match l, so i'm assuming the OCaml compiler implements something like a "reverse constructor" that takes a value of type mylist and attempts to turn it into the constituent parts?

For example, in psudo-ocaml:

type 'a mylist = | Nil | Cons of 'a * 'a mylist

Cons : ('a * 'a mylist) -> 'a mylist

UnCons : 'a mylist -> ('a mylist * 'a) option

Is this correct, or does it take another approach? What about for, say, haskell or sml?

CodePudding user response：

OCaml uses an uniform representation of values under the hood. Each value is either a tagged integer (for instance 1) or a pointer to a block (for instance [|1;2|]. A block is itself an array containing a tag. When a variant type is defined

type t =
 | A
 | B
 | C of int
 | D of int

constructors without arguments are represented as tagged integers whereas constructors with arguments are mapped to blocks. A good way to learn more about this representation is to use the internal Obj module which contain functions that manipulate on the memory representation of OCaml values. For instance, we can peek at the underlying representation of the constructors. First, by checking that

Obj.(is_int @@ repr A)

returns true. Furthermore, we can also check that A is represented as a tagged 0 in memory with:

assert (Obj.repr A = Obj.repr 0)

Similarly, C 1 is a block with tag 0 containing an integer in its first field

assert Obj.(
  tag @@ repr (C 1) = 0
  && field (repr (C 1)) 0 = repr 1
)

With this uniform representation, pattern matching can be compiled into either a decision tree or a backtracking automata. Looking at your example:

type 'a mylist = | Nil | Cons of 'a * 'a mylist

let rec last l =
    match l with
    | Nil -> None
    | Cons(x, Nil) -> Some(x)
    | Cons(x, xs) -> last xs

The pattern match can be resumed to:

check if the argument is Nil(aka 0), then return None
if not check if the second field of the argument is Nil (aka 0), then return Some (first_field of the argument)
else call last (second_field of the argument).

You can then compare that idea with the actual OCaml implementation by asking the OCaml compiler to emit its lambda intermediary representation with ocamlc -dlambda:

(letrec
  (last/271
     (function l/272
       (if l/272
         (if (field_imm 1 l/272) (apply last/271 (field_imm 1 l/272))
           (makeblock 0 (field_imm 0 l/272)))
         0)))

Looking closely at this IR, we can observe that the pattern matching has been compiled to two if, the first one on the list l/272 itself, and the second one on field_imm 1 l/272, in other words on the second field of l/2

Note that in this specific case, there was not many possible optimizations so the compiled IR is quite close to the original pattern match.

However, for more complex match, there are quite few layer of optimization to ensure that the generated code is small and efficient by moving around clause, merging or splitting them, and trying to keep as much contextual information as possible when a clause fails.

Currently in 2022, the OCaml compiler implementation mostly follows the algorithm described by https://dl.acm.org/doi/10.1145/507669.507641 for compiling pattern matching to efficient backtracking automata.

CodePudding user response：

The Spineless Tagless G-machine (STG machine) is a paper describing the general approach that GHC takes to evaluating expressions. It is quite old, and various improvements have been made since then, but I believe the general idea is still right (and even if it is completely outdated it's still interesting in describing one way you could do this stuff). The approach to handling data constructors and case expressions is only one of the topics it covers, though, and it is tough to just provide excerpts from that section without having to summarize all the previous sections as well. But, broadly speaking, every value in the language is represented as a pointer to executable code, and to use that value is to jump to the code it points to. The code pointed to by a Cons will be different than the code pointed to by a Nil, so you get the discrimination that way.

A related thing you might consider: how would you implement cons lists using only functions, with no type facility in the language to explicitly define new types? It turns out this can be done for any data type in a fairly mechanical way, by representing a value as a function, where the function expects functions as arguments, one argument per constructor the type has. The nth argument represents "what I want to do if the value is of the nth constructor", and it requires one argument per field that constructor has, to let you get at the values in its fields.

That's a bit hard to follow in prose, so let's look at an example using cons and nil. I'll use Haskell notation, which I'm more familiar with, but the same ideas should be expressible in OCaml. The one part I'm not happy with pedagogically is the newtype at the beginning. It looks like "cheating" by defining a new data type, but it's necessary only at the type level, as a level of indirection to avoid an infinite type. At the term level, it's just a wrapper around a function.

{-# LANGUAGE RankNTypes #-}

import Prelude hiding (head, tail, last)

newtype List a = List (forall r. r -> (a -> List a -> r) -> r)

nil :: List a
nil = List (\n c -> n)

cons :: a -> List a -> List a
cons h t = List (\n c -> c h t)

head :: List a -> Maybe a
head (List list) = list Nothing (\h t -> Just h)

tail :: List a -> Maybe (List a)
tail (List list) = list Nothing (\h t -> Just t)

nil and cons are the constructors of this type, and head and tail are example simple consumers. Observe that nil is represented by a function that just returns its first argument, while cons is represented by a function that passes its head and tail to its second argument. head and tail each must pass two arguments to their lists, to describe how they want to handle empty and non-empty lists respectively.

How would we implement your last? Recursively, of course:

last :: List a -> Maybe a
last (List list) = list Nothing (\h t -> Just (go h t))
  where go :: a -> List a -> a
        go curr (List rest) = rest curr go

The point of this exercise is to show that there's nothing fundamentally special about data types with multiple constructors or fields - it can all just be seen as syntax sugar over closures and functions. No real language implementation is going to do things as crudely as my example, but the STG machine's approach of treating each value as a function is in some ways just a (much) more sophisticated implementation of this idea.