How do you model "metadata" in Haskell?-CodePudding

I'm writing a parser in Haskell (mostly just to learn). I have a working tokenizer and parser and I want to add line numbers when giving an error message. I have this type:

data Token = Lambda
  | Dot
  | LParen
  | RParen
  | Ident String

Back in OO land, I would just create a Metadata object that holds the token's position in the source code. So I could try this:

data Metadata = Pos String Int Int

Then, I could change Token to

data Token = Lambda Metadata
  | Dot Metadata
  | LParen Metadata
  | RParen Metadata
  | Ident String Metadata

However, my parser is written using pattern matching on the tokens. So now, all my pattern matching is broken because I need to also account for the Metadata. So that doesn't seem ideal. 99% of the time, I don't care about the Metadata.

So what's the "right" way to do what I want to do?

CodePudding user response：

There’s a wide array of approaches to the design of syntax representations in Haskell, but I can offer some recommendations and reasoning.

It’s advisable to keep metadata annotations out of the Token type, so that it sticks to a single responsibility. If a Token represents just a token, its derived instances for Eq and so on will work as expected without needing to worry about when to ignore the annotation.

Thankfully, the alternatives are simple in this case. One option is to move the annotation info to a separate wrapper type.

-- An @'Anno' a@ is a value of type @a@ annotated with some 'Metadata'.
data Anno a = Anno { annotation :: Metadata, item :: a }
  deriving
    ( Eq
    , Ord
    , Show
    -- …
    )

Now the tokeniser can return a sequence of annotated tokens, i.e. [Annotated Token]. You still need to update the use sites, but the changes are now much simpler. And you can ignore annotations in various ways:

-- Positional matching
f1 (Anno _meta (Ident name)) = …

-- Record matching
f2 Anno { item = Ident name } = …

-- With ‘NamedFieldPuns’
f3 Anno { item } = …

-- 'U'nannotated value; with ‘PatternSynonyms’
pattern U :: a -> Anno a
pattern U x <- Anno _meta x

f4 (U LParen) = …

You can deannotate a sequence of tokens with fmap item to reuse existing code that doesn’t care about location info. And since Anno is a type of kind Type -> Type, GHC can also derive Foldable, Functor, and Traversable for it, making it easy to operate on the annotated item with e.g. fmap and traverse.

This is the preferable approach for Token, but for a parsed AST containing annotations, you may want to make the annotation type a parameter of the AST type, for example:

data Expr a = Add a (Expr a) (Expr a) | Literal a Int
  deriving (Eq, Foldable, Functor, Ord, Show, Traversable)

Then you can use Expr Metadata for an annotated term, or Expr () for an unannotated one. To compare terms for equality, such as in unit tests, you can use the Functor instance to strip out the annotations, e.g. void expr1 == void expr2, where void is equivalent to fmap (\ _meta -> ()) here.

In a larger codebase, if there’s a lot of code depending on a data type and you really want to avoid updating it all at once, you can wrap the old type in a module that exports a pattern synonym for each of the old constructors. This lets you gradually update the old code before deleting the adapter module.

Culturally, it’s typical in a self-contained Haskell codebase to simply make breaking changes, and let the compiler tell you everywhere that needs to be updated, since it’s so easy to do extensive refactoring with high assurance that it’s correct. We’re more concerned with backward compatibility when it comes to published library code, since that actually affects other people.