Home > other >  Sum types whose constituents can also be used on their own
Sum types whose constituents can also be used on their own

Time:09-17

I want to represent PDF files for parsing and printing and am struggling to find good types for this.

PDF files contain values, which can be text, names (identifiers), dictionaries mapping names to values, and many others that I left out of these examples. I started out with something like this:

data Value = Text String | Name String | Dictionary [(String, Value)]

instance Show Value where
    show (Text text) = "("    text    ")"
    show (Name name) = "/"    name
    show (Dictionary entries) = "<<"    unlines (showEntry <$> entries)    ">>" where
        showEntry (key, value) = show (Name key)    " "    show value

Unfortunately, it would be easy for showEntry to accidentally use just show key or even show (Text key). The type system does not help with picking the right implementation. The fact that dictionary keys are names is not captured by their type, which is just String.

One could solve this by modeling keys as values:

data Value = Text String | Name String | Dictionary [(Value, Value)]

instance Show Value where
    show (Text text) = "("    text    ")"
    show (Name name) = "/"    name
    show (Dictionary entries) = "<<"    unlines (showEntry <$> entries)    ">>" where
        showEntry (key, value) = show key    " "    show value

This way, showEntry gets a key of type Value and thus automatically uses the correct implementation. However, this is arguably even worse, as now invalid Dictionary values with keys that are not names can be represented.

My next idea was to use separate types. Just like names are used in dictionaries, so are text and dictionaries used in other data structures, so they should also get their own types:

data Text = Text String
data Name = Name String
data Dictionary = Dictionary [(Name, Value)]
data Value = TextValue Text | NameValue Name | DictionaryValue Dictionary

instance Show Text where show (Text text) = "("    text    ")"
instance Show Name where show (Name name) = "/"    name
instance Show Dictionary where
    show (Dictionary entries) = "<<"    unlines (showEntry <$> entries)    ">>" where
        showEntry (key, value) = show key    " "    show value
instance Show Value where
    show (TextValue text) = show text
    show (NameValue name) = show name
    show (DictionaryValue dictionary) = show dictionary

The types now accurately represent the structure of the data and all is well. Unfortunately, this feels very ugly and redundant. To construct values, one now needs twice as many constructors:

DictionaryValue (Dictionary [(Name "foo", TextValue (Text "bar")), (Name "test", NameValue (Name "baz"))])

It feels like the tagged unions in ADTs just get in the way here, since the tags are unneeded, with the types already uniquely determining which case is being chosen.

Is this the best we can do or is there a nicer way for this scenario?

I imagine this type of problem comes up all the time in parsing formats that allow for nested values (like arithmetic expressions, XML, JSON, DSLs, etc.). What is the canonical/usual representation people use for this?

CodePudding user response:

My first recommendation would be to not use show for this. It's generally considered a bad idea to have show produce anything but valid Haskell code that would reproduce its input if passed to read. Using a new function, say pprintValue, instead of Show, already fixes several of your problems. It's now impossible to accidentally call pprintValue on a String, because it's now a function of concrete type Value -> String, rather than a polymorphic type.

Having done that, I would in fact do at least some of what your second snippet proposes. In particular you care quite a bit about the different contexts a string can appear in, and so I think Text and Name merit newtypes:

newtype Name = MkName String
newtype Text = MkText String

For Dictionary I might not bother, though. You don't need a lot of help disambiguating it, and any list of the right type seems like a safe way to construct a Dictionary Value. You complain that this "costs" some extra constructor calls. This is true, but not as bad as you suggest. Yes, when creating a Value from scratch, you have to write another newtype constructor or two. But how often do you do that? Probably only in one or two places, where you parse a file or decide on the response your server should provide. Often, you already have a Value, and its stuff is already wrapped up properly. For example, you might want to add to a dictionary value:

insert :: Name -> Value -> [(Name, Value)] -> [(Name, Value)]
insert n v d = (n, v) : d

Still, you will end up doing some newtype wrapping and unwrapping. But it's still useful, to help you make sure you are using them in the right way. This will help throughout your program, not just in calls to show (or pprintValue).

  • Related