Megaparsec: transforming comment syntax into a Record

Time:11-04

Using Megaparsec, if I want to parse a string containing comments of the form ~{content} into a Comment record, how would I go about doing that? For instance:

data Comment = Comment { id :: Integer, content :: String }

parse :: Parser [Comment]
parse = _

parse
  "hello world ~{1-sometext} bla bla ~{2-another comment}" 
  == [Comment { id = 1, content = "sometext" }, Comment { id = 2, content = "another comment"}]

The thing I'm stuck on is allowing for everything that's not ~{} to be ignored, including the lone char ~ and the lone brackets {}.

CodePudding user response:

You can do this by dropping characters up to the next tilde, then parsing the tilde optionally followed by a valid comment, and looping.

In particular, if we define nonTildes to discard non-tildes:

nonTildes :: Parser String
nonTildes = takeWhileP (Just "non-tilde") (/= '~')

and then an optionalComment to parse a tilde and optional following comment in braces:

optionalComment :: Parser (Maybe Comment)
optionalComment = char '~' *>
  optional (braces (Comment <$> ident_ <* char '-' <*> content_))
  where
    braces = between (char '{') (char '}')
    ident_ = read <$> takeWhile1P (Just "digit") isDigit
    content_ = takeWhileP Nothing (/= '}')

Then the comments can be parsed with:

comments :: Parser [Comment]
comments = catMaybes <$> (nonTildes *> many (optionalComment <* nonTildes))

This assumes that a ~{ without a matching } is a parse error, rather than valid non-comment text, which seems sensible. However, the definition of the content_ parser is probably too liberal. It gobbles everything up to the next }, meaning that:

"~{1-{{{\n}"

is a valid comment with content "{{{\n". Disallowing { (and maybe ~) in comments, or alternatively requiring braces in comments to be properly nested, seems like a good idea.
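
As a rough sketch of those two options, the content parser could be tightened along the following lines. The names strictContent, nestedContent and commentWith are made up for this illustration and are not part of the answer's code; the Eq instance on Comment is added only so results can be compared.

```haskell
{-# OPTIONS_GHC -Wall #-}

import Data.Char (isDigit)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char)

type Parser = Parsec Void String

data Comment = Comment { ident :: Integer, content :: String } deriving (Show, Eq)

-- Option 1 (hypothetical name): reject '{', '}' and '~' inside the content.
strictContent :: Parser String
strictContent = takeWhileP (Just "comment character") (`notElem` "{}~")

-- Option 2 (hypothetical name): allow braces in the content, but only
-- when they are balanced. The nested braces are kept in the result.
nestedContent :: Parser String
nestedContent = concat <$> many (plain <|> nested)
  where
    -- A non-empty run of characters that are not braces.
    plain  = takeWhile1P (Just "comment character") (`notElem` "{}")
    -- A balanced {...} group, parsed recursively.
    nested = (\s -> "{" ++ s ++ "}")
               <$> between (char '{') (char '}') nestedContent

-- A single ~{id-content} comment, parameterised over the content parser
-- (hypothetical helper, so both options can be tried side by side).
commentWith :: Parser String -> Parser Comment
commentWith content_ =
  char '~' *> between (char '{') (char '}')
    (Comment <$> ident_ <* char '-' <*> content_)
  where
    ident_ = read <$> takeWhile1P (Just "digit") isDigit
```

Loaded into GHCi, parseMaybe (commentWith strictContent) "~{1-{{{\n}" should give Nothing, and so should the nestedContent version, since the braces are unbalanced. By contrast, parseMaybe (commentWith nestedContent) "~{1-a{b}c}" should succeed with content "a{b}c".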

Anyway, here's a full code example for you to fiddle with:

{-# OPTIONS_GHC -Wall #-}

import Data.Char
import Data.Maybe
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void String

data Comment = Comment { ident :: Integer, content :: String } deriving (Show)

nonTildes :: Parser String
nonTildes = takeWhileP (Just "non-tilde") (/= '~')

optionalComment :: Parser (Maybe Comment)
optionalComment = char '~' *>
  optional (braces (Comment <$> ident_ <* char '-' <*> content_))
  where
    braces = between (char '{') (char '}')
    ident_ = read <$> takeWhile1P (Just "digit") isDigit
    content_ = takeWhileP Nothing (/= '}')

comments :: Parser [Comment]
comments = catMaybes <$> (nonTildes *> many (optionalComment <* nonTildes))

main :: IO ()
main = do
  parseTest comments "hello world ~{1-sometext} bla bla ~{2-another comment}"
  parseTest comments "~~~ ~~~{1-sometext} {junk}"
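
If a ~{ that does not form a complete comment should instead be skipped as ordinary text, one possible variation is to wrap the braces parser in try, so that a failed attempt rewinds to just after the tilde. The sketch below is an assumption about how that could look, not part of the code above: lenientComment and lenientComments are hypothetical names, and the content parser is restricted to exclude {, } and ~ so that an unterminated comment cannot swallow a later, well-formed one.

```haskell
{-# OPTIONS_GHC -Wall #-}

import Data.Char (isDigit)
import Data.Maybe (catMaybes)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char)

type Parser = Parsec Void String

data Comment = Comment { ident :: Integer, content :: String } deriving (Show, Eq)

nonTildes :: Parser String
nonTildes = takeWhileP (Just "non-tilde") (/= '~')

-- Like optionalComment, but `try` backtracks to just after the tilde when
-- the braces do not form a complete comment, so that text is skipped
-- rather than reported as a parse error.
lenientComment :: Parser (Maybe Comment)
lenientComment = char '~' *>
  optional (try (braces (Comment <$> ident_ <* char '-' <*> content_)))
  where
    braces = between (char '{') (char '}')
    ident_ = read <$> takeWhile1P (Just "digit") isDigit
    -- Assumption: '{', '}' and '~' are not allowed inside comment content.
    content_ = takeWhileP Nothing (`notElem` "{}~")

lenientComments :: Parser [Comment]
lenientComments =
  catMaybes <$> (nonTildes *> many (lenientComment <* nonTildes))
```

With this variant, an input such as "~{oops ~{2-ok}" should yield only the second comment, and "~{1-unterminated" should yield an empty list instead of an error. The trade-off is that malformed comments are silently dropped, which may hide typos in the input.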