I'm a beginner with Megaparsec, and with Haskell in general, and I'm trying to write a parser for the following grammar:
A word will always be one of:
- A number composed of one or more ASCII digits (e.g. "0" or "1234") OR
- A simple word composed of one or more ASCII letters (e.g. "a" or "they") OR
- A contraction of two simple words joined by a single apostrophe (e.g. "it's" or "they're")
So far, I've got the following (this can probably be simplified):
data Word = Number String | SimpleWord String | Contraction String deriving (Show)
word :: Parser MyParser.Word
word = M.choice
  [ Number <$> number
  , Contraction <$> contraction
  , SimpleWord <$> simpleWord
  ]
number :: Parser String
number = M.some C.numberChar
simpleWord :: Parser String
simpleWord = M.some C.letterChar
contraction :: Parser String
contraction = do
  left <- simpleWord
  void $ C.char '\''
  right <- simpleWord
  return (left ++ "'" ++ right)
But I'm having trouble defining a parser that skips whitespace and anything else that is non-alphanumeric. For example, given the input 'abc', the parser should discard the apostrophes and just take the simple word.
The following doesn't compile:
filler :: Parser Char
filler = M.some (C.spaceChar A.<|> not C.alphaNumChar)
spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer
CodePudding user response:
Here is the complete working code that I came up with.
-- Module header and imports, inferred from the qualified names used below.
module WordCount where
import qualified Control.Applicative as A
import Control.Monad (void)
import qualified Data.Char as C
import Data.Void (Void)
import qualified Text.Megaparsec as M
import qualified Text.Megaparsec.Char as MC
import qualified Text.Megaparsec.Char.Lexer as L
type Parser =
  M.Parsec
    -- The type for custom error messages. We have none, so use `Void`.
    Void
    -- The input stream type. Let's use `String` for now.
    String
data Word = Number String | SimpleWord String | Contraction String deriving (Eq)
instance Show WordCount.Word where
  show (Number x) = x
  show (SimpleWord x) = x
  show (Contraction x) = x
words :: String -> Either String [String]
-- Force the parser to consume the entire input.
-- <* sequences two actions, discarding the value of the second.
words input = case M.parse (M.some WordCount.word A.<* M.eof) "" input of
  -- err :: M.ParseErrorBundle String Void
  Left err -> Left (M.errorBundlePretty err)
  Right xs -> Right (map show xs)
word :: Parser WordCount.Word
word =
  M.skipManyTill filler $
    lexeme $
      M.choice
        -- <$> is infix for 'fmap'
        [ Number <$> number,
          Contraction <$> M.try contraction,
          SimpleWord <$> simpleWord
        ]
number :: Parser String
number = M.some MC.numberChar
simpleWord :: Parser String
simpleWord = M.some MC.letterChar
contraction :: Parser String
contraction = do
  left <- simpleWord
  void $ MC.char '\''
  right <- simpleWord
  return $ left ++ "'" ++ right
-- Define separator characters
isSep :: Char -> Bool
isSep x = C.isSpace x || (not . C.isAlphaNum) x
-- Fillers fill the space between tokens
filler :: Parser ()
filler = void $ M.some $ M.satisfy isSep
-- 3rd and 4th arguments are for ignoring comments
spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty
-- A parser that discards trailing space
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer
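The part of the code above that skips leading junk is skipManyTill: it tries the end parser first and, each time that fails without consuming input, runs the skipping parser once and tries again. Here is a minimal standalone sketch of just that behaviour. The names junk and firstWord are made up for the illustration, and takeRest is only there so the demo accepts whatever follows the first word.

```haskell
import Data.Char (isAlphaNum)
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseMaybe, satisfy, skipManyTill, some, takeRest)
import Text.Megaparsec.Char (letterChar)

type Parser = Parsec Void String

-- Hypothetical name: a single non-alphanumeric character (whitespace included).
junk :: Parser Char
junk = satisfy (not . isAlphaNum)

-- Hypothetical name: skip junk until a run of letters starts, return that run,
-- then swallow the rest of the input so parseMaybe can succeed.
firstWord :: Parser String
firstWord = skipManyTill junk (some letterChar) <* takeRest
```

With these definitions, parseMaybe firstWord "  ''abc' def" gives Just "abc", while parseMaybe firstWord "12abc" gives Nothing, because '1' is neither junk nor a letter.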
CodePudding user response:
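First, you are right to use some (one or more) rather than many (zero or more) for numbers and simple words; otherwise "" would be a number.
The idea behind your filler parser is good, but not can't be applied to a parser (it works on Bool), so use satisfy with a predicate such as not . isAlphaNum instead. Filler should also use many rather than some, because you want to allow e.g. "they1234" to parse as SimpleWord "they" and Number "1234", with zero-length filler in between.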
What you need to say for the overall parser is that your text consists of zero or more words separated by filler, with optional filler before and after. Fortunately megaparsec re-exports lots of useful stuff from Control.Monad.Combinators for doing this.
So we can use sepEndBy for the words separated by filler. (Plain sepBy is not enough here: after the last word it would consume trailing filler as a separator and then fail because no further word follows.)
document :: Parser [Word]
document = do
  _ <- filler -- Throw away any filler at the start.
  word `sepEndBy` filler -- Words separated by filler; filler after the last word is consumed too.
We don't need optional for the leading filler because filler can be zero length.
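For reference, here is a minimal self-contained sketch of this approach. Several things in it are my assumptions rather than anything from the question: the ASCII-only predicates, a filler built with many so that it can be zero length, the final eof, and the helper parseDocument. It uses sepEndBy because with sepBy a trailing run of filler gets consumed as a separator and the parse then fails for lack of a following word.

```haskell
import Prelude hiding (Word)

import Control.Applicative ((<|>))
import Control.Monad (void)
import Data.Char (isAlphaNum, isAscii, isAsciiLower, isAsciiUpper, isDigit)
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, many, parse, satisfy, sepEndBy, some, try)
import Text.Megaparsec.Char (char)

type Parser = Parsec Void String

data Word = Number String | SimpleWord String | Contraction String
  deriving (Eq, Show)

-- One or more ASCII letters.
letters :: Parser String
letters = some (satisfy (\c -> isAsciiLower c || isAsciiUpper c))

-- Zero or more characters that are not ASCII letters or digits.
filler :: Parser ()
filler = void (many (satisfy (\c -> not (isAscii c && isAlphaNum c))))

word :: Parser Word
word =
  Number <$> some (satisfy isDigit)
    <|> Contraction <$> try contraction -- backtrack if this is not a contraction
    <|> SimpleWord <$> letters
  where
    contraction = do
      left <- letters
      _ <- char '\''
      right <- letters
      return (left ++ "'" ++ right)

-- Optional leading filler, then words separated (and optionally ended) by filler.
document :: Parser [Word]
document = filler *> (word `sepEndBy` filler) <* eof

-- Hypothetical helper: Nothing on a parse error.
parseDocument :: String -> Maybe [Word]
parseDocument = either (const Nothing) Just . parse document ""
```

With this, parseDocument "'abc'" gives Just [SimpleWord "abc"] and parseDocument "they1234" gives Just [SimpleWord "they", Number "1234"].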
Finally, a style point: in a real parser you would want to make the Word type a bit more sophisticated. Something like:
data SimpleWord = Number String | SimpleWord String
data Word = Word SimpleWord | Contraction SimpleWord SimpleWord
That way whatever bit of code deals with Contraction downstream doesn't have to find the apostrophe all over again or deal with the "impossible" case where there isn't one. Once you've found the structure information in your input, don't throw it away. But that's a side issue for this exercise.
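If you do go that way, a parser for the richer type could look roughly like the sketch below. It is only one possible shape: it parses the left-hand word once and then looks for an optional apostrophe plus a second word, which also avoids re-parsing the left half. The letters parser, the ASCII predicates and the parseWord helper are my own additions for the illustration.

```haskell
import Prelude hiding (Word)

import Control.Applicative ((<|>))
import Data.Char (isAsciiLower, isAsciiUpper, isDigit)
import Data.Void (Void)
import Text.Megaparsec (Parsec, optional, parse, satisfy, some, try)
import Text.Megaparsec.Char (char)

type Parser = Parsec Void String

data SimpleWord = Number String | SimpleWord String
  deriving (Eq, Show)

data Word = Word SimpleWord | Contraction SimpleWord SimpleWord
  deriving (Eq, Show)

-- One or more ASCII letters.
letters :: Parser String
letters = some (satisfy (\c -> isAsciiLower c || isAsciiUpper c))

word :: Parser Word
word = number <|> wordOrContraction
  where
    number = Word . Number <$> some (satisfy isDigit)
    wordOrContraction = do
      left <- letters
      -- `try` undoes the apostrophe if no letters follow it.
      right <- optional (try (char '\'' *> letters))
      return $ case right of
        Nothing -> Word (SimpleWord left)
        Just r -> Contraction (SimpleWord left) (SimpleWord r)

-- Hypothetical helper: parse one word from the front of the input.
parseWord :: String -> Maybe Word
parseWord = either (const Nothing) Just . parse word ""
```

Here parseWord "they're" gives Just (Contraction (SimpleWord "they") (SimpleWord "re")), and parseWord "abc'" gives Just (Word (SimpleWord "abc")) with the apostrophe left unconsumed.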