Home > other >  High-level organization of a parser with a monadic parsing library
High-level organization of a parser with a monadic parsing library

Time:10-30

I have been trying to learn Haskell, so for a first substantial project I am trying to write a parsing library for some binary formats, but I have hit an obstacle of sorts and am not sure how to proceed -- alert, newbie here. For the rest of my question assume we are working with a monadic parsing library like Attoparsec. Also assume that the whole bytestring to parse is in memory (it is not very long, in the order of magnitude of kilobytes) so no need to handle lazy bytestrings or lazy input.

So the binary format is roughly like this:

  • A (fixed width) main block.
  • A sequence of (fixed width) secondary blocks.
  • A sequence of (fixed width) blocks -- call them ops.

They are stitched together in the following way: the main block contains the number of both the secondary blocks and the ops, as well as bytestring offsets to them. Both the main block and the secondary blocks can have a sequence of ops attached. And the way this is organized is by having a count of the number of ops and an index into the sequence of ops. I want to parse the whole thing into records for the fixed width parts, and a vector of ops for the attached block of ops that both the main and the secondary blocks carry. But how should I organize the parsing of the vectors of ops? My first idea is, on a general, high-level overview of the parsing sequence:

  1. parse the main block into a record.
  2. make notice of the offset of the ops block and the number of ops.
  3. using 2., look ahead and parse the block of ops as a list of ops.
  4. populate the main block's vector of ops from this list.
  5. parse the secondary blocks into records, using the list of ops to populate their own blocks of ops.

Thoughts? Is there a better way to organize the parser, by maybe making use of some combinator that I am not aware of? Any extra info just ask.

CodePudding user response:

I would try to avoid anything described as "lookahead". Parse the file in order. One way to do this is to have a different type for "main block, with indexes into the ops list" and "main block, with ops inlined". Parse into the first type first, then update it once you have parsed the ops.

data Main op = Main { numSecondary, numOps :: Int
                      ops :: [op]
                    } 

main :: Parser (Main Int)
main = do
  ns <- int
  no <- int
  Main ns no <$> replicateM no int

file :: Parser (Main Op, [Secondary Op])
file = do
  Main ns no mainOps <- main
  secs <- replicateM ns secondary
  ops <- replicateM no op
  pure ( Main ns no (map (ops !!) mainOps),
         map (inflateSecondary ops) secs
       )

inflateSecondary :: Secondary Int -> Secondary Op
inflateSecondary = undefined
  • Related