Megaparsec .. Find an Element Before Encountering a Different Element-CodePudding

I am trying to parse (using magaparsec) the XML export of FreePlane (mindmapper).

This is my third attempt to really 'learn' (internalize) megaparsec. I've written several parsers before, two worked (after a lot of struggling), and one I gave up and parsed manually. I must be missing some fundamental concepts ... Help in that direction would be appreciated

Executive Summary:

I need a Parser that does this ..

Find the first thing before you find the second thing.
If there is no first thing before you get to the second thing generate an error
Don't go past the second thing

I'm probably not explaining it well ..

Here is the current specific problem ..

An XML section is given below (dots arbitray characters) which shows two attribute 'chunks' each chunk surrounded by "< .. >"

... <... LINK="http://www.a.com" ... >...<... LINK="http://b.abc.com > ...

The problem is that the 'LINK' attribute is optional.

The 'chunks' have tons of attributes so I don't know how to parse them individually .. So I skip them arbitrarily only caring about the 'LINK' attribute.

So, if I do the naive thing on the above section ..

   manyTill asciiChar (string "LINK=\""
   text = manyTill asciiChar (char '"')
   return text

it works fine.

http://www.a.com

The link for the first chunk is returned.

If however the 'LINK' attribute is missing in the first chunk (shown below

...<..... >...<... LINK="http://b.abc.com > ...

my naive parser will return,

http://b.abc.com

which is incorrect.

I want it to error (so later I can use <|> )

It's been a 3 days on this problem trying different approaches .. including just trying to find a parser to do what I wanted ..

i.e. find the first thing (LINK) before you find the second thing ('>') .. If not error.

I finally gave up and tried the old fashioned way .. i.e. extract the text between the "< >", then use ordinary list functions to extract the link name .. and fail if it can't find it. This works but it's quite ugly!

Here's the code ..

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE NoMonomorphismRestriction #-}

import Data.List (find, isInfixOf, isPrefixOf)
import Data.Maybe
import qualified Data.Set as Set
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char

type Parser = Parsec Void String

--------------------------------
-- Skip a Chunk with no LINK  --
--------------------------------
skipChunk :: Parser String
skipChunk = do
  _ <- manyTill asciiChar (char '<')
  _ <- manyTill asciiChar (char '>')
  return ""

------------------------------------
-- Extrack LINK old fashioned way --
------------------------------------
linkName :: String -> String
linkName text =
  let ws = words text
   in fromJust $ find (\w -> isPrefixOf "LINK=" w) ws

----------------
-- Parse Link --
----------------
parseLink :: Parser String
parseLink = do
  _ <- lookAhead (manyTill asciiChar (char '<')) -- make sure to not consume input
  text <- lookAhead (manyTill asciiChar (char '>')) -- capture chunk
  name <-
    if (isInfixOf "LINK=\"" text) -- extract link old fashioned way
      then return $ linkName text
      else (failure Nothing Set.empty)
  return name

-----------------
-- Base Parser --
-----------------
baseParser :: Parser String
baseParser = do
  parseLink <|> skipChunk -- First parser fails, skips the chunk
  parseLink -- Yields LINK from second chunk

----------
-- Main --
----------
main :: IO ()
main = do
  result <- parseTest baseParser "...<.. INK=\"abc\" ... > ... <... LINK=\"http://b.abc.com > ..."
  putStrLn $ show result

This actually works and yields

"LINK="http://b.abc.com"

But wow .. is this ugly !!!

So,

Is there a Parser to "Find the first thing before you find the second thing .. If not error"
It seems like megaparsec is good if you are going character by character and your grammar is well defined and well known. Skipping arbitrary sections seems problematic .. Is this true?
I love Haskell but if anyone can point me to a good book on Visual Basic I'd appreciate it, as that seems to be my intellectual limit.

Thanks in advance,

Tom

CodePudding user response：

Apologies for the short answer

The issue is that manyTill asciiChar (string "LINK=\"") is incorrect. Not all of asciiChar should be allowed! In particular, > should not be allowed because it is supposed to mark the end of the current chunk
Trying to write an XML parser in this style (naive string searching) is probably not going to go well. XML is just too complex for string-searching to be effective on.

If your goal is to parse XML, I would recommend using an existing XML-parsing library. If your goal is to learn Megaparsec, I would recommend approaching the problem from a more generic point of view: instead of "how do I extract the LINK attribute?" something more like "How do I parse tags? How do I parse attributes? What should the result type look like?". Correctly implementing a program that extracts the LINK attribute will essentially require accounting for all of the complexity of XML, anyway, so may as well just make a generic parser.

Also, if you're trying to learn Megaparsec, XML is a pretty big chunk to bite off initially. Here are some intermediate problems, should they be of use to you, each of which should help with writing an XML parser: parse integers, parse strings (Haskell syntax), parse tuples of strings (("a", "b c", "d"), etc), parse tuples and nested tuples of strings (("a", ("b c", "d", ("e", "f gh i"), "j", ("k l", "m"))), etc).

Edit -- to address your remark that "as that seems to be my intellectual limit". Haskell is very hard, and parsing is also very hard. The fact that you're struggling with this problem does not necessarily reflect on you at all, because these things really genuinely are difficult. (Like for you, it also took me numerous attempts at parsing to start to "get" it. Same for Haskell!) If you really feel like this problem is too difficult for you at this time, then I would doubly-recommend starting with simpler ones (like my recommendations above) which will help with eventually grokking these tools