Converting a string consisting of three types of data types to a list-CodePudding

I have the following datatypes:

-- 1.1
data LineItem = Space | Newline | Word String
    deriving (Eq)

-- 1.2
mkSpace :: LineItem
mkSpace = Space

mkNewline :: LineItem
mkNewline = Newline

mkWord :: String -> LineItem
mkWord = Word 

-- 1.3
lineItemToStr :: LineItem -> String
lineItemToStr Space = show " "
lineItemToStr Newline = show "\n"
lineItemToStr (Word s1) = show s1

instance Show LineItem where
    show = lineItemToStr

Given a String I have to convert it to a list where each element of the list is one of the datatypes. What I have so far is:

toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems (w:rest) = mkWord w : toLineItems rest -- this one is obviously wrong

My confusion so far is how I handle the occasion where the head is a char. I would think that I accumulate the chars until the head is no longer a letter and then append it as one word. But I'm not quite sure how you would go around matching that pattern in Haskell.

If anyone could point me in the right direction I would appreciate it.

CodePudding user response：

You can use span to split the rest of the string into a leading word and the rest:

toLineItems s = let (w, rest) = span (not . (`elem` "\n ")) s
                 in mkWord w : toLineItems rest

not . (`elem` "\n ") is a function which returns True for all Chars which are not ' ' or '\n', and span takes as many as it can fulfilling that predicate and puts them in the first element of the tuple and puts the remaining ones in the other tuple element.

CodePudding user response：

A String is just a list of Char values. Wrap w in a list before passing it to mkWord.

toLineItems (w:rest) = mkWord [w] : toLineItems rest

CodePudding user response：

Here is the approach using only pattern matching:

toLineItems :: String -> String ->  [LineItem]
toLineItems (' ':rest) [] = Space : toLineItems rest
toLineItems (' ':rest) acc = mkWord (reverse acc) : Space : toLineItems rest
toLineItems ('\n':rest) [] = Newline : toLineItems rest
toLineItems ('\n':rest) acc = mkWord (reverse acc) : Newline : toLineItems rest
toLineItems [] [] = []
toLineItems [] acc = mkWord (reverse acc)
toLineItems (w:rest) acc = toLineItems rest (w:acc)

Here we introduce an accumulator and match on it.

CodePudding user response：

Thank you for posting a complete, self-contained example, so we could duplicate your problem. I want to point out a few things in your "background" code (the part that was already working) that most Haskellers would consider bad practice, and I also have a long explanation of how to attack these problems in general, if you don't have a span or break function to help.

As for the "bad practice" items, if you wrote this code, consider it some advice to up your Haskell game. If this code was given to you as part of an exercise, consider it a warning that not everyone who teaches Haskell has a firm grasp of best practices for programming Haskell. Here's my advice:

Don't define mkSpace, mkNewline, and mkWord synonyms for the constructors Space, Newline, and Word. This is not idiomatic and is likely to be considered bad practice by nearly all experienced Haskell programmers. In your own code, you don't use these consistently anyway: toLineItems uses the function mkWord to make a Word, but it uses Space and Newline directly to make those corresponding values. You should just use the Word constructor directly, too. It's called a "constructor" for a reason!
Don't write a Show instance that misrepresents the data structure being shown. In particular, with your Show instance in place the desired output of lineItemToStr "hello world" is:
```
ghci> lineItemToStr "hello world"
["hello"," ","world"]
```
If your code is buggy and codes spaces as Word " " instead of Space, the output will instead be:
```
ghci> lineItemToStr "hello world"
["hello"," ","world"]
```
If you can't tell the difference between the correct and incorrect outputs, well... that's the problem. Having a "pretty printing" function like lineItemToStr is great, but you should delete the Show instance and derive one instead (... deriving (Show, Eq)). That way, the correct output will be easy to verify:
```
ghci> toLineItems "hello world"
[Word "hello",Space,Word "world"]
```

With that advice out of the way, the accepted answer, using span or break, would be the usual way of writing this function in real code, but it's important to understand how to handle this sort of problem using a direct recursive solution. There are two main approaches. One involves post-processing the results of the recursive call, the other involves introducing a helper function with an "accumulator". Both are a little tricky to get right, but you should make sure they're both in your Haskell toolbox.

The post-processing solution works like this. Suppose we are trying to write the final case for toLineItems where we have already handled the space and newline cases. We pattern match on c:rest, where c is some non-space, non-newline character:

toLineItems (c:rest) = ...

and we know that we could make the recursive call toLineItems rest. Let's imagine we're at the beginning of the "hello world" example. Then, we have:

c = 'h'
rest = "ello world"

and, assuming our function toLineItems actually works, the recursive call would give:

toLineItems rest = [Word "ello",Space,Word "world"]

How can we "fix" this answer to give us the correct result? Well, we can use a case construct to extract the first Word "ello" from the recursive call and add the 'h' character:

toLineItems (c:rest) = case toLineItems rest of
    (Word w : moreitems) -> Word (c:w) : moreitems

Similarly, if we were parsing toLineItems "ello world" with the toLineItems (c:rest) case, we would have:

c = 'e'
rest = "llo world"
-- result of recursive call:
toLineItems rest = [Word "llo",Space,Word "world"]

and the above case would correctly fix up the result of the recursive call to yield:

Word (c:w) : moreitems = [Word "ello",Space,Word "world"]

That takes care of the tricky case of adding the parsed character 'c' to the rest of a word. To take care of the other cases, consider what happens for toLineItems "o world", where we have:

c = 'o'
rest = " world"
-- result of recursive call:
toLineItems rest = [Space,Word "world"]

The correct answer for toLineItems "o world" is:

[Word "o",Space,Word "world"]

so we need a case like:

toLineItems (c:rest) = case toLineItems rest of
    ...
    (Space : moreitems) -> Word [c] : Space : moreitems

Similarly, for Newline:

    (Newline : moreitems) -> Word [c] : Newline : moreitems

The remaining case handles the end of string:

    [] -> Word [c] : []

If you've got a good eye, you'll see that all three of these cases can be handled by a single case, giving the answer:

toLineItems (c:rest) = case toLineItems rest of
    (Word w : moreitems) -> Word (c:w) : moreitems
    moreitems -> Word [c] : moreitems

Try out the complete definition:

toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems (c:rest) = case toLineItems rest of
    (Word w : moreitems) -> Word (c:w) : moreitems
    moreitems -> Word [c] : moreitems

and you'll see it works just fine.

The accumulator solution works like this. The idea is to write a helper function that gets invoked at the start of a word and uses an extra parameter, an "accumulator", to keep track of the work done so far (i.e., the characters of the word read so far). It "loops" by making recursive calls that add to this accumulator, and when it identifies the end of the word, it returns the accumulator value as the final answer. After handling the space and newline cases, we can invoke this helper with an empty accumulator:

toLineItems str = readWord "" str   -- first arg is accumulator

The helper function should first check for the "end of word", being careful not to drop the space or newline that ends the word when it passes control back to toLineItems.

readWord acc (' ':rest) = Word acc : toLineItems (' ':rest)
readWord acc ('\n':rest) = Word acc : toLineItems ('\n':rest)
readWord acc [] = Word acc : toLineItems []

and otherwise add the next character to the accumulator and call itself:

readWord acc (c:rest) = readWord (acc    [c]) rest

This gives the solution:

toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems str = readWord "" str
  where
    readWord acc (' ':rest) = Word acc : toLineItems (' ':rest)
    readWord acc ('\n':rest) = Word acc : toLineItems ('\n':rest)
    readWord acc [] = Word acc : toLineItems []
    readWord acc (c:rest) = readWord (acc    [c]) rest

which also works fine. It is not the most efficient solution:

At the start of the word, the first character -- which is already known to not be a space or a newline from matching by toLineItems is re-checked by readWord.
Accumulating a list by adding characters to the end with acc [c] is very inefficient in Haskell.
The first three patterns in readWord involve a lot of unnecessary duplication, since they all do roughly the same thing -- return Word acc and parse the rest with toLineItems. This is less an issue for program efficiency and more an issue of violating the DRY principle, making review of the code inefficient.

Here's a solution that addresses these problems:

toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems (c:rest)= readWord [c] rest  -- we know `c` is not whitespace
  where
    -- note: accumulator is reversed, because repeated `c:acc` is more
    -- efficient than `acc    [c]`
    readWord acc (c:rest) | c `notElem` " \n" = readWord (c:acc) rest
    readWord acc rest = Word (reverse acc) : toLineItems rest