I have the following datatypes:
-- 1.1
data LineItem = Space | Newline | Word String
deriving (Eq)
-- 1.2
mkSpace :: LineItem
mkSpace = Space
mkNewline :: LineItem
mkNewline = Newline
mkWord :: String -> LineItem
mkWord = Word
-- 1.3
lineItemToStr :: LineItem -> String
lineItemToStr Space = show " "
lineItemToStr Newline = show "\n"
lineItemToStr (Word s1) = show s1
instance Show LineItem where
show = lineItemToStr
Given a String I have to convert it to a list where each element of the list is one of the datatypes. What I have so far is:
toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems (w:rest) = mkWord w : toLineItems rest -- this one is obviously wrong
My confusion so far is how I handle the occasion where the head is a char. I would think that I accumulate the chars until the head is no longer a letter and then append it as one word. But I'm not quite sure how you would go around matching that pattern in Haskell.
If anyone could point me in the right direction I would appreciate it.
CodePudding user response:
You can use span
to split the rest of the string into a leading word and the rest:
toLineItems s = let (w, rest) = span (not . (`elem` "\n ")) s
in mkWord w : toLineItems rest
not . (`elem` "\n ")
is a function which returns True
for all Char
s which are not ' '
or '\n'
,
and span
takes as many as it can fulfilling that predicate and puts them in the first element of the tuple and puts the remaining ones in the other tuple element.
CodePudding user response:
A String
is just a list of Char
values. Wrap w
in a list before passing it to mkWord
.
toLineItems (w:rest) = mkWord [w] : toLineItems rest
CodePudding user response:
Here is the approach using only pattern matching:
toLineItems :: String -> String -> [LineItem]
toLineItems (' ':rest) [] = Space : toLineItems rest
toLineItems (' ':rest) acc = mkWord (reverse acc) : Space : toLineItems rest
toLineItems ('\n':rest) [] = Newline : toLineItems rest
toLineItems ('\n':rest) acc = mkWord (reverse acc) : Newline : toLineItems rest
toLineItems [] [] = []
toLineItems [] acc = mkWord (reverse acc)
toLineItems (w:rest) acc = toLineItems rest (w:acc)
Here we introduce an accumulator and match on it.
CodePudding user response:
Thank you for posting a complete, self-contained example, so we could duplicate your problem. I want to point out a few things in your "background" code (the part that was already working) that most Haskellers would consider bad practice, and I also have a long explanation of how to attack these problems in general, if you don't have a span
or break
function to help.
As for the "bad practice" items, if you wrote this code, consider it some advice to up your Haskell game. If this code was given to you as part of an exercise, consider it a warning that not everyone who teaches Haskell has a firm grasp of best practices for programming Haskell. Here's my advice:
Don't define
mkSpace
,mkNewline
, andmkWord
synonyms for the constructorsSpace
,Newline
, andWord
. This is not idiomatic and is likely to be considered bad practice by nearly all experienced Haskell programmers. In your own code, you don't use these consistently anyway:toLineItems
uses the functionmkWord
to make aWord
, but it usesSpace
andNewline
directly to make those corresponding values. You should just use theWord
constructor directly, too. It's called a "constructor" for a reason!Don't write a
Show
instance that misrepresents the data structure being shown. In particular, with yourShow
instance in place the desired output oflineItemToStr "hello world"
is:ghci> lineItemToStr "hello world" ["hello"," ","world"]
If your code is buggy and codes spaces as
Word " "
instead ofSpace
, the output will instead be:ghci> lineItemToStr "hello world" ["hello"," ","world"]
If you can't tell the difference between the correct and incorrect outputs, well... that's the problem. Having a "pretty printing" function like
lineItemToStr
is great, but you should delete theShow
instance and derive one instead (... deriving (Show, Eq)
). That way, the correct output will be easy to verify:ghci> toLineItems "hello world" [Word "hello",Space,Word "world"]
With that advice out of the way, the accepted answer, using span
or break
, would be the usual way of writing this function in real code, but it's important to understand how to handle this sort of problem using a direct recursive solution. There are two main approaches. One involves post-processing the results of the recursive call, the other involves introducing a helper function with an "accumulator". Both are a little tricky to get right, but you should make sure they're both in your Haskell toolbox.
The post-processing solution works like this. Suppose we are trying to write the final case for toLineItems
where we have already handled the space and newline cases. We pattern match on c:rest
, where c
is some non-space, non-newline character:
toLineItems (c:rest) = ...
and we know that we could make the recursive call toLineItems rest
. Let's imagine we're at the beginning of the "hello world"
example. Then, we have:
c = 'h'
rest = "ello world"
and, assuming our function toLineItems
actually works, the recursive call would give:
toLineItems rest = [Word "ello",Space,Word "world"]
How can we "fix" this answer to give us the correct result? Well, we can use a case
construct to extract the first Word "ello"
from the recursive call and add the 'h'
character:
toLineItems (c:rest) = case toLineItems rest of
(Word w : moreitems) -> Word (c:w) : moreitems
Similarly, if we were parsing toLineItems "ello world"
with the toLineItems (c:rest)
case, we would have:
c = 'e'
rest = "llo world"
-- result of recursive call:
toLineItems rest = [Word "llo",Space,Word "world"]
and the above case
would correctly fix up the result of the recursive call to yield:
Word (c:w) : moreitems = [Word "ello",Space,Word "world"]
That takes care of the tricky case of adding the parsed character 'c' to the rest of a word. To take care of the other cases, consider what happens for toLineItems "o world"
, where we have:
c = 'o'
rest = " world"
-- result of recursive call:
toLineItems rest = [Space,Word "world"]
The correct answer for toLineItems "o world"
is:
[Word "o",Space,Word "world"]
so we need a case like:
toLineItems (c:rest) = case toLineItems rest of
...
(Space : moreitems) -> Word [c] : Space : moreitems
Similarly, for Newline
:
(Newline : moreitems) -> Word [c] : Newline : moreitems
The remaining case handles the end of string:
[] -> Word [c] : []
If you've got a good eye, you'll see that all three of these cases can be handled by a single case, giving the answer:
toLineItems (c:rest) = case toLineItems rest of
(Word w : moreitems) -> Word (c:w) : moreitems
moreitems -> Word [c] : moreitems
Try out the complete definition:
toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems (c:rest) = case toLineItems rest of
(Word w : moreitems) -> Word (c:w) : moreitems
moreitems -> Word [c] : moreitems
and you'll see it works just fine.
The accumulator solution works like this. The idea is to write a helper function that gets invoked at the start of a word and uses an extra parameter, an "accumulator", to keep track of the work done so far (i.e., the characters of the word read so far). It "loops" by making recursive calls that add to this accumulator, and when it identifies the end of the word, it returns the accumulator value as the final answer. After handling the space and newline cases, we can invoke this helper with an empty accumulator:
toLineItems str = readWord "" str -- first arg is accumulator
The helper function should first check for the "end of word", being careful not to drop the space or newline that ends the word when it passes control back to toLineItems
.
readWord acc (' ':rest) = Word acc : toLineItems (' ':rest)
readWord acc ('\n':rest) = Word acc : toLineItems ('\n':rest)
readWord acc [] = Word acc : toLineItems []
and otherwise add the next character to the accumulator and call itself:
readWord acc (c:rest) = readWord (acc [c]) rest
This gives the solution:
toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems str = readWord "" str
where
readWord acc (' ':rest) = Word acc : toLineItems (' ':rest)
readWord acc ('\n':rest) = Word acc : toLineItems ('\n':rest)
readWord acc [] = Word acc : toLineItems []
readWord acc (c:rest) = readWord (acc [c]) rest
which also works fine. It is not the most efficient solution:
- At the start of the word, the first character -- which is already known to not be a space or a newline from matching by
toLineItems
is re-checked byreadWord
. - Accumulating a list by adding characters to the end with
acc [c]
is very inefficient in Haskell. - The first three patterns in
readWord
involve a lot of unnecessary duplication, since they all do roughly the same thing -- returnWord acc
and parse the rest withtoLineItems
. This is less an issue for program efficiency and more an issue of violating the DRY principle, making review of the code inefficient.
Here's a solution that addresses these problems:
toLineItems :: String -> [LineItem]
toLineItems (' ':rest) = Space : toLineItems rest
toLineItems ('\n':rest) = Newline : toLineItems rest
toLineItems "" = []
toLineItems (c:rest)= readWord [c] rest -- we know `c` is not whitespace
where
-- note: accumulator is reversed, because repeated `c:acc` is more
-- efficient than `acc [c]`
readWord acc (c:rest) | c `notElem` " \n" = readWord (c:acc) rest
readWord acc rest = Word (reverse acc) : toLineItems rest