Word count in Haskell

Time:07-07

I'm working on this exercise:

Given a phrase, count the occurrences of each word in that phrase.

For the purposes of this exercise you can expect that a word will always be one of:

- A number composed of one or more ASCII digits (ie "0" or "1234") OR
- A simple word composed of one or more ASCII letters (ie "a" or "they") OR
- A contraction of two simple words joined by a single apostrophe (ie "it's" or "they're")

When counting words you can assume the following rules:

- The count is case insensitive (ie "You", "you", and "YOU" are 3 uses of the same word)
- The count is unordered; the tests will ignore how words and counts are ordered
- Other than the apostrophe in a contraction all forms of punctuation are ignored
- The words can be separated by any form of whitespace (ie "\t", "\n", " ")

For example, for the phrase "That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled. the count would be:

that's: 1
the: 2
password: 2
123: 1
cried: 1
special: 1
agent: 1
so: 1
i: 1
fled: 1

My code:

module WordCount (wordCount) where

import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA as R

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    ys <- words xs
    let zs = R.getAllTextMatches (ys =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map (C.toLower) w | w <- zs]
    return (head g, length g)

But it fails on the input "one fish two fish red fish blue fish". It outputs one count for each word, even the repeated ones, as if the sort and group aren't doing anything. Why?

I've read this answer, which basically does the same thing in a more advanced way using Control.Arrow.
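
For context, the linked answer is not reproduced here. A common Control.Arrow idiom for this kind of counting looks roughly like the sketch below; countWords is a hypothetical name, and the input is assumed to be already split into words.

```haskell
import Control.Arrow ((&&&))
import qualified Data.Char as C
import qualified Data.List as L

-- Hypothetical helper: count already-tokenized words, ignoring case.
-- For plain functions, (f &&& g) x is (f x, g x), so `head &&& length`
-- turns each group into (representative word, group size).
-- `head` is safe here because L.group never produces an empty group.
countWords :: [String] -> [(String, Int)]
countWords = map (head &&& length) . L.group . L.sort . map (map C.toLower)
```

With this definition, countWords ["One", "fish", "two", "FISH"] evaluates to [("fish",2),("one",1),("two",1)].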

CodePudding user response:

You don't need to use words to split the line; the regex should do the splitting on its own:

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    let zs = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- zs]
    return (head g, length g)

CodePudding user response:

wordCount xs =
  do
    ys <- words xs
    let zs = R.getAllTextMatches (ys =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map (C.toLower) w | w <- zs]
    return (head g, length g)

You’re splitting the input xs into words by whitespace using words, then iterating over them in the list monad with the binding statement ys <- …. Everything after that line runs once per word: you split each word into subwords using the regular expression (of which there happens to be only one match in your example), then sort and group that one-element list by itself. Each group therefore has length 1, and the per-word results are simply concatenated.
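
To make the list-monad behaviour concrete, here is a minimal sketch with the regex step left out. The names perWord, perWord' and wholeList are hypothetical, and the one-element list [w] stands in for the single regex match per word.

```haskell
import qualified Data.List as L

-- Same shape as the code in the question: the bind on `words xs`
-- makes everything below it run once per word.
perWord :: String -> [(String, Int)]
perWord xs = do
  w <- words xs                 -- one iteration per word
  g <- L.group (L.sort [w])     -- sorts and groups a one-element list
  return (head g, length g)

-- The same function with the do-notation written out as concatMap.
perWord' :: String -> [(String, Int)]
perWord' xs =
  concatMap
    (\w -> map (\g -> (head g, length g)) (L.group (L.sort [w])))
    (words xs)

-- Grouping the whole word list at once, which is what was intended.
wholeList :: String -> [(String, Int)]
wholeList xs = do
  g <- L.group (L.sort (words xs))
  return (head g, length g)
```

For the input "one fish two fish", perWord and perWord' both give [("one",1),("fish",1),("two",1),("fish",1)], while wholeList gives [("fish",2),("one",1),("two",1)].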

I believe you can essentially just delete the initial call to words:

wordCount xs =
  do
    let ys = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- ys]
    return (head g, length g)
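
If you would rather not depend on how the regex library interprets escapes such as \d and \b, the tokenizing can also be done with plain list functions. The following is only a sketch of that alternative, not the approach from the question; tokenize is a hypothetical helper, and it assumes that apostrophes at the edge of a word are quotation marks rather than part of a contraction.

```haskell
import qualified Data.Char as C
import qualified Data.List as L

-- Hypothetical helper: split a phrase into lower-case words.
-- ASCII letters and digits are kept (lower-cased), apostrophes are kept,
-- and every other character is treated as a separator.
tokenize :: String -> [String]
tokenize = filter (not . null) . map trimQuotes . words . map normalize
  where
    normalize c
      | C.isAscii c && C.isAlphaNum c = C.toLower c
      | c == '\''                     = c
      | otherwise                     = ' '
    -- Assumption: a leading or trailing apostrophe is a quote mark,
    -- so 'password becomes password but that's is left alone.
    trimQuotes = L.dropWhileEnd (== '\'') . dropWhile (== '\'')

-- Sort and group the whole token list once, then count each group.
wordCount :: String -> [(String, Int)]
wordCount s = [(head g, length g) | g <- L.group (L.sort (tokenize s))]
```

On "one fish two fish red fish blue fish" this gives [("blue",1),("fish",4),("one",1),("red",1),("two",1)].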