StringTokenizer to Scala Iterator-CodePudding

I am trying to convert java.util.StringTokenizer to Scala's Iterator and the following approach fails:

def toIterator(st: StringTokenizer): Iterator[String] =
   Iterator.continually(st.nextToken()).takeWhile(_ => st.hasMoreTokens()))

But this works:

def toIterator(st: StringTokenizer): Iterator[String] =
    Iterator.fill(st.countTokens())(st.nextToken())

You can see this in Scala console:

scala> Iterator("a b", "c d").map(new java.util.StringTokenizer(_)).flatMap(st => Iterator.continually(st.nextToken()).takeWhile(_ => st.hasMoreTokens())).toList
res1: List[String] = List(a, c)

scala> Iterator("a b", "c d").map(new java.util.StringTokenizer(_)).flatMap(st => Iterator.fill(st.countTokens())(st.nextToken())).toList
res2: List[String] = List(a, b, c, d)

As you can see res1 is incorrect and res2 is correct. What am I doing wrong? The first version should work and is preferred since it is 2x faster than second approach since it does not scan the string twice

CodePudding user response：

takeWhile is not intended to be used statefully. It should take a pure function that determines, solely based on the input, whether or not to continue.

Specifically, the iterator must produce the value before the takeWhile predicate gets called. Even though your function ignores the takeWhile argument, it still gets evaluated. So nextToken gets called and then we check for more tokens.

To be perfectly precise, in your "a b" case,

First, we call nextToken, which is what Iterator.continually does. There's a next token, so it returns "a".
Now, to determine if we should include the next token, we call your predicate with "a" as argument. Your predicate ignores "a" and calls hasMoreTokens. Our tokenizer has more tokens (namely, "b"), so it returns true. Continue.
Now we call nextToken again. This returns "b".
We need to determine if we should include this in our result, so our takeWhile predicate runs with "b" as argument. Our takeWhile predicate ignores its argument and calls hasMoreTokens. We have no more tokens anymore, so this returns false. We should not include this element.
takeWhile has returned false, so we stop at the last element for which it returned true. Our resulting list is List("a").

Since we're abusing a pure functional technique like takeWhile to be stateful, we get unintuitive results.

As much as it looks snazzy and clever to have a one-line solution, what you have is a stateful, imperative object that you want to adapt to the Iterator interface. Hiding that statefulness in a bunch of pure function calls is not a good idea, so we should just write our own subclass of Iterator and do it properly.

import java.util.StringTokenizer

final class StringTokenizerIterator(
  private val tokenizer: StringTokenizer
) extends Iterator[String] {

  def hasNext: Boolean = tokenizer.hasMoreTokens

  def next(): String = tokenizer.nextToken()

}

object Example {

  def toIterator(st: StringTokenizer): Iterator[String] =
    new StringTokenizerIterator(st)

  def main(args: Array[String]) = {
    println(Iterator("a b", "c d")
            .map(new java.util.StringTokenizer(_))
            .flatMap(toIterator(_))
            .toList)
  }

}

We're doing the same work you were doing calling the appropriate StringTokenizer functions, but we're doing it in a full class that encapsulates the state, rather than pretending the stateful part is not there. It's longer code, yes, but it should be longer. We don't want the messy parts of it to go unnoticed.