Improvement on len(sentence.split()) for finding the number of words in a sentence


The standard approach to counting the number of words in a sentence in Python seems to be to use len(sentence.split(<separator character(s)>)). However, this has always struck me as inefficient: you are creating an entire list just to count the number of separator characters not preceded by another separator character.
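
To make the concern concrete, here is a rough illustration of the throwaway list (the sample string and sizes are arbitrary, just for demonstration):

import sys

s = "the quick brown fox jumps over the lazy dog " * 1_000
words = s.split()            # builds a list of ~9,000 small strings
print(len(words))            # the count we actually want
print(sys.getsizeof(words))  # bytes used by the list object itself,
                             # not counting the string objects inside it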

I know it would be fairly simple to write a function to do the same job but

a) this is a bit clunky to do wherever you need to do it and

b) because the function would be written in Python rather than being a built-in C function, it probably wouldn't be much more efficient either.

So, can anyone think of an efficient one-liner to take the place of len(string.split())?

edit: To clarify, string.count(" ") + 1 does not fulfil the same purpose as len(string.split(" ")). count counts leading, trailing and repeated separator characters, so it only works in the idealised case where the string starts with a word, ends with a word and has no double spaces; it is not a good general method for counting words.
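
For example (easy to verify in a REPL):

s = "  hello   world  "
print(len(s.split()))    # 2 -- split() collapses runs of whitespace
print(s.count(" ") + 1)  # 8 -- every single space is counted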

CodePudding user response:

In terms of time, len(string.split()) would be hard, if not impossible, to beat. But if space is a concern, it might be possible to improve on it.

Let's start by taking a look at the built-in split method in the CPython source:

i = j = 0;
while (maxcount-- > 0) {
    while (i < str_len && STRINGLIB_ISSPACE(str[i]))
        i++;
    if (i == str_len) break;
    j = i; i++;
    while (i < str_len && !STRINGLIB_ISSPACE(str[i]))
        i++;
#if !STRINGLIB_MUTABLE
    if (j == 0 && i == str_len && STRINGLIB_CHECK_EXACT(str_obj)) {
        /* No whitespace in str_obj, so just use it as list[0] */
        Py_INCREF(str_obj);
        PyList_SET_ITEM(list, 0, (PyObject *)str_obj);
        count++;
        break;
    }
#endif
    SPLIT_ADD(str, j, i);
}

maxcount is a parameter that limits how many times the string will be split. If it is not set, there is no limit.
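
In Python this surfaces as the maxsplit argument, for example:

s = "one two three four"
print(s.split())         # ['one', 'two', 'three', 'four']
print(s.split(None, 2))  # ['one', 'two', 'three four'] -- at most 2 splits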

The function loops over the string looking for whitespace and splits the words into a list accordingly. The time complexity is O(n), and the space complexity is also O(n), since it needs auxiliary space for a list that grows with the size of the string.

In my testing, an improvement in space (but not time) can be made by writing a function that loops over the string counting spaces. My function looked like this:

def count_words(s):
    space_prev = True   # treat the start of the string as if it followed a space
    word_count = 0
    index = 0

    while index < len(s):
        if s[index] == ' ' and not space_prev:
            word_count += 1
            space_prev = True
        elif space_prev and s[index] != ' ':
            space_prev = False
        index += 1

    if space_prev:
        return word_count

    return word_count + 1   # the string ended mid-word, so count that last word

This function has a time complexity of O(n) and a space complexity of O(1). However, because it is written in Python, it runs slower than the built-in C function, despite having the same algorithmic runtime and using less space.

Here are my test timings (the first number is time, the second is memory):

String with 69 words:
---------------------
builtin: 0.0009369159999999994, 11.567104 mb
count_words: 0.006606459, 11.583488 mb

String with 101 words:
----------------------
builtin: 0.0009850830000000012, 11.517952 mb
count_words: 0.009976667000000002, 11.567104 mb

String with 1,010 words:
------------------------
builtin: 0.007890958999999996, 11.632640 mb
count_words: 0.101723584, 11.567104 mb

String with 10,100 words:
-------------------------
builtin: 0.044037375, 11.911168 mb
count_words: 1.026273333, 11.747328 mb

In the end the space saving was quite negligible as well, so unless you have a huge number of words to count it's not worth the effort.
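
If you do want to avoid materialising the list without dropping into a slow Python loop, one option (not included in the timings above, so treat it as a sketch) is to count regex matches lazily; the scanning stays in C and no list of words is ever built:

import re

def count_words_re(s):
    # Each match of \S+ is one run of non-whitespace characters, i.e. one word.
    # finditer yields the matches one at a time instead of building a list.
    return sum(1 for _ in re.finditer(r"\S+", s))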

CodePudding user response:

As mentioned in a comment, you can use

string.count(" ") + 1

which is, according to a quick test I just did, faster. You will, however, quickly run into problems if there are consecutive spaces ("I  am" will say 3 words) or the string has leading or trailing spaces (" I am " would count 4 words). Long story short, split() is optimized to accurately separate the words, so it will deal with a lot of headaches for you, even if it is slower.

Edit: I did some further testing

Here is some benchmarking data:

from timeit import timeit

string = 'During a combat medical training class, the topic was blast injuries. At one point, our very intimidating instructor pointed at me and said, “There’s been a jeep explosion. What would you do if you came upon an injured man with a steering wheel embedded in his chest?” Nervous and unsure, I blurted out, “Drive him to the hospital?” For some reason, the rest of the room found this hilarious.'

def builtin_split():
    return len(string.split())

def builtin_count():
    return string.count(" ") + 1

def improved_count():
    # Takes care of leading, trailing and double spaces.
    # Doesn't take care of triple and longer
    stripped_string = string.strip()
    return stripped_string.count(" ") + 1 - stripped_string.count("  ")

def custom_implementation():
    # Only uses spaces as separators
    s = string.strip()
    length = len(s)
    words = 0
    i = 0
    while i < length:
        while i < length and s[i] != " ":
            i += 1
        while i < length and s[i] == " ":
            i += 1
        words += 1
    
    return words


print("split:", timeit(builtin_split))
print("count:", timeit(builtin_count))
print("improved count:", timeit(improved_count))
print("custom implementation:", timeit(custom_implementation))

Output:

split: 4.68428400799985
count: 1.250697097000284
improved count: 2.0560090240001045
custom implementation: 123.93436312599988

Using just the simple count is around 3.5x faster but, as mentioned, unreliable. When you improve count to take care of leading and trailing spaces, as well as double spaces, it is still more than 2x faster than using split. That would probably cover the vast majority of count's errors, but it would still give the wrong result on a string with runs of three or more spaces ("I   am"). I would say that is generally still nowhere near worth the tiny performance boost. As expected, a custom implementation in Python is immeasurably slower.
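
If you want the count-based approach to survive space runs of any length, one possibility (a sketch only, not benchmarked above) is to count the separator runs rather than the individual spaces:

import re

def count_words_runs(s):
    s = s.strip()
    if not s:
        return 0
    # Every run of one or more spaces separates exactly two words,
    # so the word count is the number of runs plus one.
    return sum(1 for _ in re.finditer(r" +", s)) + 1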
