Counting words in an input string in C++ **with consideration for typos

Time:08-03

I've been looking for ways to count the number of words in a string, but specifically for strings that may contain typos (i.e. "_This_is_a___test" as opposed to "This_is_a_test"). Most of the pages I've looked at only handle single spaces.

This is actually my first time programming in C++, and I don't have much other programming experience to speak of (2 years of college in C and Java). Although what I have is functional, I'm also aware it's complex, and I'm wondering if there is a more efficient way to achieve the same results?

This is what I have currently. Before I run the string through numWords(), I run it through a trim function that removes leading whitespace, then check that there are still characters remaining.

int numWords(string str) {
    int count = 1;
    for (int i = 0; i < str.size(); i++) {
        if (str[i] == ' ' || str[i] == '\t' || str[i] == '\n') {
            bool repeat = true;
            int j = 1;
            while (j < (str.size() - i) && repeat) {
                if (str[i + j] != ' ' && str[i + j] != '\t' && str[i + j] != '\n') {
                    repeat = false;
                    i = i + j;
                    count++;
                }
                else
                    j++;
            }
        }
    }
    return count;
}

Also, I wrote mine to take a string argument, but most of the examples I've seen used (char* str) instead, which I wasn't sure how to use with my input string.

CodePudding user response:

You don't need all those stringstreams to count word boundaries:

#include <string>
#include <cctype>

int numWords(std::string str) 
{   
    bool space = true; // not in word
    int count = 0;
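    for(auto c:str){
        // cast to unsigned char: std::isspace is undefined for negative char values
        if(std::isspace(static_cast<unsigned char>(c)))space=true;
        else{
            if(space)++count; // first character of a new word
            space=false;
        }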
    }
    return count;
}
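
The question also asked how this would look with a char* parameter. Here is a minimal sketch, assuming a non-null, null-terminated string and using the same in-word/out-of-word flag as above. The name numWordsC is hypothetical and not part of the original answer; a std::string can be passed through c_str(), as the second overload does.

```cpp
#include <cctype>
#include <string>

// Hypothetical C-string variant of the flag-based counter above.
// Assumes `str` is non-null and null-terminated.
int numWordsC(const char* str)
{
    bool inSpace = true; // true while we are between words
    int count = 0;
    for (; *str != '\0'; ++str) {
        // Cast to unsigned char: std::isspace is undefined for negative char values.
        if (std::isspace(static_cast<unsigned char>(*str))) {
            inSpace = true;
        } else {
            if (inSpace) ++count; // first character of a new word
            inSpace = false;
        }
    }
    return count;
}

// Convenience overload so callers holding a std::string can use the same code.
// Note: counting stops at the first embedded '\0', if the string contains one.
int numWordsC(const std::string& str)
{
    return numWordsC(str.c_str());
}
```

Calling numWordsC("  This    is a test  ") should give 4, and an empty or all-whitespace string should give 0, so no separate trim step is needed beforehand.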

CodePudding user response:

One solution is to utilize std::istringstream to count the number of words and to skip over spaces automatically.

#include <sstream>
#include <string>
#include <iostream>

int numWords(std::string str) 
{
    int count = 0;
    std::istringstream strm(str);
    std::string word;
    while (strm >> word)
        ++count;
    return count;
}

int main()
{
   std::cout << numWords("  This    is a test  ");
}

Output:

4

That said, as mentioned, std::istringstream is heavier in terms of performance than writing your own loop.

CodePudding user response:

You can do it easily with regex:

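#include <iterator>
#include <regex>
#include <string>

int numWords(std::string str) 
{
    std::regex re("\\w+"); // one or more word characters (letters, digits, underscore)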
    return std::distance(
        std::sregex_iterator(str.begin(), str.end(), re),
        std::sregex_iterator()
    );
}
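
One caveat: \w+ matches runs of letters, digits and underscores, so punctuation splits a token and "don't" is counted as two words. If words are meant to be whitespace-delimited, as in the question, \S+ may be closer to the intent. The sketch below uses a hypothetical helper, countMatches, so the two patterns can be compared side by side.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts non-overlapping matches of `pattern` in `str`
// using the default ECMAScript regex grammar.
std::ptrdiff_t countMatches(const std::string& str, const std::string& pattern)
{
    const std::regex re(pattern);
    return std::distance(
        std::sregex_iterator(str.begin(), str.end(), re),
        std::sregex_iterator());
}
```

For the input "  don't  stop ", countMatches with "\\w+" should return 3 (don, t, stop), while "\\S+" should return 2.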

CodePudding user response:

Sam's comment made me write a function that does not allocate strings for the words; it just creates string_views on the input string.

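// note: building a std::string_view from an iterator pair (see emplace_back below) requires C++20
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>
#include <string_view>
#include <iostream>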

std::vector<std::string_view> get_words(const std::string& input)
{
    std::vector<std::string_view> words;

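    // the first word begins at the first alpha character
    // (the lambda takes unsigned char because std::isalpha is undefined for negative char values)
    auto begin_of_word = std::find_if(input.begin(), input.end(), [](unsigned char c) { return std::isalpha(c) != 0; });
    // start scanning at the first word, so leading non-alpha characters never yield a reversed range
    auto end_of_word = begin_of_word;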
    auto end_of_input = input.end();

    // parse the whole string
    while (end_of_word != end_of_input)
    {
        // as long as you see text characters, move end_of_word one forward
        while ((end_of_word != end_of_input) && std::isalpha(static_cast<unsigned char>(*end_of_word))) ++end_of_word;

        // create a string view from begin of word to end of word.
        // no new string memory will be allocated
        // std::vector will do some dynamic memory allocation to store string_view (metadata of word positions) 
        words.emplace_back(begin_of_word, end_of_word);

        // then skip all non-alpha characters.
        while ((end_of_word != end_of_input) && !std::isalpha(static_cast<unsigned char>(*end_of_word))) ++end_of_word;

        // and if we haven't reached the end then we are at the beginning of a new word.
        if ( end_of_word != input.end()) begin_of_word = end_of_word;
    }

    return words;
}

int main()
{
    std::string input{ "This, this   is a   test!" };
    auto words = get_words(input);

    for (const auto& word : words)
    {
        std::cout << word << "\n";
    }

    return 0;
}

CodePudding user response:

You can use the standard function std::distance with std::istringstream in the following way:

#include <iostream>
#include <sstream>
#include <string>
#include <iterator>

int main()
{
    std::string s( " This is a   test" );
    std::istringstream iss( s );

    auto count = std::distance( std::istream_iterator<std::string>( iss ),
                                std::istream_iterator<std::string>() );

    std::cout << count << '\n';                                
}

The program output is

4

If you want, you can place the call to std::distance in a separate function, like this:

#include <iostream>
#include <sstream>
#include <string>
#include <iterator>

size_t numWords( const std::string &s )
{
    std::istringstream iss( s );
    return std::distance( std::istream_iterator<std::string>( iss ),
                          std::istream_iterator<std::string>() );
}

int main()
{
    std::string s( " This is a   test" );

    std::cout << numWords( s ) << '\n';                                
}

If the separators can include characters other than whitespace, for example punctuation, then you should use the find_first_of and find_first_not_of member functions of std::string or std::string_view.

Here is a demonstration program.

#include <iostream>
#include <string>
#include <string_view>

size_t numWords( const std::string_view s, std::string_view delim = " \t" )
{
    size_t count = 0;

    for ( std::string_view::size_type pos = 0;
          ( pos = s.find_first_not_of( delim, pos ) ) != std::string_view::npos;
            pos = s.find_first_of( delim, pos ) )
    {
        ++count;
    }

    return count;        
}
 

int main()
{
    std::string s( "Is it a   test ? Yes ! Now we will run it ..." );

    std::cout << numWords( s, " \t!?.," ) << '\n';                                
}

The program output is

10
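
When tuning the delimiter set, it can help to see which tokens are actually being counted. Here is a minimal sketch of the same find_first_not_of / find_first_of loop that collects the words instead of counting them. The name splitWords is hypothetical, and the returned views point into the input, so the input must outlive the result.

```cpp
#include <string_view>
#include <vector>

// Hypothetical variant of the counting loop above that returns the words themselves.
// No characters are copied: each string_view refers to a slice of `s`.
std::vector<std::string_view> splitWords(std::string_view s, std::string_view delim = " \t")
{
    std::vector<std::string_view> words;
    std::string_view::size_type pos = 0;
    // find_first_not_of returns npos when pos is past the end or only delimiters remain
    while ((pos = s.find_first_not_of(delim, pos)) != std::string_view::npos) {
        const auto end = s.find_first_of(delim, pos); // npos if the word runs to the end of s
        words.push_back(s.substr(pos, end - pos));    // substr clamps the length to what remains
        pos = end;
    }
    return words;
}
```

The size of the returned vector should equal what numWords reports for the same string and delimiters.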