Remove duplicate words from a sentence without using regex in C++?

Time:01-14

I would like to remove duplicate words from a sentence without using regex in C++. For example:

Input: Welcome to to the world of of digital computers computers.

Expected Output: Welcome to the world of digital computers.

Actual output: Welcome to the world of digital computers computers.

How can I manage that in the code below?

void removeDupWord(string str)
{
    // Used to split string around spaces.
    istringstream ss(str);

    // To store individual visited words
    unordered_set<string> hsh;

    // Traverse through all words
    do
    {
        string word;
        ss >> word;

        // If current word is not seen before.
        while (hsh.find(word) == hsh.end()) {
            cout << word << " ";
            hsh.insert(word);
        }

    } while (ss);
}

CodePudding user response:

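std::set and std::unordered_set can't store duplicate elements. Instead of explicitly calling find on the set, you can use the value returned by insert: a pair whose bool member tells you whether the insertion actually took place. Note that this drops every repeated word, not only adjacent repeats, and that the trailing punctuation has to be stripped before the comparison so that "computers." matches "computers".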

#include <cctype>
#include <iostream>
#include <unordered_set>
#include <string>
#include <sstream>

void removeDupWord(std::string str)
{
    // Used to split string around spaces.
    std::istringstream ss(str);

    // To store individual visited words
    std::unordered_set<std::string> hsh;

    std::string word;
    // Traverse through all words
    while(ss >> word)
    {
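        // Remove punctuation char from end (the cast avoids undefined
        // behaviour when char is signed and the value is negative)
        if(std::ispunct(static_cast<unsigned char>(word.back())))
            word.pop_back();

        // insert returns {iterator, bool}; structured bindings need C++17
        const auto& [itr, result] = hsh.insert(word);
        if( result )
            std::cout << word << ' ';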
    }

    // output
    // Welcome to the world of digital computers
}

int main()
{
    std::string input {"Welcome to to the world of of digital computers computers."};
    removeDupWord(input);
}
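If your compiler does not support C++17 structured bindings, the same check can be written with the .second member of the pair that insert returns. Here is a minimal sketch of that idea; the helper name uniqueWords is made up for illustration, and it deliberately does no punctuation handling, so "computers" and "computers." would still count as different words.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not from the original answer): returns the words of
// `text` in first-seen order, dropping every later repeat, adjacent or not.
std::vector<std::string> uniqueWords(const std::string& text)
{
    std::istringstream ss(text);
    std::unordered_set<std::string> seen;
    std::vector<std::string> out;

    std::string word;
    while (ss >> word) {
        // insert() returns std::pair<iterator, bool>; .second is true only
        // when the word was not already in the set.
        if (seen.insert(word).second)
            out.push_back(word);
    }
    return out;
}
```

Returning the words instead of printing them makes it easier to decide afterwards how to join them and where to put the final punctuation.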

CodePudding user response:

Following from the discussion in the comments, I tinkered a bit with a way to accomplish your goal of removing duplicate words (only if adjacent duplicates) and preserving the assumed punctuation at the end of the sentence. With tasks such as these, it's often easier to simply use a loop and a couple of variables to preserve the last word written and punctuation than trying to use some type of container to store and compare words.

The following does just that. It loops over each word, checking if the last character is punctuation with std::ispunct(), preserving the punctuation for appending later. To output the words a simple first state variable is used to determine the first output, allowing the space to be inserted before the next word rather than afterwards. That allows any punctuation removed for word comparison to be appended to the output as needed.

The program below takes the sentence to process as its first argument (it could just as easily read it from stdin -- up to you) and outputs the corrected sentence, e.g.

#include <iostream>
#include <string>
#include <sstream>
#include <cctype>

void removeDupWord(std::string str)
{
    bool first = true;            /* flag indicating 1st word output */
    std::istringstream ss(str);   /* initialize stringstream with str */
    std::string word, last {};
   
    while(ss >> word)     /* iterate over each word */
    {
        char c = 0;       /* if last char is punctuation, save in c */
        
        /* remove any punctuation from word */
        if (std::ispunct((unsigned char)word.back())) {
            c = word.back();
            word.pop_back();
        }
        
        if (word != last) {             /* if not duplicate of last word */
            if (!first) {               /* if not 1st word */
                std::cout << ' ';       /* prepend space */
            }
            std::cout << word;          /* output word */
            first = false;              /* set first flag false */
        }
        if (c) {  /* restore punctuation for new and duplicate words alike */
            std::cout << c;
        }
        last = word;      /* update last to current word */
    }
    
    std::cout.put('\n');  /* tidy up with newline */
}

int main(int argc, char **argv)
{
    if (argc != 2) {  /* ensure argument provided */
        return 1;
    }
    std::string input { argv[1] };  /* initialize string with argument */
    removeDupWord (input);          /* remove adjacent duplicate words */
}

Example Use/Output

With your example sentence:

$ ./bin/unordered_set_rm_dup_words "Welcome to to the world of of digital computers computers."
Welcome to the world of digital computers.

And the longer example from the discussion with @PeteBecker in the comments:

$ ./bin/unordered_set_rm_dup_words "If I go go to Mars I would need to turn left left to go to Venus Venus."
If I go to Mars I would need to turn left to go to Venus.

It's up to you whether you simply compare against the previous word or use a std::unordered_set (or similar) to store and compare the words. That, of course, turns on whether you want to remove ALL duplicate words or just adjacent duplicates. It's hard to tell which you need from your example -- but in my second example above you can see the downside of removing ALL rather than just adjacent duplicates.
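To make that difference concrete, here is a rough sketch that puts the two strategies side by side. Both function names are hypothetical, and punctuation is ignored to keep the comparison short, so it assumes the input has none.

```cpp
#include <sstream>
#include <string>
#include <unordered_set>

// Hypothetical helper: drop a word only when it repeats the word just before it.
std::string dropAdjacentDuplicates(const std::string& text)
{
    std::istringstream ss(text);
    std::string word, last, out;
    while (ss >> word) {
        if (word != last) {
            if (!out.empty()) out += ' ';   // space before every word but the first
            out += word;
        }
        last = word;                        // remember the word for the next comparison
    }
    return out;
}

// Hypothetical helper: drop a word whenever it has appeared anywhere earlier.
std::string dropAllDuplicates(const std::string& text)
{
    std::istringstream ss(text);
    std::unordered_set<std::string> seen;
    std::string word, out;
    while (ss >> word) {
        if (seen.insert(word).second) {     // true only on the first occurrence
            if (!out.empty()) out += ' ';
            out += word;
        }
    }
    return out;
}
```

For "If I go go to Mars I would need to turn left left to go to Venus Venus", the first function gives "If I go to Mars I would need to turn left to go to Venus", while the second gives "If I go to Mars would need turn left Venus", which no longer reads as a sentence.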

Any additional punctuation handling is left as an exercise for you. Let me know if you have questions.
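As one possible starting point for that exercise, a token can be split into its word part and its whole run of trailing punctuation, so that something like "Venus?!" is compared as "Venus" and the "?!" can be written back afterwards. This is only a sketch; splitTrailingPunct is a made-up name and leading punctuation (quotes, brackets) is not handled.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical helper: splits "word?!" into {"word", "?!"}.
// A token without trailing punctuation yields an empty second part.
std::pair<std::string, std::string> splitTrailingPunct(const std::string& token)
{
    std::size_t end = token.size();
    // Walk back over every trailing punctuation character; the cast keeps
    // std::ispunct well-defined when char is signed.
    while (end > 0 && std::ispunct(static_cast<unsigned char>(token[end - 1])))
        --end;
    return { token.substr(0, end), token.substr(end) };
}
```

In the loop above you would compare the first part against last and append the second part to the output after the word.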
