Is there any way to speed up the permutation creation for an anagram solver?

I'm currently trying to make a very fast anagram solver, and right now it's bottlenecked by the creation of the permutations. Is there another way to write the whole program, or a way to optimize the permutation creation? Here's my code:

#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <unordered_set>
#include <cctype>   // ::tolower
#include <ctime>    // clock(), CLOCKS_PER_SEC
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/post.hpp>



void get_permutations(std::string s, std::vector<std::string> &permutations)
{
    std::sort(s.begin(), s.end());
    do
    {
        permutations.push_back(s);
    } while (std::next_permutation(s.begin(), s.end()));
}


void load_file(std::unordered_set<std::string> &dictionary, std::string filename)
{
    std::ifstream words(filename);
    std::string element;
    while (words >> element)
    {
        std::transform(element.begin(), element.end(), element.begin(), ::tolower);
        dictionary.insert(element);
    }
}

void print_valid(const std::unordered_set<std::string>& dictionary, const std::vector<std::string>::const_iterator start, const std::vector<std::string>::const_iterator stop)
{
    for (auto iter = start; iter != stop; iter++)
    {

        if (dictionary.contains(*iter) == true)
        {
            std::cout << *iter << "\n";
        }
    }
}

int main()
{
    const std::string s = "asdfghjklq";
    std::vector<std::string> permutations;
    boost::asio::thread_pool pool(2);
    std::cout << "Loading english dictionary\n";
    std::unordered_set<std::string> dictionary;
    load_file(dictionary, "words");
    std::cout << "Done\n";

    //std::cout << "Enter the anagram: ";
    //getline(std::cin, s);

    clock_t start = clock();


    get_permutations(s, permutations);
    //std::cout << permutations.size() << std::endl;

    std::cout << "finished permutations\n";

    
    if (permutations.size() > 500000)
    {
        std::cout << "making new\n";
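        const size_t chunk = permutations.size() / 3;
        for (size_t pos = 0; pos < permutations.size(); pos += chunk)
        {
            // clamp the last chunk so it never runs past the end of the vector
            const size_t end = std::min(pos + chunk, permutations.size());
            boost::asio::post(pool, [&dictionary, &permutations, pos, end] { print_valid(dictionary, permutations.begin() + pos, permutations.begin() + end); });
        }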
        pool.join();
    } 
    else
    {
        print_valid(dictionary, permutations.begin(), permutations.end());
    }

    

    clock_t finish = clock();
    double time_elapsed = (finish - start) / static_cast<double>(CLOCKS_PER_SEC);
    std::cout << time_elapsed << "\n";
    std::cout << permutations.size() << std::endl;

    return 0;
}

The permutations are created in get_permutations. The thread pool was an experiment for very large sets of permutations.

CodePudding user response：

Think about how you would go about this by hand - how do you check if two words are anagrams of each other?

e.g.: banana <-> aaannb

How would you solve this on a piece of paper? Would you create all 720 permutations and check if any one matches? Or is there an easier, more intuitive way?

So what makes a word an anagram of another word, i.e. what condition needs to be satisfied?

It's all about letter counts. If both words contain the same number of each letter, they're anagrams of each other.

e.g.:

banana -> 3x a, 2x n, 1x b
aaannb -> 3x a, 2x n, 1x b
same letter counts so they must be anagrams!
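
The letter-count comparison can be sketched in C++ roughly like this. It is only a minimal illustration, assuming that just the letters a-z matter and that case is ignored; count_letters and is_anagram are hypothetical names that do not appear in the question's code.

```cpp
#include <array>
#include <cctype>
#include <string_view>

// Hypothetical helper: counts how often each letter a-z occurs.
// Any other character is ignored (an assumption of this sketch).
std::array<int, 26> count_letters(std::string_view word) {
    std::array<int, 26> counts{};  // value-initialised to all zeros
    for (unsigned char c : word) {
        const int lower = std::tolower(c);
        if (lower >= 'a' && lower <= 'z')
            ++counts[lower - 'a'];
    }
    return counts;
}

// Two words are anagrams exactly when their letter counts match.
bool is_anagram(std::string_view lhs, std::string_view rhs) {
    return count_letters(lhs) == count_letters(rhs);
}
```

With this, is_anagram("banana", "aaannb") is true, and no permutation is ever generated: each word is scanned once.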

So armed with this knowledge can you construct an algorithm that doesn't require iterating all possible permutations?

Solution

I'd recommend reading this only after you've tried to come up with your own optimized algorithm.

You just need to build a lookup-table of letter-counts to dictionary words, e.g.:

1x a, 1x n -> ["an"]
3x a, 1x b, 2x n -> ["banana", "nanaba"]
1x a, 1x p, 1x r, 1x t -> ["part", "trap"]
... etc ...

then you can decompose your search word as well into letter counts, e.g. banana -> 3x a, 1x b, 2x n and search for the decomposition in your lookup table.

The result will be the list of words from your dictionary you can build with the given collection of letters - aka all possible anagrams for the given string.

Assuming some kind of structure named letter_counts that contains the letter composition, the algorithm could look like this:

std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
    // build a lookup map for letter composition -> word
    std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
    for(auto& str : dictionary)
        compositionMap[letter_counts{str}].push_back(str);

    // get all words that are anagrams of the given one
    auto it = compositionMap.find(letter_counts{wordToCheck});
    // no matches in dictionary
    if(it == compositionMap.end())
        return {};

    // list of all anagrams
    auto result = it->second;

    // remove wordToCheck from result if it is present
    result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());

    return result;
}

This will run in O(n) time and has a space-complexity of O(n), with n being the number of words in the dictionary.

(It would be amortized O(1) time if you don't include the construction of the compositionMap as part of the algorithm)
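
To benefit from that, the table has to be built once and reused for every query. Here is a rough sketch of one way to arrange this; anagram_index and its members are hypothetical names, and it assumes only a-z matters. It also swaps in std::map with a std::array key so that no custom hash is needed, which makes each lookup logarithmic in the number of distinct letter compositions rather than amortized constant.

```cpp
#include <array>
#include <cctype>
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical wrapper: builds the lookup table once in the constructor,
// so every later query is a single map lookup.
class anagram_index {
public:
    explicit anagram_index(const std::vector<std::string>& dictionary) {
        for (const auto& word : dictionary)
            table_[key_of(word)].push_back(word);
    }

    // Returns every dictionary word with the same letter counts as `word`
    // (including `word` itself if it is in the dictionary).
    std::vector<std::string> lookup(std::string_view word) const {
        const auto it = table_.find(key_of(word));
        return it == table_.end() ? std::vector<std::string>{} : it->second;
    }

private:
    using key_type = std::array<int, 26>;

    // Letter counts for a-z; other characters are ignored.
    static key_type key_of(std::string_view word) {
        key_type counts{};
        for (unsigned char c : word) {
            const int lower = std::tolower(c);
            if (lower >= 'a' && lower <= 'z')
                ++counts[lower - 'a'];
        }
        return counts;
    }

    // std::array provides operator<, so std::map works without a hash.
    std::map<key_type, std::vector<std::string>> table_;
};
```

The index would be constructed once after loading the word list, and lookup would then be called for each input string.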

In comparison, a permutation-based approach has O(m!) time complexity, with m being the number of letters in the search word (or, as I like to call it, O(scary)).

Here's a full code example that only deals with letters a-z, but you can easily modify letter_counts to make it work with other characters as well:

godbolt example

#include <string_view>
#include <cctype>
#include <vector>
#include <string>
#include <unordered_map>
#include <iostream>
#include <algorithm> // std::remove_if

struct letter_counts {
    static const int num_letters = 26;
    int counts[num_letters];

    explicit letter_counts(std::string_view str) : counts{0} {
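        for(char c : str) {
            // cast to unsigned char first: std::tolower is undefined for negative values
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            if(c >= 'a' && c <= 'z')
                counts[c - 'a']++;
        }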
    }
};

bool operator==(letter_counts const& lhs, letter_counts const& rhs) {
    for(int i = 0; i < letter_counts::num_letters; i++) {
        if(lhs.counts[i] != rhs.counts[i]) return false;
    }

    return true;
}

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

namespace std {
    template<>
    struct hash<letter_counts> {
        size_t operator()(const letter_counts& letterCounts) const
        {
            size_t result = 0;
            auto hasher = std::hash<int>{};
            for(int i : letterCounts.counts)
                hash_combine(result, hasher(i));

            return result;
        }
    };
}



std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
    // build a lookup map for letter composition -> word
    std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
    for(auto& str : dictionary)
        compositionMap[letter_counts{str}].push_back(str);

    // get all words that are anagrams of the given one
    auto it = compositionMap.find(letter_counts{wordToCheck});
    // no matches in dictionary
    if(it == compositionMap.end())
        return {};

    // list of all anagrams
    auto result = it->second;

    // remove wordToCheck from result if it is present
    result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());

    return result;
}

int main() {
    std::vector<std::string> dict = {
        "banana",
        "nanaba",
        "foobar",
        "bazinga"
    };

    std::string word = "aaannb";

    for(auto& str : find_anagrams(dict, word)) {
        std::cout << str << std::endl;
    }
}

CodePudding user response：

The permutation method you have is far too slow, because the number of permutations of a string of n distinct characters is n!, which grows faster than any exponential. Try a hash table with a custom hash and equality predicate instead: the hash is computed from the sorted string, and the equality predicate only tests whether the sorted versions of the two strings are equal. A container such as boost::unordered_map accepts a custom hash function and predicate, so every dictionary word that fits the anagram ends up under the same key.
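
A simplified variant of that idea, sketched under some assumptions: instead of a custom hash and predicate, the sorted string itself is stored as the key, so the standard std::hash<std::string> and operator== do the work. The names sorted_key, build_table and lookup_anagrams are hypothetical, and the input is assumed to be lowercase already.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: the sorted letters of a word serve as its key,
// so all anagrams of a word share one key ("part" and "trap" -> "aprt").
std::string sorted_key(std::string word) {
    std::sort(word.begin(), word.end());
    return word;
}

using anagram_table = std::unordered_map<std::string, std::vector<std::string>>;

// Groups the dictionary words by their sorted key.
anagram_table build_table(const std::vector<std::string>& dictionary) {
    anagram_table table;
    for (const auto& word : dictionary)
        table[sorted_key(word)].push_back(word);
    return table;
}

// Returns the dictionary words that are anagrams of `letters`.
std::vector<std::string> lookup_anagrams(const anagram_table& table,
                                         const std::string& letters) {
    const auto it = table.find(sorted_key(letters));
    return it == table.end() ? std::vector<std::string>{} : it->second;
}
```

Sorting costs O(k log k) for a word of k letters, which is negligible next to generating k! permutations.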

CodePudding user response：

Note that the number of permutations grows very large very quickly. Two words are anagrams if sorting the characters of both words alphabetically produces the same string. Based on that fact, I made the following example, which puts a dictionary into a multimap so that all anagrams of a word can be found quickly. It does this by using the alphabetically sorted input string as the key into the map.

Live demo : https://onlinegdb.com/fXUVZruwq

#include <algorithm>
#include <iostream>
#include <locale>
#include <map>
#include <vector>
#include <set>
#include <string>

// create a class to hold anagram information
class anagram_dictionary_t
{
public:
    
    // create a dictionary based on an input list of words.
    template<std::size_t N>
    explicit anagram_dictionary_t(const std::string (&words)[N])
    {
        for (std::string word : words)
        {
            auto key = make_key(word);
            std::string lower{ word };
            to_lower(lower);
            m_anagrams.insert({ key, lower});
        }
    }

    // find all the words that match the anagram
    auto find_words(const std::string& anagram)
    {
        // get the unique key for input word
        // this is done by sorting all the characters in the input word alphabetically
        auto key = make_key(anagram);

        // lookup all the words with the same key in the dictionary
        auto range = m_anagrams.equal_range(key);

        // create a set of found words
        std::set<std::string> words;
        for (auto it = range.first; it != range.second; ++it)
        {
            words.insert(it->second);
        }

        // return the words
        return words;
    }

    // function to check if two words are an anagram
    bool is_anagram(const std::string& anagram, const std::string& word)
    {
        auto words = find_words(anagram);
        return (words.find(word) != words.end());
    }

private:
    // make a unique key out of an input word
    // all anagrams should map to the same key value
    static std::string make_key(const std::string& word)
    {
        std::string key{ word };
        to_lower(key);

        // two words are anagrams if they sort to the same key
        std::sort(key.begin(), key.end());
        return key;
    }

    static void to_lower(std::string& word)
    {
        for (char& c : word)
        {
            c = std::tolower(c, std::locale());
        }
    }

    std::multimap<std::string, std::string> m_anagrams;
};

int main()
{
    anagram_dictionary_t anagram_dictionary{ {"Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry", "Blueberry" } };
    
    std::string anagram{ "aaannb"};
    auto words = anagram_dictionary.find_words(anagram);
    
    std::cout << "input word = " << anagram << "\n found words : ";
    for (const auto& word : words)
    {
        std::cout << word << "\n";
    }

    return 0;
}