An algorithm reading N words and printing all anagrams-CodePudding

The original problem is here:

Design a O(N log N) algorithm to read in a list of words and print out all anagrams. For example, the strings "comedian" and "demoniac" are anagrams of each other. Assume there are N words and each word contains at most 20 letters. Designing a O(N^2) algorithms should not be too difficult, but getting it down to O(N log N) requires some cleverness.

I am confused since the length of a word does not depend on N. It is a constant number 20. So I thought we can multiply the running time for one word by N. Hence the result will be O(N). However, it seems I miss something.

CodePudding user response：

If they insist on hitting that O(n log n) algorithm (because I think you can do better), a way is as follows:

Iterate over the array and sort each word individually
Keep a one to one map of each sorted word its original (in the form of a tuple for example)
Sort that newly created list by the sorted words.
Iterate over the sorted list, extract the words with equal sorted counterparts (they are adjacent now) and print them.

Example:

array = ["abc", "gba", "bca"]

Sorting each word individually and keeping the original word gives:
new_array = [("abc", "abc"), ("abg", "gba"), ("abc", "bca")]

Sorting the whole array by the first element gives:
new_array = [("abc", "abc"), ("abc", "bca"), ("abg", "gba")]

Now we iterate over the above array and extract words with equal first elements in the tuple, which gives
[("abc", "abc"), ("abc", "bca")] => ["abc", "bca"]

Time complexity analysis:

Looping over the original array is linear n.
sorting each individual word in constant because they will never exceed 20 characters. It's just 20 * log 20 = 100 roughly.
sorting the whole array is linearithmic in n.
the resulting time complexity becomes O(n * log n) where n is the length of the input array.

CodePudding user response：

An algorithm

We are going to find an "identifier" for every class of anagrams. An identifier should be something that:

is unique to this class: no two classes have the same identifier;
can be computed when we're given a single word of the class: given two different words of the same class, we should compute the same identifier.

Once we've done that, all we have to do is group together the words that have the same identifier. There are several different ways of grouping words that have the same identifier; the main two ways are:

sorting the list of words, using the identifier as a comparison key;
using a "map" data structure, for instance a hash table or a binary tree.

Can you think of a good identifier?

An identifier I can think of is the list of letters of the words, in alphabetical order. For instance:

comedian  -->  acdeimno
dog  -->  dgo
god  -->  dgo
hello  -->  ehllo
hole  -->  ehlo
demoniac  -->  acdeimno

Implementation in python

words = 'comedian dog god hello hole demoniac'.split()

d = {}
for word in words:
    d.setdefault(''.join(sorted(word)), []).append(word)

print(list(d.values()))

[['comedian', 'demoniac'], ['dog', 'god'], ['hello'], ['hole']]

The explanation

The most important thing here is that for each word, we computed ''.join(sorted(word)). That's the identifier I mentioned earlier. In fact, I didn't write the earlier example by hand; I printed it with python using the following code:

for word in words:
    print(word, ' --> ', ''.join(sorted(word)))

comedian  -->  acdeimno
dog  -->  dgo
god  -->  dgo
hello  -->  ehllo
hole  -->  ehlo
demoniac  -->  acdeimno

So what is this? For each class of anagrams, we've made up a unique word to represent that class. "comedian" and "demoniac" both belong to the same class, represented by "acdeimno".

Once we've managed to do that, all that is left is to group the words which have the same representative. There are a few different ways to do that. In the python code, I have used a python dict, a dictionary, which is effectively a hashtable mapping the representative to the list of corresponding words.

Another way, if you don't know about map data structures, is to sort the list, which takes O(N log N) operations, using the representative as the comparison key:

print( sorted(words, key=lambda word:''.join(sorted(word))) )

['comedian', 'demoniac', 'dog', 'god', 'hello', 'hole']

Now, all words that belong to the same class of synonyms are adjacent. All that's left for you to do is iterate through this sorted list, and group elements which have the same key. This is only O(N). So the longest part of the algorithm was sorting the list.

CodePudding user response：

A - You can do it in o(n) by using a hash table (or a dict in python)

B - Then you add of for i in 1..sqrt(n) at each step

This is the best way to make a n.sqrt(n) where a o(n) algorithm exists.

0 - Let d=dict() a python dictionnary

1 - Iterate over the array:

for each word w, let s=word sorted by increasing letter value

if s already in d, add w to d[s]

if not d[s]=[w]

for i in 1.. sqrt(n) : do nothing // needed to slow from o(n) to o(n.sqrt())

2 - Print anagrams

foreach (k,l) in d

if len(l)>1 print l