I'm trying to create a function in python that from a list of strings will return me a dict where the key(index) shows the most frequent character for each index between all the strings. for example a list1 = ['one', 'two', 'twin', 'who'] should return index 0=t index 1=w index 2=o index 3=n in fact the most frequent character at the index 1 between all the string is 'w'. I found a solution but if I have lists with thousands of strings inside it will require too much time to perform. I would like to know if you can give me some help to decrease the time of execution, without importing any libraries.
Here is what I tried to do but seems too slow to perform with lists of thousands strings inside
list1 = ['one', 'two', 'twin', 'who']
width = len(max(list1, key=len))
chars = {}
for i, item in enumerate(zip(*[s.ljust(width) for s in list1])):
set1 = set(item)
if ' ' in set1:
set1.remove(' ')
chars[i] = max(set1, key=item.count)
print(chars)
CodePudding user response:
Whether something is quick enough is a matter of use case, but this solution uses a couple of seconds to go through the default wordlist available under OS X.
Python's collections.Counter
implements a counter object for you, so you don't have keep track of the counts of multiple possible values yourself.
I've paired it with defaultdict
, which intializes a key with a function if the key is undefined - so that if we haven't already seen the index we're updating the count for, it gets initialized to a Counter object that we then update.
from collections import defaultdict, Counter
with open("/usr/share/dict/words") as f:
words = f.read().splitlines()
letters = defaultdict(lambda: Counter())
for word in words:
for idx, letter in enumerate(word):
letters[idx].update((letter, ))
for idx, counter in letters.items():
print(idx, counter.most_common(1))
Whether this is quick enough depends on your use case as mentioned; it can be done a lot quicker if necessary, but it's probably quick enough. For 235 886 words the runtime is:
python3 letterfreq.py 2.67s user 0.04s system 99% cpu 2.734 total
This assumes that every word is lowercased, if not, lowercase it before adding it to your Counter object.
If you want to implement it without using the Counter
or defaultdict
parts of the standard library (which are just helper functionality to avoid reimplementing the same small code repeatedly), you can do the exact thing yourself manually:
with open("/usr/share/dict/words") as f:
words = f.read().splitlines()
letter_positions = {}
for word in words:
for idx, letter in enumerate(word):
if idx not in letter_positions:
letter_positions[idx] = {}
if letter not in letter_positions[idx]:
letter_positions[idx][letter] = 0
letter_positions[idx][letter] = 1
final_dict = {}
for idx, counts in letter_positions.items():
most_popular = sorted(counts.items(), key=lambda v: v[1], reverse=True)
print(idx, most_popular)
final_dict[idx] = most_popular[0][0]
print(final_dict)
Then pick as many entries as necessary from most_popular
when going through the list afterwards.
Since we're no longer using the defaultdict
and Counter
abstractions, our running time is now about a third of the previous one:
python3 letterfreq2.py 1.08s user 0.03s system 98% cpu 1.124 total
It's usually a good idea to go through what you're trying to do and formulate a strategy - i.e. "ok, I need to keep track of how many times a letter has appeared in this location .. so for that I need some way to keep values for each index .. and then for each letter ..".
CodePudding user response:
I just make some improvements based on your algorithm.
First, you can use itertools.zip_longest() instead of zip() to remove the need of ljust()
and the width
variable:
from itertools import zip_longest
list1 = ['one', 'two', 'twin', 'who']
chars = {}
for i, item in enumerate(zip_longest(*list1)):
set1 = set(item)
if None in set1:
set1.remove(None)
chars[i] = max(set1, key=item.count)
print(chars)
Then, replace max(set1, key=item.count)
with a more efficent way Counter(item).most_common(1)[0][0]
, combined with or set1.most_common(2)[1][0]
to filter None
values
from itertools import zip_longest
from collections import Counter
list1 = ['one', 'two', 'twin', 'who']
chars = {}
for i, item in enumerate(zip_longest(*list1)):
set1 = Counter(item)
chars[i] = set1.most_common(1)[0][0] or set1.most_common(2)[1][0]
print(chars)
As itertools
and collections
are Python built-in modules, you can import them directly without pip install
them.