I want to count the number of two-character vowel permutations that are contained within a list of 5-letter words. Vowel permutations are like 'aa', 'ae', 'ai', ..., 'ui', 'uo', 'uu'.
I successfully got this done using apply(), but it is slow. I want to see if there is a fast vectorized way to get this done, but I can't think of one.
Here's what I did:
import pandas as pd
import itertools

vowels = list('aeiou')
vowel_perm = [x[0] + x[1] for x in itertools.product(vowels, vowels)]

def wide_contains(x):
    return pd.Series(data=[c in x for c in vowel_perm], index=vowel_perm)

dfwd['word'].apply(wide_contains).sum()
aa 1
ae 2
ai 12
ao 2
au 8
ea 15
ee 15
ei 1
eo 5
eu 2
ia 7
ie 10
ii 0
io 3
iu 0
oa 2
oe 2
oi 3
oo 11
ou 7
ua 2
ue 9
ui 2
uo 0
uu 0
The above is the expected output using the following data:
word_lst = ['gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid',
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute',
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch',
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge',
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali',
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie',
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie',
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch']
dfwd = pd.DataFrame(word_lst, columns=['word'])
CodePudding user response:
Well, if not using Pandas to do this computation at all is alright, it looks like plain old Counter() is about 220x faster on my machine for the given data.
from collections import Counter
import timeit
def timetest(func, name=None):
    name = name or getattr(func, "__name__", None)
    iters, time = timeit.Timer(func).autorange()
    iters_per_sec = iters / time
    print(f"{name=} {iters=} {time=:.3f} {iters_per_sec=:.2f}")
def counter():
    ctr = Counter()
    for word in dfwd['word']:
        for perm in vowel_perm:
            if perm in word:
                ctr[perm] += 1
    return ctr
# the question's apply-based approach, wrapped so it can be timed
def original():
    return dfwd['word'].apply(wide_contains).sum()

timetest(original)
timetest(counter)
print(counter())
outputs
name='original' iters=10 time=0.229 iters_per_sec=43.59
name='counter' iters=2000 time=0.212 iters_per_sec=9434.29
Counter({'ea': 15, 'ee': 15, 'ai': 12, 'oo': 11, 'ie': 10, 'ue': 9, 'au': 8, 'ou': 7, 'ia': 7, 'eo': 5, 'io': 3, 'oi': 3, 'oa': 2, 'ui': 2, 'ao': 2, 'ua': 2, 'eu': 2, 'oe': 2, 'ae': 2, 'ei': 1, 'aa': 1})
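One difference from the apply() output: Counter() only reports pairs that actually occur, so the all-zero pairs ('ii', 'iu', 'uo', 'uu') are absent. If you want the full 25-pair result, a minimal sketch (reusing counter() and vowel_perm from above; full is a name introduced here):
# reindex on all 25 pairs; missing ones become NaN, then 0
full = pd.Series(counter(), index=vowel_perm).fillna(0).astype(int)
print(full)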
CodePudding user response:
How about a dictionary comprehension? It should be faster than using apply:
{v: dfwd['word'].str.count(v).sum() for v in vowel_perm}
# 6.9 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{'aa': 1,
'ae': 2,
'ai': 12,
'ao': 2,
'au': 8,
'ea': 15,
'ee': 15,
'ei': 1,
'eo': 5,
'eu': 2,
'ia': 7,
'ie': 10,
'ii': 0,
'io': 3,
'iu': 0,
'oa': 2,
'oe': 2,
'oi': 3,
'oo': 11,
'ou': 7,
'ua': 2,
'ue': 9,
'ui': 2,
'uo': 0,
'uu': 0}
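One subtlety: str.count sums occurrences, while the original apply() checks membership, so the two could diverge if a word contained the same pair twice (they happen to agree on this word list). A sketch that preserves the membership semantics, using str.contains with regex=False:
# count words containing each pair, rather than total occurrences
{v: dfwd['word'].str.contains(v, regex=False).sum() for v in vowel_perm}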
CodePudding user response:
Another option is to simply iterate over the vowel pairs and count how many words in word_lst contain each pair. Note that for the current task you don't need to create an explicit vowel_perm list either; simply iterate over the map object:
out = pd.Series({pair: sum(True for w in word_lst if pair in w)
for pair in map(''.join, itertools.product(vowels,vowels))})
On my machine, a benchmark says:
>>> %timeit out = pd.Series({pair: sum(True for w in word_lst if pair in w) for pair in map(''.join, itertools.product(vowels,vowels))})
492 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit vowel_perm = [x[0] + x[1] for x in itertools.product(vowels,vowels)]; out = dfwd['word'].apply(wide_contains).sum()
40.6 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
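For larger word lists, a single-pass variant may also be worth sketching: instead of testing all 25 pairs against every word, scan each word's adjacent character pairs once (the set comprehension deduplicates within a word, matching the membership semantics above; vowel_set and ctr are names introduced here):
from collections import Counter

vowel_set = set('aeiou')
# collect each word's distinct bigrams, keep only the all-vowel ones
ctr = Counter(p for w in word_lst
                for p in {w[i:i + 2] for i in range(len(w) - 1)}
                if p[0] in vowel_set and p[1] in vowel_set)
Like the Counter() answer above, this omits pairs that never occur; reindexing against the full pair list restores them.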
CodePudding user response:
You can use numpy.char.find to search an array of strings for a substring:
import numpy as np
import pandas as pd
from itertools import product

word_lst = np.array([
'gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid',
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute',
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch',
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge',
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali',
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie',
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie',
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch'
]).astype("U")
out = pd.Series({
    perm: (np.char.find(word_lst, perm) != -1).sum()
    for perm in ["".join(p) for p in product("aeiou", repeat=2)]
})
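To push the reduction into NumPy as well, one possible sketch stacks the per-pair boolean masks and sums once (pairs and mask are names introduced here):
pairs = np.array(["".join(p) for p in product("aeiou", repeat=2)])
# one boolean row per pair: True where a word contains that pair
mask = np.stack([np.char.find(word_lst, p) != -1 for p in pairs])
out = pd.Series(mask.sum(axis=1), index=pairs)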