Is there a fast (vectorized) way to count a list of string contained within another list of strings?


I want to count the number of two-character vowel permutations that are contained within a list of 5-letter words. Vowel permutations are like 'aa', 'ae', 'ai', ..., 'ui', 'uo', 'uu'.

I got this working using apply(), but it is slow. I want to see if there is a fast, vectorized way to do it; I can't think of one.

Here's what I did:

import pandas as pd 
import itertools

vowels = list('aeiou')
vowel_perm = [x[0] + x[1] for x in itertools.product(vowels, vowels)]

def wide_contains(x):
    return pd.Series(data=[c in x for c in vowel_perm], index=vowel_perm) 

dfwd['word'].apply(wide_contains).sum()

aa     1
ae     2
ai    12
ao     2
au     8
ea    15
ee    15
ei     1
eo     5
eu     2
ia     7
ie    10
ii     0
io     3
iu     0
oa     2
oe     2
oi     3
oo    11
ou     7
ua     2
ue     9
ui     2
uo     0
uu     0

The above is the expected output using the following data:

word_lst = ['gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch']

dfwd = pd.DataFrame(word_lst, columns=['word'])

CodePudding user response:

Well, if it's alright not to use Pandas for this computation at all, it looks like plain old Counter() is about 220x faster on my machine for the given data.

from collections import Counter
import timeit


def timetest(func, name=None):
    name = name or getattr(func, "__name__", None)
    iters, time = timeit.Timer(func).autorange()
    iters_per_sec = iters / time
    print(f"{name=} {iters=} {time=:.3f} {iters_per_sec=:.2f}")


def counter():
    ctr = Counter()
    for word in dfwd['word']:
        for perm in vowel_perm:
            if perm in word:
                ctr[perm] += 1
    return ctr


# "original" wraps the apply() approach from the question for comparison.
def original():
    return dfwd['word'].apply(wide_contains).sum()


timetest(original)
timetest(counter)
print(counter())

outputs

name='original' iters=10 time=0.229 iters_per_sec=43.59
name='counter' iters=2000 time=0.212 iters_per_sec=9434.29
Counter({'ea': 15, 'ee': 15, 'ai': 12, 'oo': 11, 'ie': 10, 'ue': 9, 'au': 8, 'ou': 7, 'ia': 7, 'eo': 5, 'io': 3, 'oi': 3, 'oa': 2, 'ui': 2, 'ao': 2, 'ua': 2, 'eu': 2, 'oe': 2, 'ae': 2, 'ei': 1, 'aa': 1})
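
For what it's worth, the inner loop can also be pushed into Counter.update() with a generator expression. A minimal sketch of the same counting logic (not separately benchmarked):

def counter_update():
    ctr = Counter()
    for word in dfwd['word']:
        # update() accepts any iterable of keys; every pair contained
        # in this word increments its count by one.
        ctr.update(perm for perm in vowel_perm if perm in word)
    return ctr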

CodePudding user response:

How about a dictionary comprehension? It should be faster than using apply(). One caveat: str.count counts every occurrence of a pair within a word, while the original c in x test counts each word at most once; for this dataset the two happen to agree (see the example after the output below).

{v: dfwd['word'].str.count(v).sum() for v in vowel_perm}
# 6.9 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

{'aa': 1,
 'ae': 2,
 'ai': 12,
 'ao': 2,
 'au': 8,
 'ea': 15,
 'ee': 15,
 'ei': 1,
 'eo': 5,
 'eu': 2,
 'ia': 7,
 'ie': 10,
 'ii': 0,
 'io': 3,
 'iu': 0,
 'oa': 2,
 'oe': 2,
 'oi': 3,
 'oo': 11,
 'ou': 7,
 'ua': 2,
 'ue': 9,
 'ui': 2,
 'uo': 0,
 'uu': 0}
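
To make the caveat above concrete, here is a made-up word (not from the dataset) where the two counting rules disagree:

pd.Series(['aabaa']).str.count('aa')   # 2: counts every non-overlapping occurrence
'aa' in 'aabaa'                        # True: the word counts only once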

CodePudding user response:

Another option is to simply iterate over the vowel pairs and count the number of occurrences of each pair in word_lst. Note that for the current task, you don't need to create the explicit vowel_perm list either; simply iterate over the map object:

out = pd.Series({pair: sum(True for w in word_lst if pair in w) 
                 for pair in map(''.join, itertools.product(vowels,vowels))})

On my machine, a benchmark says:

>>> %timeit out = pd.Series({pair: sum(True for w in word_lst if pair in w) for pair in map(''.join, itertools.product(vowels,vowels))})
492 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit vowel_perm = [x[0] x[1] for x in itertools.product(vowels,vowels)]; out = dfwd['word'].apply(wide_contains).sum()
40.6 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
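
As an aside, sum(True for w in word_lst if pair in w) can be shortened: booleans are integers in Python, so the containment tests sum directly. An equivalent sketch:

out = pd.Series({pair: sum(pair in w for w in word_lst)
                 for pair in map(''.join, itertools.product(vowels, vowels))})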

CodePudding user response:

You can use numpy.char.find to search an array of strings for a substring. (Note that you need itertools.product, not permutations, so that doubled pairs like 'aa' and 'ee' are included.)

import numpy as np

word_lst = np.array([
    'gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
    'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
    'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
    'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
    'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
    'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
    'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
    'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch'
]).astype("U")

out = pd.Series({
    perm: (np.char.find(word_lst, perm) != -1).sum()
    for perm in ["".join(p) for p in itertools.product("aeiou", repeat=2)]
})
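
If you want to drop the Python-level loop over the pairs too, np.char.find broadcasts its arguments like other NumPy functions, so every pair can be tested against every word in one call. A sketch, assuming the word_lst array and imports from above:

pairs = np.array(["".join(p) for p in itertools.product("aeiou", repeat=2)])
# (25, 1) against (1, 120) broadcasts to a (25, 120) boolean grid that
# marks which words contain which pair.
hits = np.char.find(word_lst[None, :], pairs[:, None]) != -1
out = pd.Series(hits.sum(axis=1), index=pairs)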