I have a set of lists of strings, e.g.
a = ['one', 'two', 'three']
b = ['yes', 'no']
c = ['days', 'months', 'years', 'decades']
and a vocabulary list, e.g.
vocabList = ['yes', 'i', 'am', 'twenty', 'three', 'years', 'old', 'today']
From this I am trying to build a matrix or pandas DataFrame where each row represents one of the string lists (i.e. my lists a, b, c, etc.) and each column represents a word from the vocabulary list. Each entry is 1 if that row's string list contains the column's word, and 0 otherwise.
For the example above, it would look like
[[0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0]]
Now I could do this easily using
pd.DataFrame(vocabList).isin(a).astype(int)
and then use a for loop to concatenate the same kind of binary vector for each string list a, b, c (and then the easy stuff like converting to a matrix, transposing, etc.).
However, in practice I have many hundreds of thousands of such string lists, and I am wondering whether there is a quicker way to do this than concatenating in a for loop.
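For concreteness, a minimal sketch of that loop-style approach (assuming pandas imported as pd, and the lists a, b, c and vocabList from above):

import pandas as pd

# one binary column per string list, stacked side by side, then transposed
cols = [pd.DataFrame(vocabList).isin(lst).astype(int) for lst in (a, b, c)]
mat = pd.concat(cols, axis=1).T.to_numpy()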
CodePudding user response:
If you don't have too many different words, you could use str.get_dummies and reindex:
L = [a, b, c]
df = (pd.Series(map('|'.join, L))                  # one 'w1|w2|...' string per list
      .str.get_dummies()                           # split on '|' and one-hot encode
      .reindex(vocabList, axis=1, fill_value=0)    # keep vocab columns, in order
     )
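This works because get_dummies splits each string on its default '|' separator and one-hot encodes the tokens, while reindex restricts and reorders the columns to the vocabulary (filling words that never occur with 0). A small demonstration (note this assumes no word contains '|' itself):

pd.Series(['one|two|three', 'yes|no']).str.get_dummies()
#    no  one  three  two  yes
# 0   0    1      1    1    0
# 1   1    0      0    0    1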
Variant using a set to pre-filter the words, so that get_dummies never builds columns for words outside the vocabulary:
L = [a, b, c]
S = set(vocabList)
df = (pd.Series(('|'.join(S.intersection(l)) for l in L))  # drop non-vocab words first
      .str.get_dummies()
      .reindex(vocabList, axis=1, fill_value=0)
     )
Output:
   yes  i  am  twenty  three  years  old  today
0    0  0   0       0      1      0    0      0
1    1  0   0       0      0      0    0      0
2    0  0   0       0      0      1    0      0
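If you need the plain matrix from the question rather than a DataFrame, you can convert at the end:

df.to_numpy()
# array([[0, 0, 0, 0, 1, 0, 0, 0],
#        [1, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 1, 0, 0]])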
Variant in pure Python (intersecting once per list rather than once per word):
L = [a, b, c]
S = set(vocabList)
[[int(x in words) for x in vocabList]          # 1 if the vocab word is in the list
 for words in (S.intersection(l) for l in L)]
Output:
[[0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0]]
Timing on 300k rows (100k repeats of the provided example):
# pandas
1.03 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pure python
1.43 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pandas + set
1.56 s ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NB: the relative timings can change with more unique words, as the next benchmark shows.
Timing on 3M items (3000 rows, 1000 words, 40% chance found in word list):
# pandas + set
1.58 s ± 52.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pandas
2.04 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pure python
12.8 s ± 549 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
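The exact benchmark code isn't shown here; a minimal sketch of how the first (300k-row) timing could be reproduced, assuming the lists a, b, c and vocabList from the question:

import timeit
import pandas as pd

L = [a, b, c] * 100_000     # 300k rows: 100k repeats of the example
S = set(vocabList)

def pandas_approach():
    return (pd.Series(map('|'.join, L))
              .str.get_dummies()
              .reindex(vocabList, axis=1, fill_value=0))

def pure_python():
    return [[int(x in words) for x in vocabList]
            for words in (S.intersection(l) for l in L)]

print(timeit.timeit(pandas_approach, number=1))
print(timeit.timeit(pure_python, number=1))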