I have a set of lists of strings, e.g.
a = ['one', 'two', 'three']
b = ['yes', 'no']
c = ['days', 'months', 'years', 'decades']
and a vocabulary list, e.g.
vocabList = ['yes', 'i', 'am', 'twenty', 'three', 'years', 'old', 'today']
From this I am trying to build a matrix or pandas DataFrame where each row represents one of the string lists (i.e. my lists a, b, c, etc.) and each column represents a word from the vocabulary list. Each entry is 1 if that row's string list contains the column's word, and 0 otherwise.
For the example above, it would look like
[[0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0]]
Now I could do this easily using
pd.DataFrame(vocabList).isin(a).astype(int)
and then use a for loop to concatenate the same kind of binary vector for each string list a, b, c (and then the easy stuff like converting to a matrix, transposing, etc.).
However, in practice I have many hundreds of thousands of such string lists, and I am wondering whether there is a quicker way to do this than concatenating in a for loop.
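For concreteness, a minimal sketch of that loop-style approach (assuming pandas imported as pd, and the lists a, b, c and vocabList from above):

import pandas as pd

# one binary column per string list, stacked side by side, then transposed
cols = [pd.DataFrame(vocabList).isin(lst).astype(int) for lst in (a, b, c)]
mat = pd.concat(cols, axis=1).T.to_numpy()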
CodePudding user response:
If you don't have too many different words, you could use str.get_dummies and reindex:
L = [a, b, c]
df = (pd.Series(map('|'.join, L))                  # one 'w1|w2|...' string per list
      .str.get_dummies()                           # split on '|' and one-hot encode
      .reindex(vocabList, axis=1, fill_value=0)    # keep vocab columns, in order
     )
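This works because get_dummies splits each string on its default '|' separator and one-hot encodes the tokens, while reindex restricts and reorders the columns to the vocabulary (filling words that never occur with 0). A small demonstration (note this assumes no word contains '|' itself):

pd.Series(['one|two|three', 'yes|no']).str.get_dummies()
#    no  one  three  two  yes
# 0   0    1      1    1    0
# 1   1    0      0    0    1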
Variant using a set to pre-filter the words, so that get_dummies never builds columns for words outside the vocabulary:
L = [a, b, c]
S = set(vocabList)
df = (pd.Series(('|'.join(S.intersection(l)) for l in L))  # drop non-vocab words first
      .str.get_dummies()
      .reindex(vocabList, axis=1, fill_value=0)
     )
Output:
   yes  i  am  twenty  three  years  old  today
0    0  0   0       0      1      0    0      0
1    1  0   0       0      0      0    0      0
2    0  0   0       0      0      1    0      0
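If you need the plain matrix from the question rather than a DataFrame, you can convert at the end:

df.to_numpy()
# array([[0, 0, 0, 0, 1, 0, 0, 0],
#        [1, 0, 0, 0, 0, 0, 0, 0],
#        [0, 0, 0, 0, 0, 1, 0, 0]])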
Variant in pure Python (intersecting once per list rather than once per word):
L = [a, b, c]
S = set(vocabList)
[[int(x in words) for x in vocabList]          # 1 if the vocab word is in the list
 for words in (S.intersection(l) for l in L)]
Output:
[[0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0]]
Timing on 300k rows (100k repeats of the provided example):
# pandas
1.03 s ± 24.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pure python
1.43 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pandas + set
1.56 s ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NB: the relative timings can change with more unique words, as the next benchmark shows.
Timing on 3M items (3000 rows, 1000 words, 40% chance found in word list):
# pandas + set
1.58 s ± 52.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pandas
2.04 s ± 35.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# pure python
12.8 s ± 549 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
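The exact benchmark code isn't shown here; a minimal sketch of how the first (300k-row) timing could be reproduced, assuming the lists a, b, c and vocabList from the question:

import timeit
import pandas as pd

L = [a, b, c] * 100_000     # 300k rows: 100k repeats of the example
S = set(vocabList)

def pandas_approach():
    return (pd.Series(map('|'.join, L))
              .str.get_dummies()
              .reindex(vocabList, axis=1, fill_value=0))

def pure_python():
    return [[int(x in words) for x in vocabList]
            for words in (S.intersection(l) for l in L)]

print(timeit.timeit(pandas_approach, number=1))
print(timeit.timeit(pure_python, number=1))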