I have a pandas Series, and I want to filter it by checking whether each string in it is a substring of another string.
For example,
sentence = "hello world"
words = pd.Series(["hello", "wo", "d", "panda"])
Then I want to get a Series containing only the words that are substrings of "hello world", as below.
filtered_words = pd.Series(["hello", "wo", "d"])
Maybe there are ways to do this with "apply" or something similar, but that doesn't look vectorized.
How can I do this?
CodePudding user response:
There is no vectorized way to do this; you'll need to loop.
Whether you use apply or map, both do exactly the same thing and loop. A slightly faster way is to use a pure Python list comprehension.
filtered_words = words[[x in sentence for x in words]]
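For completeness, here is a self-contained version of the line above, as a minimal sketch using the sentence and words from the question. The list of booleans acts as a mask, so only the words found in the sentence are kept.

```python
import pandas as pd

sentence = "hello world"
words = pd.Series(["hello", "wo", "d", "panda"])

# Build a plain Python list of booleans, one per word,
# then use it as a boolean mask on the Series.
mask = [x in sentence for x in words]
filtered_words = words[mask]

print(filtered_words.tolist())  # ['hello', 'wo', 'd']
```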
Below are timings on a Series w of 400k rows:
%%timeit
w.map(lambda x : x in sentence)
103 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
w.apply(lambda x : x in sentence )
121 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
[x in sentence for x in w]
85.8 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
NB. apply is sometimes faster than map (or within the margin of error), but the pure Python version is always faster by ~15-25%.
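If you want to reproduce the comparison outside a notebook, something like the following should work with the standard timeit module. This is only a sketch: how w was built is not shown above, so repeating the four example words to reach 400k rows is an assumption, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

sentence = "hello world"
# Assumption: the test data is not shown in the answer, so the four
# example words are simply repeated here to reach 400k rows.
w = pd.Series(["hello", "wo", "d", "panda"] * 100_000)

candidates = {
    "map": lambda: w.map(lambda x: x in sentence),
    "apply": lambda: w.apply(lambda x: x in sentence),
    "list comprehension": lambda: [x in sentence for x in w],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, 3 calls per repeat, reported as seconds per call.
    timings[name] = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{name}: {timings[name] * 1000:.1f} ms per call")
```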
CodePudding user response:
Let us try
out = words[words.map(lambda x : x in sentence )]
0 hello
1 wo
2 d
dtype: object
CodePudding user response:
How about:
out = words[words.apply(lambda x: x in sentence)]
But list comprehension is still pretty fast:
out = [w for w in words if w in sentence]
CodePudding user response:
You could use list comprehension
filtered_words = pd.Series([word for word in words if word in sentence])
Or as you say, you could use apply
word_mask = words.apply(lambda word: word in sentence)
filtered_words = words[word_mask]
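One practical difference between the two approaches is the index of the result: masking the original Series keeps the original labels, while building a new Series from a list comprehension renumbers from 0. A small sketch to illustrate, where the word order is hypothetical and chosen so that the non-matching word is not the last one:

```python
import pandas as pd

sentence = "hello world"
# Hypothetical ordering: "panda" is moved to the middle so that
# the difference in index labels becomes visible.
words = pd.Series(["hello", "panda", "wo", "d"])

# Masking the original Series keeps the original index labels.
by_mask = words[[word in sentence for word in words]]

# Rebuilding from a list gives a fresh 0..n-1 index.
rebuilt = pd.Series([word for word in words if word in sentence])

print(by_mask.index.tolist())  # [0, 2, 3]
print(rebuilt.index.tolist())  # [0, 1, 2]
```

If you go the masking route and want the renumbered index anyway, by_mask.reset_index(drop=True) gives it.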