Is there a vectorized way to check whether the strings in a column are substrings of another string in Pandas?

I have a pandas Series, and I want to filter it by checking whether each string in it is a substring of another string.

For example,

sentence = "hello world"
words = pd.Series(["hello", "wo", "d", "panda"])

I then want to get the Series of words that are substrings of "hello world", as below.

filtered_words = pd.Series(["hello", "wo", "d"])

Maybe there are ways to do this with "apply" or something similar, but that doesn't look vectorized.

How can I do it?

CodePudding user response：

There is no vectorized way to do this; you'll need to loop.

Whether you use apply or map, pandas does exactly the same thing and loops over the elements. A slightly faster way is to use a pure Python list comprehension.

filtered_words = words[[x in sentence for x in words]]
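For reference, here is a self-contained version of the snippet above, using the names from the question. The list comprehension builds a plain Python list of booleans, which pandas accepts as a boolean mask. The intermediate name `mask` is introduced here only for illustration; this is a minimal sketch, not the only way to write it.

```python
import pandas as pd

sentence = "hello world"
words = pd.Series(["hello", "wo", "d", "panda"])

# Build one boolean per element: True when the word occurs inside `sentence`.
mask = [x in sentence for x in words]

# Index the Series with the boolean list; the original index labels are kept.
filtered_words = words[mask]
print(filtered_words.tolist())  # ['hello', 'wo', 'd']
```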

Below are timings on a 400k-row Series, w:

%%timeit
w.map(lambda x : x in sentence)
103 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
w.apply(lambda x : x in sentence )
121 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
[x in sentence for x in w]
85.8 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

NB. apply is sometimes faster than map (or within the margin of error), but the pure Python list comprehension is always faster, by ~15-25%.
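To repeat this comparison outside IPython, something like the following sketch with the standard `timeit` module should work. The answer does not say how `w` was built, so the construction below (the four example words repeated 100,000 times) is an assumption, as are the names `candidates` and `timings`. Absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

sentence = "hello world"
# Assumption: a 400k-row Series made by repeating the four example words;
# the original answer does not show how `w` was constructed.
w = pd.Series(["hello", "wo", "d", "panda"] * 100_000)

# The three approaches from the timings above, wrapped as zero-argument callables.
candidates = {
    "map": lambda: w.map(lambda x: x in sentence),
    "apply": lambda: w.apply(lambda x: x in sentence),
    "list comprehension": lambda: [x in sentence for x in w],
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats with 2 calls each, converted to milliseconds per call.
    best = min(timeit.repeat(func, number=2, repeat=3)) / 2
    timings[name] = best * 1000
    print(f"{name}: {timings[name]:.1f} ms per call")
```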

CodePudding user response：

Let us try

out = words[words.map(lambda x : x in sentence )]
0    hello
1       wo
2        d
dtype: object

CodePudding user response：

How about:

out = words[words.apply(lambda x: x in sentence)]

But a list comprehension is still pretty fast (note that this returns a plain list, not a Series):

out = [w for w in words if w in sentence]

CodePudding user response：

You could use a list comprehension

filtered_words = pd.Series([word for word in words if word in sentence])

Or as you say, you could use apply

# Boolean mask: True where the word occurs in the sentence
word_mask = words.apply(lambda word: word in sentence)
# Index with the mask directly; there is no need to multiply the strings by the mask
filtered_words = words[word_mask]
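One practical difference between the variants in this thread is the index of the result. Indexing with a boolean mask keeps the original index labels, while rebuilding a Series from a list comprehension gives a fresh 0..n-1 index. The sketch below reorders the example words so the difference is visible; that ordering and the names `masked`, `rebuilt` and `same` are assumptions made only for illustration.

```python
import pandas as pd

sentence = "hello world"
# Assumption: "panda" is moved to the front so the surviving labels are not 0..n-1.
words = pd.Series(["panda", "hello", "wo", "d"])

# Boolean mask: the surviving rows keep their original labels.
masked = words[[word in sentence for word in words]]

# Rebuilt Series: same values, but a fresh default index.
rebuilt = pd.Series([word for word in words if word in sentence])

print(list(masked.index))   # [1, 2, 3]
print(list(rebuilt.index))  # [0, 1, 2]

# Dropping the old labels makes the masked result match the rebuilt one.
same = masked.reset_index(drop=True).equals(rebuilt)
```

If downstream code aligns on the index (for example when assigning back into a DataFrame), the mask form is usually the safer choice; if only the values matter, either works.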