Vectorized version of pandas Series.str.find

Time:11-30

The Series.str.find() function in pandas seems to take only a single integer for the start location. I have a Series containing strings and an array of start positions, and I want to find the position of a given substring starting from the corresponding position of each element as follows:

import pandas as pd

a = pd.Series(data=['aaba', 'ababc', 'caaauuab'])
a.str.find('b', start=[0, 1, 2])  # returns a series of NaNs

I can do this with a list comprehension:

[s.find('b', pos) for s, pos in zip(a.values, [0, 1, 2])]

Is there a function in numpy or pandas that can do this directly and faster? Also, is there one that can take an array of substrings as well?

CodePudding user response:

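I think this is a more Pythonic way to do it, because you do not have to keep a separate list of positions. Note that it uses each element's index in the Series as its start position, so it only matches your example because the positions there are 0, 1, 2: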

import pandas as pd

def find_from_index(series: pd.Series, to_find: str) -> pd.Series:
    # enumerate() supplies the element's position, which is used as the start offset
    return pd.Series([v.find(to_find, i) for i, v in enumerate(series)])

a = pd.Series(data=['aaba', 'ababc', 'cbaauuab'])
b = find_from_index(a, 'b')

Hope this helps

CodePudding user response:

No, there isn't. Vectorizing string operations is difficult.

You could consider converting your strings to arrays of characters, but the conversion would be the limiting step. A quick test tells me that the conversion alone takes roughly the same time as the list comprehension in your question, and that is before even searching for the position.
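A rough way to reproduce that kind of comparison is sketched below. The answer does not say how the conversion was done, so this assumes a plain `list(s)` per string as a stand-in for "arrays of characters"; the data size and repeat counts are arbitrary, and timings will vary by machine.

```python
import timeit

# Hypothetical test data: the question's three strings, repeated.
strings = ['aaba', 'ababc', 'caaauuab'] * 10_000
starts = [0, 1, 2] * 10_000

def search():
    # The list comprehension from the question.
    return [s.find('b', pos) for s, pos in zip(strings, starts)]

def convert_only():
    # Only the conversion step; no searching is done here.
    return [list(s) for s in strings]

# Best of 3 runs, each run calling the function 5 times.
t_search = min(timeit.repeat(search, number=5, repeat=3))
t_convert = min(timeit.repeat(convert_only, number=5, repeat=3))
print(f"search:  {t_search:.4f} s")
print(f"convert: {t_convert:.4f} s")
```

If the conversion alone costs about as much as the whole search, any approach built on it cannot come out ahead.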

In short, your current approach is probably about as efficient as it gets.
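As for taking an array of substrings as well, the same list comprehension extends naturally. Here is a minimal sketch; `find_each` is a hypothetical helper name, not a pandas or numpy function, and it accepts either one substring or one substring per element:

```python
from itertools import repeat

def find_each(strings, subs, starts):
    """Call str.find element-wise, with a per-element start position.

    `subs` may be a single string (used for every element) or an
    iterable with one substring per element.
    """
    if isinstance(subs, str):
        subs = repeat(subs)  # reuse the same substring for every element
    return [s.find(sub, pos) for s, sub, pos in zip(strings, subs, starts)]

strings = ['aaba', 'ababc', 'caaauuab']
print(find_each(strings, 'b', [0, 1, 2]))              # [2, 1, 7]
print(find_each(strings, ['b', 'c', 'u'], [0, 1, 2]))  # [2, 4, 4]
```

A pd.Series can be passed as `strings` directly, since iterating over a Series yields its values.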
