Pandas column search on a DataFrame: complexity and optimisation of a search function in Pandas - CodePudding

Here I'm looking for the row index that holds a given value in a column called "word". Note that df is a DataFrame with many columns, sorted alphabetically on the column "word".
Here is my function:

def getIndex(df,givenword):
    index=df[df['word']==givenword].index.values[0]
    return index

The problem is that df is quite big (around 10000k rows), and this function is called in a loop over 30000 given words. The search performance is awful. Could you suggest a better implementation to optimize my function?

CodePudding user response：

I suggest idxmax:

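def getIndex(df, givenword):
    # Call idxmax on the boolean mask itself, not on the filtered DataFrame,
    # which would be evaluated column by column.
    index = (df['word'] == givenword).idxmax()
    return index

idxmax returns the index of the first occurrence of the maximum value, which here is True. If no row matches, the mask is all False and idxmax still returns the first index label. Each call is still a linear scan of the column.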
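
A guarded variant of the idxmax approach might look like the sketch below. It calls idxmax on the boolean mask and checks for a match first. The name get_index and the choice to return None for a missing word are illustrative assumptions, not part of the question.

```python
import pandas as pd

def get_index(df, given_word):
    """Return the index label of the first row whose 'word' equals given_word, or None."""
    mask = df["word"] == given_word
    # Without this check, idxmax on an all-False mask returns the first label.
    if not mask.any():
        return None
    return mask.idxmax()

df = pd.DataFrame({"word": ["fox", "hello", "jump", "world"]})
hit = get_index(df, "jump")
miss = get_index(df, "zebra")
```

This fixes correctness only; every call still scans the whole column.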

CodePudding user response：

If the DataFrame is sorted alphabetically, use searchsorted; see the toy example below:

import pandas as pd

ser = pd.Series(["fox", "hello", "jump", "world"])
res = ser.searchsorted("jump")
print(res)

Output

2

You can even pass the whole list of words, as below:

res = ser.searchsorted(["fox", "hello"])
print(res)

Output

[0 1]

The time complexity of this approach is O(m log n), where m is the number of words being searched and n is the number of rows in the DataFrame. Note: you need to check that the word at each returned position actually matches, because searchsorted returns the position where the word would be inserted, whether or not it is present.
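
The membership check mentioned in the note could be sketched as follows, assuming the column "word" is sorted and that a missing word should map to None. The helper name get_indices and the sample data are hypothetical.

```python
import pandas as pd

def get_indices(df, words):
    """Return, for each word, the index label of its first row, or None if absent."""
    col = df["word"]
    values = col.to_numpy()
    # Leftmost insertion position of each word in the sorted column.
    positions = col.searchsorted(words, side="left")
    result = []
    for word, pos in zip(words, positions):
        # A position is a real hit only if it is in range and the value matches.
        if pos < len(values) and values[pos] == word:
            result.append(df.index[pos])
        else:
            result.append(None)
    return result

df = pd.DataFrame({"word": ["fox", "hello", "jump", "jump", "world"], "n": range(5)})
found = get_indices(df, ["jump", "zebra", "apple", "fox"])
```

With side="left", duplicated words resolve to their first occurrence, which matches what the original function returns.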

An alternative would be to create a dictionary mapping words to first occurrence, and then search for the words:

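# Iterate in reverse so that earlier occurrences overwrite later ones,
# leaving each word mapped to the index of its first occurrence.
lookup = {key: value for key, value in zip(ser.values[::-1], ser.index[::-1])}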
words = ["jump", "world"]

res = [lookup[word] for word in words]
print(res)

Output

[2, 3]

The time complexity of this approach is O(n + m): building the dictionary takes O(n), and each of the m lookups takes O(1) on average.
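
Applied to a DataFrame, the dictionary approach might look like the sketch below. It assumes that words absent from the column should map to None; the sample data and the use of drop_duplicates to keep first occurrences are illustrative choices, not from the answer above.

```python
import pandas as pd

df = pd.DataFrame({"word": ["fox", "hello", "jump", "jump", "world"]})

# drop_duplicates keeps the first occurrence of each word by default,
# so each word maps to the index label of its first row.
first = df["word"].drop_duplicates()
lookup = dict(zip(first.to_numpy(), first.index))

words = ["jump", "world", "zebra"]
# dict.get returns None for words that are not in the column.
res = [lookup.get(word) for word in words]
```

The dictionary is built once, outside the loop over the given words, so its O(n) cost is not repeated per lookup.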