I'm looking for the index of the row that has a given value in a column called "word". Note that df is a DataFrame with many columns, sorted alphabetically on the column "word".
Here is my function:
def getIndex(df,givenword):
    index=df[df['word']==givenword].index.values[0]
    return index
The problem is that df is quite big (around 10000k rows), and this function is called in a loop over 30000 given words. The search performance is awful. Could you suggest a better implementation to optimize my function?
CodePudding user response:
I suggest idxmax:
def getIndex(df,givenword):
    # idxmax on the boolean mask returns the label of the first True
    index = (df['word'] == givenword).idxmax()
    return index
idxmax would give the index of the first occurrence of the maximum value, which in this case is True.
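One caveat worth sketching: if the word is absent, the boolean mask is all False and idxmax still returns the first label. The variant below guards against that. The helper name get_index_or_none and the toy DataFrame are only illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (hypothetical values)
df = pd.DataFrame({"word": ["fox", "hello", "jump", "world"], "count": [3, 1, 4, 1]})

def get_index_or_none(df, givenword):
    # Boolean mask: True where the word matches
    mask = df["word"] == givenword
    # On an all-False mask idxmax would return the first label, so check first
    if not mask.any():
        return None
    # Label of the first True in the mask
    return mask.idxmax()

print(get_index_or_none(df, "jump"))
print(get_index_or_none(df, "zebra"))
```

Note that this still scans the whole column on every call, so it probably does not solve the performance problem by itself.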
CodePudding user response:
If the DataFrame is sorted alphabetically, use searchsorted; see the toy example below:
import pandas as pd
ser = pd.Series(["fox", "hello", "jump", "world"])
res = ser.searchsorted("jump")
print(res)
Output
2
You can even pass the whole list of words, as below:
res = ser.searchsorted(["fox", "hello"])
print(res)
Output
[0 1]
The time complexity of this approach is O(m log n), where m is the number of words being searched and n is the size of the DataFrame. Note: you need to check that each word actually occurs at the returned position, as searchsorted returns the position where the word would need to be inserted.
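The check mentioned in the note could look roughly like this. It is a minimal sketch that assumes the "word" column is sorted by plain string comparison (the same ordering searchsorted uses); the toy DataFrame, the word list, and the choice of None for missing words are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real DataFrame (hypothetical values)
df = pd.DataFrame({"word": ["fox", "hello", "jump", "world"]})
words = ["hello", "zebra", "apple", "world"]

values = df["word"].to_numpy(dtype=object)
# Insertion positions for all words in one vectorized call
pos = df["word"].searchsorted(words)
# A position equal to len(df) means "insert at the end"; clip it so indexing is safe
safe = np.clip(pos, 0, len(values) - 1)
# A word is really present only if the value at its position equals the word
found = values[safe] == np.array(words, dtype=object)
labels = df.index.to_numpy()[safe]
result = [int(label) if ok else None for label, ok in zip(labels, found)]
print(result)
```

Words that are not in the column come back as None instead of a misleading insertion position.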
An alternative would be to create a dictionary mapping words to first occurrence, and then search for the words:
lookup = {key: value for key, value in zip(ser.values[::-1], ser.index[::-1])}
words = ["jump", "world"]
res = [lookup[word] for word in words]
print(res)
Output
[2, 3]
The time complexity of this approach is O(n + m): building the dictionary takes O(n), and each of the m lookups is O(1) on average.
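Applied to a DataFrame like the one in the question, the dictionary approach might be sketched as follows. The toy data and the name givenwords are assumptions; the point is that the lookup is built once, outside the loop, and dict.get returns None for words that are missing.

```python
import pandas as pd

# Toy data with a duplicate word (hypothetical values)
df = pd.DataFrame({"word": ["fox", "hello", "hello", "jump"]})

# Build once, outside the loop. Iterating in reverse means the first
# occurrence of each word is written last and therefore wins.
lookup = dict(zip(df["word"].iloc[::-1].tolist(), df.index[::-1].tolist()))

givenwords = ["hello", "jump", "zebra"]
# dict.get returns None when the word is not in the DataFrame
res = [lookup.get(word) for word in givenwords]
print(res)
```

This trades extra memory for the dictionary against constant-time lookups, and it does not require the column to be sorted.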