Pandas column search on a DataFrame: complexity and optimisation of a search function

Time:10-21

I'm looking for the index of the row that has a given value in a column called "word". Note that df is a DataFrame with many columns, sorted alphabetically on the "word" column.
Here is my function:

def getIndex(df,givenword):
    index=df[df['word']==givenword].index.values[0]
    return index

The problem is that df is quite big (around 10000k rows) and this function is called in a loop over 30000 values of givenword. The search performance is awful. Could you suggest a better implementation to optimize my function?

CodePudding user response:

I suggest idxmax:

def getIndex(df, givenword):
    # Call idxmax on the boolean mask itself, not on the filtered DataFrame
    index = (df['word'] == givenword).idxmax()
    return index

idxmax returns the index of the first occurrence of the maximum value, which for a boolean mask is the first True.
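
One caveat with idxmax on a boolean mask: when no row matches, the mask is all False and idxmax still returns the first index label. A minimal sketch of a guarded variant follows; the name get_index_or_none and the toy DataFrame are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd


def get_index_or_none(df, givenword):
    """Return the index label of the first matching row, or None if absent."""
    mask = df["word"] == givenword
    # Without this check, an all-False mask would yield the first label.
    if not mask.any():
        return None
    return mask.idxmax()


# Toy data, assumed for illustration only.
df = pd.DataFrame({"word": ["fox", "hello", "jump", "world"]})
print(get_index_or_none(df, "jump"))
print(get_index_or_none(df, "cat"))
```

This is still a full scan of the column for every word, so it does not take advantage of the sort order.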

CodePudding user response:

If the DataFrame is sorted alphabetically, use searchsorted. See the toy example below:

import pandas as pd

ser = pd.Series(["fox", "hello", "jump", "world"])
res = ser.searchsorted("jump")
print(res)

Output

2

You can even pass the whole list of words, as below:

res = ser.searchsorted(["fox", "hello"])
print(res)

Output

[0 1]

The time complexity of this approach is O(m log n), where m is the number of words being searched and n is the number of rows in the DataFrame. Note: searchsorted returns the position where the word would be inserted to keep the order, so you need to check that the value at that position actually matches the word. The result is also a position, not an index label.
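
The check mentioned in the note can be sketched as follows, applied to a DataFrame rather than a Series. The function name get_indices and the choice of None for missing words are assumptions made for illustration; the sketch also assumes df is sorted ascending on "word".

```python
import pandas as pd


def get_indices(df, words):
    """Return the index label of the first row matching each word, or None.

    Assumes df is sorted ascending on the 'word' column.
    """
    col = df["word"].to_numpy()
    # side="left" gives the leftmost insertion point, i.e. the first match.
    positions = df["word"].searchsorted(list(words), side="left")
    result = []
    for word, pos in zip(words, positions):
        # A position equal to len(col) means the word sorts after every row.
        if pos < len(col) and col[pos] == word:
            result.append(df.index[pos])  # convert position to index label
        else:
            result.append(None)
    return result


# Toy data, assumed for illustration only.
df = pd.DataFrame({"word": ["fox", "hello", "jump", "world"], "n": [1, 2, 3, 4]})
res = get_indices(df, ["jump", "cat", "zebra", "fox"])
print(res)
```

Here "cat" and "zebra" are not in the column: the first would be inserted at position 0 and the second past the end, so both come back as None.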

An alternative would be to create a dictionary mapping words to first occurrence, and then search for the words:

lookup = {key: value for key, value in zip(ser.values[::-1], ser.index[::-1])}
words = ["jump", "world"]

res = [lookup[word] for word in words]
print(res)

Output

[2, 3]

The time complexity of this approach is O(n + m): O(n) to build the dictionary once, then O(1) on average for each of the m lookups.
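
A possible variant of the dictionary approach for a DataFrame with repeated or missing words is sketched below. The toy data and the use of drop_duplicates and dict.get are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Toy data, assumed for illustration only; "hello" appears twice.
df = pd.DataFrame({"word": ["fox", "hello", "hello", "jump", "world"]})

# Keep only the first row for each word, then map word -> index label.
first = df.drop_duplicates("word", keep="first")
lookup = dict(zip(first["word"].tolist(), first.index.tolist()))

words = ["jump", "hello", "missing"]
# dict.get returns None instead of raising KeyError for absent words.
res = [lookup.get(word) for word in words]
print(res)  # [3, 1, None]
```

Unlike searchsorted, this does not require the column to be sorted, at the cost of holding one dictionary entry per distinct word in memory.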
