How to speed up the loop in a dataframe


I would like to speed up my loop because I have to run it on 900,000 rows.

To simplify, here is a sample.

I would like to add a column named 'Count' that counts, for each row, how many of the same player's scores are below that row's target score. The target changes from row to row. For example, in row 0 felix's target score is 10 and all five of felix's scores (3, 8, 2, 2, 2) are below 10, so Count is 5.

Input:

   index Nom player  score  target score
0      0      felix      3            10
1      1      felix      8             7
2      2       theo      4             5
3      3    patrick     12             6
4      4     sophie      7             6
5      5     sophie      3             6
6      6      felix      2             4
7      7      felix      2             2
8      8      felix      2             3

Result:

   index Nom player  score  target score  Count
0      0      felix      3            10      5
1      1      felix      8             7      4
2      2       theo      4             5      1
3      3    patrick     12             6      0
4      4     sophie      7             6      1
5      5     sophie      3             6      1
6      6      felix      2             4      4
7      7      felix      2             2      0
8      8      felix      2             3      3

Below is the code I currently use. Is it possible to speed it up? I saw some articles about vectorization; can it be applied to my calculation, and if so, how?

df2 = df.copy()
# For each row: count rows with the same player (column 1 of values)
# whose score (column 2) is below this row's target score.
df2['Count'] = [np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
                for row in df.itertuples()]
print(df2)

CodePudding user response:

Jérôme Richard's insights for an O(n log n) solution can be translated to pandas. The speed-up depends on the number and size of the groups in the dataframe.

df2 = df.copy()
gr = df2.groupby('Nom player')
# For each player, a sorted array of their scores.
lookup = gr.score.apply(np.sort).to_dict()
# Binary-search each row's target in the player's sorted scores:
# the insertion point equals the number of scores below the target.
df2['count'] = gr.apply(
    lambda x: pd.Series(
        np.searchsorted(lookup[x.name], x['target score']),
        index=x.index)
    ).droplevel(0)
print(df2)

Output

   index Nom player  score  target score  count
0      0      felix      3            10      5
1      1      felix      8             7      4
2      2       theo      4             5      1
3      3    patrick     12             6      0
4      4     sophie      7             6      1
5      5     sophie      3             6      1
6      6      felix      2             4      4
7      7      felix      2             2      0
8      8      felix      2             3      3
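
The trick here is that np.searchsorted with the default side='left' returns the insertion index of the searched value, which is exactly the number of elements strictly smaller than it. A minimal sketch using felix's sorted scores from the sample:

import numpy as np

scores = np.array([2, 2, 2, 3, 8])  # felix's scores, sorted
print(np.searchsorted(scores, 7))   # 4 -> four scores below row 1's target of 7
print(np.searchsorted(scores, 2))   # 0 -> no score strictly below row 7's target of 2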

CodePudding user response:

There are two main issues in the current code. CPython string objects are slow, especially string comparisons. Moreover, the current algorithm has quadratic complexity: for each row, it compares that row against every other row. The latter is the biggest issue for large dataframes.


Implementation

The first thing to do is to replace the string comparisons with something faster. String objects can be converted to native strings using np.array. Then, the unique strings and their locations can be extracted using np.unique. This basically replaces the string-matching problem with an integer-matching problem. Comparing native integers is generally significantly faster, mainly because processors handle them well and Numpy can use efficient SIMD instructions to compare integers. Here is how to convert the string column to label indices:

# 0.065 ms
labels, labelIds = np.unique(np.array(df.values[:,1], dtype='U'), return_inverse=True)
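
For the sample input, this yields (a sketch of the intermediate values):

print(labels)    # ['felix' 'patrick' 'sophie' 'theo']
print(labelIds)  # [0 0 3 1 2 2 0 0 0]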

Now, we can efficiently group the scores by label (player name). The thing is, Numpy does not provide any group-by function. While it is possible to do this efficiently using multiple np.argsort calls, a basic pure-Python dict-based approach turns out to be pretty fast in practice. Here is the code grouping the scores by label and sorting the scores associated with each label (useful for the next step):

# 0.014 ms

from collections import defaultdict

scoreByGroups = defaultdict(list)

labelIdsList = labelIds.tolist()
scoresList = df['score'].tolist()
targetScoresList = df['target score'].tolist()

for labelId, score in zip(labelIdsList, scoresList):
    scoreByGroups[labelId].append(score)

for labelId, scoreGroup in scoreByGroups.items():
    scoreByGroups[labelId] = np.sort(np.array(scoreGroup, np.int32))
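
For the sample input, the resulting mapping looks like this (label 0 is felix, 1 patrick, 2 sophie, 3 theo; int32 dtypes omitted):

print(dict(scoreByGroups))
# {0: array([2, 2, 2, 3, 8]), 3: array([4]), 1: array([12]), 2: array([3, 7])}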

scoreByGroups can now be used to efficiently find the number of scores smaller than a given one for a given label. One just needs to read scoreByGroups[label] (constant time) and then do a binary search on the resulting array (O(log n)). Here is how:

# 0.014 ms
counts = [np.searchsorted(scoreByGroups[labelId], score)
          for labelId, score in zip(labelIdsList, targetScoresList)]

# Copies are slow, but adding a new column seems even slower
# 0.212 ms
df2 = df.copy()
df2['Count'] = np.fromiter(counts, np.int32)

Results

The resulting code takes 0.305 ms on my machine on the example input, while the initial code takes 1.35 ms. This means this implementation is about 4.5 times faster. Two thirds of the time is unfortunately spent in the creation of the new dataframe with the new column. Note that the provided code should be much faster than the initial code on large dataframes thanks to an O(n log n) complexity instead of an O(n²) one.
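
For reference, here is a sketch assembling the steps above into a single function (the name add_count_fast is ours):

import numpy as np
from collections import defaultdict

def add_count_fast(df):
    # Replace player names with integer label indices.
    labels, labelIds = np.unique(np.array(df.values[:, 1], dtype='U'),
                                 return_inverse=True)
    labelIdsList = labelIds.tolist()
    scoresList = df['score'].tolist()
    targetScoresList = df['target score'].tolist()

    # Group the scores by label and sort each group.
    scoreByGroups = defaultdict(list)
    for labelId, score in zip(labelIdsList, scoresList):
        scoreByGroups[labelId].append(score)
    for labelId, scoreGroup in scoreByGroups.items():
        scoreByGroups[labelId] = np.sort(np.array(scoreGroup, np.int32))

    # Binary-search each row's target in its group's sorted scores.
    counts = [np.searchsorted(scoreByGroups[labelId], score)
              for labelId, score in zip(labelIdsList, targetScoresList)]
    df2 = df.copy()
    df2['Count'] = np.fromiter(counts, np.int32)
    return df2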

CodePudding user response:

You can try:

# For each player group, count the group's scores below each row's target.
df['Count'] = df.groupby("Nom player").apply(
    lambda x: pd.Series((sum(x["score"] < s) for s in x["target score"]), index=x.index)
).droplevel(0)

print(df)

Prints:

   index Nom player  score  target score  Count
0      0      felix      3            10      5
1      1      felix      8             7      4
2      2       theo      4             5      1
3      3    patrick     12             6      0
4      4     sophie      7             6      1
5      5     sophie      3             6      1
6      6      felix      2             4      4
7      7      felix      2             2      0
8      8      felix      2             3      3

EDIT: Quick benchmark:

import numpy as np
import pandas as pd
from timeit import timeit


def add_count1(df):
    df["Count"] = (
        df.groupby("Nom player")
        .apply(
            lambda x: pd.Series(
                ((x["score"] < s).sum() for s in x["target score"]), index=x.index
            )
        )
        .droplevel(0)
    )


def add_count2(df):
    df["Count"] = [
        np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
        for row in df.itertuples()
    ]


def add_count3(df):
    gr = df.groupby('Nom player')
    lookup = gr.score.apply(lambda x: np.sort(np.array(x))).to_dict()
    df['count'] = gr.apply(
        lambda x: pd.Series(
            np.searchsorted(lookup[x.name], x['target score']),
            index=x.index)
        ).droplevel(0)


df = pd.concat([df] * 1000).reset_index(drop=True)  # the 9-row sample repeated: len=9000


t1 = timeit("add_count1(x)", "x=df.copy()", number=1, globals=globals())
t2 = timeit("add_count2(x)", "x=df.copy()", number=1, globals=globals())
t3 = timeit("add_count3(x)", "x=df.copy()", number=1, globals=globals())

print(t1)
print(t2)
print(t3)

Prints on my computer:

0.7540620159707032
6.63946107000811
0.004106967011466622

So my answer should be faster than the original, but Michael Szczesny's answer is the fastest: it calls np.searchsorted once per group on the whole 'target score' column instead of looping over the rows in Python.
