I would like to speed up my loop because I have to run it on 900,000 rows of data. To simplify, I show you a sample.
I would like to add a column named 'Count' that counts, for each row, how many times the same player's score was below that row's target score. The target changes for every row.
Input:
   index Nom player  score  target score
0      0      felix      3            10
1      1      felix      8             7
2      2       theo      4             5
3      3    patrick     12             6
4      4     sophie      7             6
5      5     sophie      3             6
6      6      felix      2             4
7      7      felix      2             2
8      8      felix      2             3
Result:
   index Nom player  score  target score  Count
0      0      felix      3            10      5
1      1      felix      8             7      4
2      2       theo      4             5      1
3      3    patrick     12             6      0
4      4     sophie      7             6      1
5      5     sophie      3             6      1
6      6      felix      2             4      4
7      7      felix      2             2      0
8      8      felix      2             3      3
Below is the code I currently use, but is it possible to speed it up? I saw some articles about vectorization; can it be applied to my calculation, and if so, how?
df2 = df.copy()
df2['Count'] = [np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
                for row in df.itertuples()]
print(df2)
CodePudding user response:
Jérôme Richard's insights for an O(n log n) solution can be translated to pandas. The speed-up depends on the number and size of the groups in the dataframe.
df2 = df.copy()
gr = df2.groupby('Nom player')
lookup = gr.score.apply(np.sort).to_dict()
df2['count'] = gr.apply(
    lambda x: pd.Series(
        np.searchsorted(lookup[x.name], x['target score']),
        index=x.index)
).droplevel(0)
print(df2)
Output
   index Nom player  score  target score  count
0      0      felix      3            10      5
1      1      felix      8             7      4
2      2       theo      4             5      1
3      3    patrick     12             6      0
4      4     sophie      7             6      1
5      5     sophie      3             6      1
6      6      felix      2             4      4
7      7      felix      2             2      0
8      8      felix      2             3      3
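The core trick is np.searchsorted: with the default side='left' it returns, for each query value, the number of sorted values strictly smaller than it. A minimal sketch of that behaviour, using felix's scores and targets from the sample:

import numpy as np

# felix's scores from the sample, sorted: [2, 2, 2, 3, 8]
felix_scores = np.sort(np.array([3, 8, 2, 2, 2]))

# side='left' (the default) gives the count of values strictly below each target;
# for felix's targets this prints [5 4 4 0 3], matching the felix rows above
print(np.searchsorted(felix_scores, [10, 7, 4, 2, 3]))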
CodePudding user response:
There are two main issues in the current code. CPython string objects are slow, especially string comparison. Moreover, the current algorithm has quadratic complexity: for each row, it scans and compares all the rows of the dataframe. The latter is the biggest issue for large dataframes.
Implementation
The first thing to do is to replace the string comparison with something faster. String objects can be converted to native strings using np.array. Then, the unique strings and their locations can be extracted using np.unique. This basically lets us replace the string-matching problem with an integer-matching problem. Comparing native integers is generally significantly faster, mainly because processors handle them well and NumPy can use efficient SIMD instructions to compare them. Here is how to convert the string column to label indices:
# 0.065 ms
labels, labelIds = np.unique(np.array(df.values[:,1], dtype='U'), return_inverse=True)
Now we can group the scores by label (player name) efficiently. The thing is that NumPy does not provide any group-by function. While it is possible to do this efficiently using multiple np.argsort calls, a basic pure-Python dict-based approach turns out to be pretty fast in practice. Here is the code grouping the scores by label and sorting the array of scores associated with each label (useful for the next step):
# 0.014 ms
from collections import defaultdict

scoreByGroups = defaultdict(lambda: [])
labelIdsList = labelIds.tolist()
scoresList = df['score'].tolist()
targetScoresList = df['target score'].tolist()

for labelId, score in zip(labelIdsList, scoresList):
    scoreByGroups[labelId].append(score)

for labelId, scoreGroup in scoreByGroups.items():
    scoreByGroups[labelId] = np.sort(np.array(scoreGroup, np.int32))
scoreByGroups can now be used to efficiently find the number of scores smaller than a given one for a given label. One just needs to read scoreByGroups[label] (constant time) and then do a binary search on the resulting array (O(log n)). Here is how:
# 0.014 ms
counts = [np.searchsorted(scoreByGroups[labelId], score)
          for labelId, score in zip(labelIdsList, targetScoresList)]
# Copies are slow, but adding a new column seems even slower
# 0.212 ms
df2 = df.copy()
df2['Count'] = np.fromiter(counts, np.int32)
Results
The resulting code takes 0.305 ms on my machine on the example input, while the initial code takes 1.35 ms. This means this implementation is about 4.5 times faster. Unfortunately, 2/3 of the time is spent in the creation of the new dataframe with the new column. Note that the provided code should be much faster than the initial code on a large dataframe thanks to its O(n log n) complexity instead of an O(n²) one.
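For convenience, here is a sketch that assembles the snippets above into a single function (the name add_count_labels is only illustrative, not part of the original code):

import numpy as np
import pandas as pd
from collections import defaultdict

def add_count_labels(df):
    # Convert player names to integer label ids (fast to compare)
    labels, labelIds = np.unique(np.array(df.values[:, 1], dtype='U'),
                                 return_inverse=True)

    # Group the scores by label id and sort each group
    scoreByGroups = defaultdict(list)
    labelIdsList = labelIds.tolist()
    scoresList = df['score'].tolist()
    targetScoresList = df['target score'].tolist()
    for labelId, score in zip(labelIdsList, scoresList):
        scoreByGroups[labelId].append(score)
    for labelId, scoreGroup in scoreByGroups.items():
        scoreByGroups[labelId] = np.sort(np.array(scoreGroup, np.int32))

    # Binary search: number of scores strictly below each row's target
    counts = [np.searchsorted(scoreByGroups[labelId], score)
              for labelId, score in zip(labelIdsList, targetScoresList)]

    df2 = df.copy()
    df2['Count'] = np.fromiter(counts, np.int32)
    return df2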
CodePudding user response:
You can try:
df['Count'] = df.groupby("Nom player").apply(
    lambda x: pd.Series((sum(x["score"] < s) for s in x["target score"]), index=x.index)
).droplevel(0)
print(df)
Prints:
   index Nom player  score  target score  Count
0      0      felix      3            10      5
1      1      felix      8             7      4
2      2       theo      4             5      1
3      3    patrick     12             6      0
4      4     sophie      7             6      1
5      5     sophie      3             6      1
6      6      felix      2             4      4
7      7      felix      2             2      0
8      8      felix      2             3      3
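This answer and Michael Szczesny's one rely on the same groupby/apply/droplevel(0) alignment trick: apply returns one Series per group indexed by the group's original row labels, pandas stacks them under a (group key, original index) MultiIndex, and droplevel(0) removes the group key so the result aligns with the original index again. A tiny sketch on an illustrative frame (df_demo is not from the question):

import pandas as pd

df_demo = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})

res = df_demo.groupby("g").apply(
    lambda x: pd.Series(x["v"].to_numpy() * 10, index=x.index)
)
print(res.index)         # MultiIndex of (group key, original row label)
print(res.droplevel(0))  # group key dropped -> plain 0..2 index, assignable as a column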
EDIT: Quick benchmark:
from timeit import timeit


def add_count1(df):
    df["Count"] = (
        df.groupby("Nom player")
        .apply(
            lambda x: pd.Series(
                ((x["score"] < s).sum() for s in x["target score"]), index=x.index
            )
        )
        .droplevel(0)
    )


def add_count2(df):
    df["Count"] = [
        np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
        for row in df.itertuples()
    ]


def add_count3(df):
    gr = df.groupby('Nom player')
    lookup = gr.score.apply(lambda x: np.sort(np.array(x))).to_dict()
    df['count'] = gr.apply(
        lambda x: pd.Series(
            np.searchsorted(lookup[x.name], x['target score']),
            index=x.index)
    ).droplevel(0)


df = pd.concat([df] * 1000).reset_index(drop=True)  # DataFrame of len=9000

t1 = timeit("add_count1(x)", "x=df.copy()", number=1, globals=globals())
t2 = timeit("add_count2(x)", "x=df.copy()", number=1, globals=globals())
t3 = timeit("add_count3(x)", "x=df.copy()", number=1, globals=globals())

print(t1)
print(t2)
print(t3)
Prints on my computer:
0.7540620159707032
6.63946107000811
0.004106967011466622
So my answer should be faster than the original, but Michael Szczesny's answer is the fastest.
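For completeness, the pure-NumPy label/searchsorted approach could be dropped into the same harness; this is only a sketch (add_count4 is an illustrative name, it reuses the hypothetical add_count_labels helper sketched in the answer above, and no timing number is claimed here):

def add_count4(df):
    # In-place wrapper around the label/searchsorted sketch from the other answer
    df["Count"] = add_count_labels(df)["Count"]

t4 = timeit("add_count4(x)", "x=df.copy()", number=1, globals=globals())
print(t4)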