A while ago I wrote a Python script to process data for several million users. One of the requirements was to count the number of distinct values in each user's data. To speed things up I turned on multiprocessing, but the program ran for nearly a week without finishing. At that point I knew something was wrong and started looking for the performance bottleneck.
For the distinct count, I loaded each user's data into a list and used len(set(data)) to get the count. At first I assumed the volume was modest and that no single user had much data, so I overlooked the fact that some users have tens of thousands of records, which costs a lot of time. (In fact, most of the time in my script was spent pulling large amounts of data from a remote Redis instance: the calls blocked for a long time and even hit connection timeouts. I eventually switched to a divide-and-conquer approach, fetching a small batch of data at a time, which greatly improved performance; a rough sketch of these chunked reads follows below.)
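For reference, here is a minimal sketch of that chunked fetching, not the exact code from my script: it assumes the per-user data is stored in a Redis list under a hypothetical key such as user:<id>:data and uses the redis-py client.

import redis

CHUNK_SIZE = 1000  # items fetched per round trip

def fetch_user_data(client, key):
    # Read the list in small slices instead of one huge LRANGE,
    # so each call returns quickly and avoids connection timeouts.
    items = []
    total = client.llen(key)
    for start in xrange(0, total, CHUNK_SIZE):
        items.extend(client.lrange(key, start, start + CHUNK_SIZE - 1))
    return items

client = redis.Redis(host="localhost", port=6379)
data = fetch_user_data(client, "user:12345:data")  # hypothetical key layout
print len(set(data))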
Back to the distinct count itself. Looking for a more efficient method, I found that many people believe using a dictionary is faster, namely:
data_unique = {}.fromkeys(data).keys()
len(data_unique)
So I ran a test:
In [1]: import random
In [2]: data = [random.randint(0, 1000) for _ in xrange(1000000)]
In [3]: %timeit len(set(data))
10 loops, best of 3: 39.7 ms per loop
In [4]: %timeit len({}.fromkeys(data).keys())
10 loops, best of 3: 43.5 ms per loop
So the dictionary and the set perform about the same, and the dictionary may even be slightly slower. That is not surprising: both are hash-based, and {}.fromkeys(data) has to build a full mapping from every value to None, which is no less work than building a set of the values directly.
Python does have highly efficient libraries, though: numpy and pandas, for example, process data with performance close to C. So we can use numpy and pandas to solve this problem. Here I also compare the performance of counting how many times each value occurs; the code is as follows:
import collections
import random as py_random
import timeit

import numpy.random as np_random
import pandas as pd

DATA_SIZE = 10000000

def py_cal_len():
    # distinct count with native Python: build a list, dedupe with set()
    data = [py_random.randint(0, 1000) for _ in xrange(DATA_SIZE)]
    len(set(data))

def pd_cal_len():
    # distinct count with numpy/pandas: value_counts(), then its size
    data = np_random.randint(1000, size=DATA_SIZE)
    data = pd.Series(data)
    data_unique = data.value_counts()
    data_unique.size

def py_count():
    # frequency count with native Python: collections.Counter
    data = [py_random.randint(0, 1000) for _ in xrange(DATA_SIZE)]
    collections.Counter(data)

def pd_count():
    # frequency count with pandas: value_counts()
    data = np_random.randint(1000, size=DATA_SIZE)
    data = pd.Series(data)
    data.value_counts()

# Script starts from here
if __name__ == "__main__":
    t1 = timeit.Timer("py_cal_len()", setup="from __main__ import py_cal_len")
    t2 = timeit.Timer("pd_cal_len()", setup="from __main__ import pd_cal_len")
    t3 = timeit.Timer("py_count()", setup="from __main__ import py_count")
    t4 = timeit.Timer("pd_count()", setup="from __main__ import pd_count")
    print t1.timeit(number=1)
    print t2.timeit(number=1)
    print t3.timeit(number=1)
    print t4.timeit(number=1)
Results (t1 through t4, i.e. py_cal_len, pd_cal_len, py_count, pd_count):
12.438587904
0.435907125473
14.6431810856
0.258564949036
Counting distinct values and counting occurrences with pandas is more than ten times faster than with the native Python functions.
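As a closing note, for a plain distinct count there are even more direct calls: numpy's unique() and pandas' Series.nunique(). A minimal sketch (these are standard numpy/pandas APIs; the data here is only illustrative, not a re-run of the benchmark above):

import numpy as np
import pandas as pd

data = np.random.randint(1000, size=1000000)

# distinct count directly on the numpy array
print np.unique(data).size

# distinct count and frequency count via a pandas Series
s = pd.Series(data)
print s.nunique()
print s.value_counts().head()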