I'm trying to do a pretty simple task in Python that I have already done in Julia. It consists of taking a list of 3-element items and building a dictionary that maps each unique value to the list of indexes where it appears (note the list is 6,000,000 elements long). I have done this in Julia and it is reasonably fast (6 seconds) - here is the code:
function unique_ids(itr)
    # create a dictionary whose keys have the element type of itr
    d = Dict{eltype(itr), Vector}()
    # iterate through values in itr
    for (index, val) in enumerate(itr)
        # check if the dictionary already has this key
        if haskey(d, val)
            push!(d[val], index)
        else
            # add the value of itr if it is not in d yet
            d[val] = [index]
        end
    end
    return collect(values(d))
end
So far so good. However, when I try doing this in Python, it takes so long that I gave up waiting for it to finish. So the question is: am I doing something dumb here, or is this just the reality of the difference between these two languages? Here is my Python code, a direct translation of the Julia code.
from tqdm import tqdm

def unique_ids(row_list):
    d = {}
    for (index, val) in tqdm(enumerate(row_list)):
        if str(val) in d:
            d[str(val)].extend([index])
        else:
            d[str(val)] = [index]
    return list(d.values())
Note that I use strings for the keys of the dict in Python, as it is not possible to have a list (a mutable array) as a key in Python.
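A minimal sketch of the constraint: mutable lists are unhashable and cannot be dict keys, while immutable tuples work fine, which is what the answers below rely on.

d = {}
d[(1, 2, 3)] = "ok"        # tuples are immutable and hashable
try:
    d[[1, 2, 3]] = "boom"  # lists are mutable and unhashable
except TypeError as e:
    print(e)               # prints: unhashable type: 'list'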
CodePudding user response:
Not really; proper Python code that uses Python's built-in containers takes roughly 5 seconds ... or 6 seconds if you increase rand_max:
import random
import time
from collections import defaultdict

input_list_len = 6_000_000
rand_max = 100
row_list = []
for _ in range(input_list_len):
    row_list.append(tuple(random.randint(0, rand_max) for x in range(3)))

def unique_ids(row_list):
    d = defaultdict(list)
    for (index, val) in enumerate(row_list):
        d[val].append(index)
    return list(d.values())

t1 = time.time()
output = unique_ids(row_list)
t2 = time.time()
print(f"total time = {t2-t1}")
The thing is, Julia is a JIT-compiled language, so the compiler can optimize the code for speed by knowing the objects' types. Python is an interpreted language: you have to know the correct functions and containers to use so that the hot path runs in C, otherwise your code will be running in a slow op-code interpreter.
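To make the container point concrete, here is a micro-benchmark sketch (exact numbers vary by machine) of what the str(val) keys in the question cost on every single lookup:

import timeit

key = (3, 1, 4)
d = {key: [0], str(key): [0]}

# tuple key: hash the existing tuple, nothing else
print(timeit.timeit(lambda: d[key], number=1_000_000))
# string key: build a brand-new string, then hash it, on every lookup
print(timeit.timeit(lambda: d[str(key)], number=1_000_000))

And the question's code does that str(val) conversion twice per element: once for the membership check and once for the indexing.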
Edit: thanks to @qwr for the reminder, it is CPython that we are benchmarking. This implementation of Python relies heavily on C code for speedup, which makes it the most extensible version of Python. Other implementations that use a JIT compiler, like PyPy (and Julia), are not as extensible as CPython, but get faster code by compiling their op-codes to machine code at runtime; CPython trades the speed of its op-code interpreter for easy extension with faster C code.
Edit 2: the real power of CPython is that, assuming your app is built around this operation, you can write the operation in C/C++ and add it to CPython very easily, and it will often be faster than Julia at this operation. You also get access to many Python modules that are themselves written in C/C++ to speed up your development; in the end CPython is usually used as a high-level glue around a wide range of C/C++ modules.
CodePudding user response:
I think the bottom line is that this type of function can definitely run in Python in less than 6 seconds.
I think the main issue, as many people have pointed out, is tqdm and using strings as the dictionary keys. If you take these out it gets a lot faster. Interestingly, swapping to collections.defaultdict really helps as well. If I get a moment I will write the equivalent function in C using the Python C API and append those results.
I have included the test code below. For 1 test iteration with 6,000,000 elements I get 4.9 secs at best in Python; this is with an i9-9980XE processor, and I don't know what your test machine is. Note the key type is critical: if I swap the tuple for an int the time drops to 1.48 secs, so how the input data is presented makes a huge difference.
| Method | Time (s) | Relative |
|---|---|---|
| Original | 16.01739319995977 | 10.804717667865178 |
| Removing tqdm | 12.82462279999163 | 8.650997501336496 |
| Removing to string and just using tuple | 5.3935559000237845 | 3.6382854561971936 |
| Removing double dictionary lookup | 4.682285099988803 | 3.1584895191283664 |
| Using collections defaultdict | 4.493273599946406 | 3.0309896277014063 |
| Using defaultdict with int key | 1.4824443999677896 | 1.0 |
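The int-key timing used data2 below (just the first element of each tuple). If your real data is small bounded non-negative integers, you can keep the full information and still get an int key by packing each 3-tuple into one integer; a sketch, assuming every component lies in the 0..rand_max range (pack_key is my own helper, not part of the benchmark):

from collections import defaultdict

def pack_key(val, base=101):  # base must exceed the largest component (rand_max = 100)
    # bijective packing of a 3-tuple of ints in [0, base) into a single int
    return (val[0] * base + val[1]) * base + val[2]

def unique_ids_packed(row_list):
    d = defaultdict(list)
    for index, val in enumerate(row_list):
        d[pack_key(val)].append(index)
    return list(d.values())

Whether this beats tuple keys in practice depends on the Python-level packing cost, so measure before committing to it.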
Looping over a smaller dataset (1,000,000 elements) but with more iterations (100), I get a narrower gap:
| Method | Time (s) | Relative |
|---|---|---|
| Original | 253.63316379999742 | 4.078280268213264 |
| Removing tqdm | 195.89607029996114 | 3.1498999032904 |
| Removing to string and just using tuple | 69.18050129996845 | 1.1123840004584065 |
| Removing double dictionary lookup | 68.65376710001146 | 1.1039144073575153 |
| Using collections defaultdict | 62.19120489998022 | 1.0 |
The Julia benchmarks do look very interesting. I haven't had a chance to look at them in detail, but I do wonder how much the Python benchmarks could leverage libraries like numpy and scipy.
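For what it's worth, here is a hedged numpy sketch of the same grouping (unique_ids_np is my own name; note it returns the groups ordered by sorted unique row rather than by first appearance, unlike the dict versions):

import numpy as np

def unique_ids_np(rows):
    arr = np.asarray(rows)  # shape (N, 3) integer array
    # inverse[i] is the id of the unique row equal to rows[i]
    _, inverse = np.unique(arr, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guard against inverse-shape differences across numpy versions
    order = np.argsort(inverse, kind="stable")  # original indices, grouped by unique row
    boundaries = np.flatnonzero(np.diff(inverse[order])) + 1
    return np.split(order, boundaries)  # one array of indices per unique row

All the real work happens inside numpy's C routines, which is exactly the C-backed-module point from the first answer.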
With this test code:
from tqdm import tqdm
import timeit, random
from collections import defaultdict

random.seed(1)
rand_max = 100
data_size = 1000000
iterations = 100
data = [tuple(random.randint(0, rand_max) for i in range(3)) for j in range(data_size)]
data2 = [t[0] for t in data]  # int keys for the "defaultdict with int key" run

def method0(row_list):
    d = {}
    for (index, val) in tqdm(enumerate(row_list)):
        if str(val) in d:
            d[str(val)].extend([index])
        else:
            d[str(val)] = [index]
    return list(d.values())

def method1(row_list):
    d = {}
    for index, val in enumerate(row_list):
        if str(val) in d:
            d[str(val)].extend([index])
        else:
            d[str(val)] = [index]
    return list(d.values())

def method2(row_list):
    d = {}
    for index, val in enumerate(row_list):
        if val in d:
            d[val].extend([index])
        else:
            d[val] = [index]
    return list(d.values())

def method3(row_list):
    d = {}
    for index, val in enumerate(row_list):
        # walrus operator: one dict lookup instead of two
        if (l := d.get(val)):
            l.append(index)
        else:
            d[val] = [index]
    return d.values()

def method4(row_list):
    d = defaultdict(list)
    for (index, val) in enumerate(row_list):
        d[val].append(index)
    return list(d.values())

assert (m0 := method0(data)) == method1(data)
assert m0 == method2(data)
assert (m0 := sorted(m0)) == sorted(method3(data))
assert m0 == sorted(method4(data))

t0 = timeit.timeit(lambda: method0(data), number=iterations)
t1 = timeit.timeit(lambda: method1(data), number=iterations)
t2 = timeit.timeit(lambda: method2(data), number=iterations)
t3 = timeit.timeit(lambda: method3(data), number=iterations)
t4 = timeit.timeit(lambda: method4(data), number=iterations)
tmin = min((t0, t1, t2, t3, t4))

print('| Method | Time | Relative |')
print('|---|---|---|')
print(f'| Original | {t0} | {t0 / tmin} |')
print(f'| Removing tqdm | {t1} | {t1 / tmin} |')
print(f'| Removing to string and just using tuple | {t2} | {t2 / tmin} |')
print(f'| Removing double dictionary lookup | {t3} | {t3 / tmin} |')
print(f'| Using collections defaultdict | {t4} | {t4 / tmin} |')
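data2 is defined but never timed in the listing above; the "Using defaultdict with int key" row in the first table presumably came from something like this (my reconstruction, not the original code):

t5 = timeit.timeit(lambda: method4(data2), number=iterations)
print(f'| Using defaultdict with int key | {t5} | {t5 / tmin} |')

In the 6,000,000-element run t5 was the fastest, so tmin would also need to include it for the relative column to come out as 1.0.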