Using the following code, I am trying to calculate the difference between the max and min elements of each tuple in a large list of tuples and store the results in a list. However, the code runs for long time and then OS kills it because it consumes huge amount of RAM. The large list is generated by choosing n
numbers from a list, basically all possible ways as shown in the snippet below. I think the issue lies exactly there: itertools.combinations, which tries to store a massive list in memory.
I need the sum of the diffs arising from each combination actually, that's why I thought first I would get the diffs in a list and then call sum.
import itertools
n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]
result = [max(x)-min(x) for x in itertools.combinations(lst, n)]
It will be a great learning experience for me if someone provides a hint about tackling this issue.
CodePudding user response:
Using the approach outlined in the comments on @chepner's answer.
The print statements in the loop show the code doing what I intend, but I haven't independently verified the overall answer is correct.
import itertools
import math
n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]
slst = sorted(lst)
ans = 0
for i, j in itertools.combinations(range(len(slst)), 2):
# i and j are candidate *indexes* into slst (not values):
# we'll count how many combinations have minimum slst[i] and maximum slst[j]
if j < i n-1:
# In this case there can't be an n-item combination
# whose minimum is slst[i] and maximum is slst[j]:
# there aren't enough items with values in between
continue
# How many n-item combinations have minimum slst[i] and maximum slst[j]?
# It's the number of ways we can pick the (n-2) other members of the combination
# from the (j-i-1) values between i and j in slst.
n_comb = math.comb(j-i-1, n-2)
print(f"{n_comb} combinations with minimum {slst[i]} (index {i}) and maximum {slst[j]} (index {j})")
# Each of these combinations contributes slst[j] - slst[i] to the sum:
ans = n_comb * (slst[j] - slst[i])
print(f"Overall sum of differences: {ans}")
Result:
[omitted the lines for individual pairs of indices]
Overall sum of differences: 9965200498117
CodePudding user response:
If you want the sum, you don't need to store them all in memory at once. Just add them as you compute them.
s = sum(max(x) - min(x) for x in itertools.combinations(lst, n))
It will still take a while to iterate over all the combinations, but this will use a small constant amount of memory.
CodePudding user response:
Same basic idea as in slothrop's answer but implemented differently and much faster. I use an outer loop for how many list numbers shall exist between the max and min numbers:
def Kelly2(lst, n):
lst = sorted(lst)
total = 0
for between in range(n-2, len(lst)-1):
combs = comb(between, n-2)
diffs = sum(b - a for a, b in zip(lst, lst[between 1:]))
total = combs * diffs
return total
From the between
numbers, we have to choose n-2
, since we want n including min and max. This number of combinations is the same for all min/max pairs with between
numbers between them, so we only compute it once. And instead of multiplying it with every max-min difference, sum those differences and multiply them with the combinations only once.
This can be taken further, as both combs
and diffs
change just a little when we increase between
. Here are benchmark times with "50 choose 45" instead of your original "50 choose 40" (so chepner's brute force is still fast enough) up to "200000 choose 160000":
50 choose 45
5.028 s chepner
0.000 s slothrop
0.000 s Kelly
0.000 s Kelly2
0.000 s Kelly3
0.000 s Kelly4
0.000 s Kelly5
2000 choose 1600
4.447 s slothrop
0.053 s Kelly
0.048 s Kelly2
0.039 s Kelly3
0.037 s Kelly4
0.001 s Kelly5
10000 choose 8000
4.030 s Kelly
3.798 s Kelly2
3.574 s Kelly3
3.521 s Kelly4
0.008 s Kelly5
100000 choose 80000
0.572 s Kelly5
200000 choose 160000
2.202 s Kelly5
Here's that super fast one:
def Kelly5(lst, n):
lst = sorted(lst)
total = 0
diffs = sum(lst[n-1:]) - sum(lst[:-n 1])
combs = 1
for between in range(n-2, len(lst)-1):
total = combs * diffs
combs = combs * (between 1) // (between-n 3)
diffs = lst[~between-1] - lst[between 1]
return total
Kelly3
and Kelly4
are intermediate optimizations from Kelly2
to Kelly5
, makingbit easier to see how I got there.
Full code (Try it online!):
from time import time
import itertools, math, random
from math import comb
n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]
def chepner(lst, n):
return sum(max(x) - min(x) for x in itertools.combinations(lst, n))
def slothrop(lst, n):
slst = sorted(lst)
ans = 0
for i, j in itertools.combinations(range(len(slst)), 2):
if j < i n-1:
continue
n_comb = math.comb(j-i-1, n-2)
ans = n_comb * (slst[j] - slst[i])
return ans
# My original
def Kelly(lst, n):
lst = sorted(lst)
return sum(
comb(between, n-2) * sum(b - a for a, b in zip(lst, lst[between 1:]))
for between in range(n-2, len(lst)-1)
)
# Rewritten with loops for the later optimizations
def Kelly2(lst, n):
lst = sorted(lst)
total = 0
for between in range(n-2, len(lst)-1):
combs = comb(between, n-2)
diffs = sum(b - a for a, b in zip(lst, lst[between 1:]))
total = combs * diffs
return total
# Compute diffs as diff of sums (instead of sum of diffs)
def Kelly3(lst, n):
lst = sorted(lst)
total = 0
for between in range(n-2, len(lst)-1):
combs = comb(between, n-2)
diffs = sum(lst[between 1:]) - sum(lst[:~between])
total = combs * diffs
return total
# Compute diffs by updating (instead of from scratch)
def Kelly4(lst, n):
lst = sorted(lst)
total = 0
diffs = sum(lst[n-1:]) - sum(lst[:-n 1])
for between in range(n-2, len(lst)-1):
combs = comb(between, n-2)
total = combs * diffs
diffs = lst[~between-1] - lst[between 1]
return total
# Compute combs by updating (instead of from scratch)
def Kelly5(lst, n):
lst = sorted(lst)
total = 0
diffs = sum(lst[n-1:]) - sum(lst[:-n 1])
combs = 1
for between in range(n-2, len(lst)-1):
total = combs * diffs
combs = combs * (between 1) // (between-n 3)
diffs = lst[~between-1] - lst[between 1]
return total
funcs = chepner, slothrop, Kelly, Kelly2, Kelly3, Kelly4, Kelly5
#-- Correctness ------------------------------------------
short = lst[:20]
for m in range(2, len(short) 1):
expect = funcs[0](short, m)
for f in funcs[1:]:
result = f(short, m)
assert result == expect
#-- Speed ------------------------------------------------
# Generate similar larger input data
def gen(N):
n = N * 8 // 10
lst = random.choices(range(20 * N), k=N)
return lst, n
def test(lst, n, funcs):
print(len(lst), 'choose', n)
expect = None
for f in funcs:
copy = lst[:]
t = time()
result = f(copy, n)
t = time() - t
print(f'{t:6.3f} s ', f.__name__)
if expect is None:
expect = result
assert result == expect
print()
test(lst, 45, funcs)
test(*gen(2000), funcs[1:])
test(*gen(10000), funcs[2:])
test(*gen(100000), funcs[-1:])
test(*gen(200000), funcs[-1:])