Home > Software design >  Efficient pythonic way to compute the difference between the max and min elements of each tuple in a
Efficient pythonic way to compute the difference between the max and min elements of each tuple in a

Time:01-31

Using the following code, I am trying to calculate the difference between the max and min elements of each tuple in a large list of tuples and store the results in a list. However, the code runs for long time and then OS kills it because it consumes huge amount of RAM. The large list is generated by choosing n numbers from a list, basically all possible ways as shown in the snippet below. I think the issue lies exactly there: itertools.combinations, which tries to store a massive list in memory.

I need the sum of the diffs arising from each combination actually, that's why I thought first I would get the diffs in a list and then call sum.

import itertools

n = 40

lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]

result = [max(x)-min(x) for x in itertools.combinations(lst, n)]

It will be a great learning experience for me if someone provides a hint about tackling this issue.

CodePudding user response:

Using the approach outlined in the comments on @chepner's answer.

The print statements in the loop show the code doing what I intend, but I haven't independently verified the overall answer is correct.

import itertools
import math

n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]
slst = sorted(lst)

ans = 0
for i, j in itertools.combinations(range(len(slst)), 2):
  # i and j are candidate *indexes* into slst (not values):
  # we'll count how many combinations have minimum slst[i] and maximum slst[j]
  
  if j < i n-1:
    # In this case there can't be an n-item combination
    # whose minimum is slst[i] and maximum is slst[j]:
    # there aren't enough items with values in between
    continue

  # How many n-item combinations have minimum slst[i] and maximum slst[j]?
  # It's the number of ways we can pick the (n-2) other members of the combination
  # from the (j-i-1) values between i and j in slst.
  n_comb = math.comb(j-i-1, n-2)
  print(f"{n_comb} combinations with minimum {slst[i]} (index {i}) and maximum {slst[j]} (index {j})")

  # Each of these combinations contributes slst[j] - slst[i] to the sum:
  ans  = n_comb * (slst[j] - slst[i])

print(f"Overall sum of differences: {ans}")

Result:

[omitted the lines for individual pairs of indices]
Overall sum of differences: 9965200498117

CodePudding user response:

If you want the sum, you don't need to store them all in memory at once. Just add them as you compute them.

s = sum(max(x) - min(x) for x in itertools.combinations(lst, n))

It will still take a while to iterate over all the combinations, but this will use a small constant amount of memory.

CodePudding user response:

Same basic idea as in slothrop's answer but implemented differently and much faster. I use an outer loop for how many list numbers shall exist between the max and min numbers:

def Kelly2(lst, n):
    lst = sorted(lst)
    total = 0
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        diffs = sum(b - a for a, b in zip(lst, lst[between 1:]))
        total  = combs * diffs
    return total

From the between numbers, we have to choose n-2, since we want n including min and max. This number of combinations is the same for all min/max pairs with between numbers between them, so we only compute it once. And instead of multiplying it with every max-min difference, sum those differences and multiply them with the combinations only once.

This can be taken further, as both combs and diffs change just a little when we increase between. Here are benchmark times with "50 choose 45" instead of your original "50 choose 40" (so chepner's brute force is still fast enough) up to "200000 choose 160000":

50 choose 45
 5.028 s  chepner
 0.000 s  slothrop
 0.000 s  Kelly
 0.000 s  Kelly2
 0.000 s  Kelly3
 0.000 s  Kelly4
 0.000 s  Kelly5

2000 choose 1600
 4.447 s  slothrop
 0.053 s  Kelly
 0.048 s  Kelly2
 0.039 s  Kelly3
 0.037 s  Kelly4
 0.001 s  Kelly5

10000 choose 8000
 4.030 s  Kelly
 3.798 s  Kelly2
 3.574 s  Kelly3
 3.521 s  Kelly4
 0.008 s  Kelly5

100000 choose 80000
 0.572 s  Kelly5

200000 choose 160000
 2.202 s  Kelly5

Here's that super fast one:

def Kelly5(lst, n):
    lst = sorted(lst)
    total = 0
    diffs = sum(lst[n-1:]) - sum(lst[:-n 1])
    combs = 1
    for between in range(n-2, len(lst)-1):
        total  = combs * diffs
        combs = combs * (between 1) // (between-n 3)
        diffs  = lst[~between-1] - lst[between 1]
    return total

Kelly3 and Kelly4 are intermediate optimizations from Kelly2 to Kelly5, makingbit easier to see how I got there.

Full code (Try it online!):

from time import time
import itertools, math, random
from math import comb

n = 40
lst = [639, 744, 947, 856, 102, 639, 916, 665, 766, 679, 679, 484, 658, 559, 564, 3, 384, 763, 236, 404, 566, 347, 866, 285, 107, 577, 989, 715, 84, 280, 153, 76, 24, 453, 284, 126, 92, 200, 792, 858, 231, 823, 695, 889, 382, 611, 244, 119, 726, 480]


def chepner(lst, n):
    return sum(max(x) - min(x) for x in itertools.combinations(lst, n))


def slothrop(lst, n):
  slst = sorted(lst)
  ans = 0
  for i, j in itertools.combinations(range(len(slst)), 2):
    if j < i n-1:
      continue
    n_comb = math.comb(j-i-1, n-2)
    ans  = n_comb * (slst[j] - slst[i])
  return ans


# My original
def Kelly(lst, n):
    lst = sorted(lst)
    return sum(
        comb(between, n-2) * sum(b - a for a, b in zip(lst, lst[between 1:]))
        for between in range(n-2, len(lst)-1)
    )


# Rewritten with loops for the later optimizations
def Kelly2(lst, n):
    lst = sorted(lst)
    total = 0
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        diffs = sum(b - a for a, b in zip(lst, lst[between 1:]))
        total  = combs * diffs
    return total


# Compute diffs as diff of sums (instead of sum of diffs)
def Kelly3(lst, n):
    lst = sorted(lst)
    total = 0
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        diffs = sum(lst[between 1:]) - sum(lst[:~between])
        total  = combs * diffs
    return total


# Compute diffs by updating (instead of from scratch)
def Kelly4(lst, n):
    lst = sorted(lst)
    total = 0
    diffs = sum(lst[n-1:]) - sum(lst[:-n 1])
    for between in range(n-2, len(lst)-1):
        combs = comb(between, n-2)
        total  = combs * diffs
        diffs  = lst[~between-1] - lst[between 1]
    return total


# Compute combs by updating (instead of from scratch)
def Kelly5(lst, n):
    lst = sorted(lst)
    total = 0
    diffs = sum(lst[n-1:]) - sum(lst[:-n 1])
    combs = 1
    for between in range(n-2, len(lst)-1):
        total  = combs * diffs
        combs = combs * (between 1) // (between-n 3)
        diffs  = lst[~between-1] - lst[between 1]
    return total


funcs = chepner, slothrop, Kelly, Kelly2, Kelly3, Kelly4, Kelly5

#-- Correctness ------------------------------------------

short = lst[:20]
for m in range(2, len(short) 1):
    expect = funcs[0](short, m)
    for f in funcs[1:]:
        result = f(short, m)
        assert result == expect

#-- Speed ------------------------------------------------

# Generate similar larger input data
def gen(N):
    n = N * 8 // 10
    lst = random.choices(range(20 * N), k=N)
    return lst, n

def test(lst, n, funcs):
    print(len(lst), 'choose', n)
    expect = None
    for f in funcs:
        copy = lst[:]
        t = time()
        result = f(copy, n)
        t = time() - t
        print(f'{t:6.3f} s ', f.__name__)
        if expect is None:
            expect = result
        assert result == expect
    print()

test(lst, 45, funcs)
test(*gen(2000), funcs[1:])
test(*gen(10000), funcs[2:])
test(*gen(100000), funcs[-1:])
test(*gen(200000), funcs[-1:])
  • Related