Most efficient way to convert list of values to probability distribution?

Time:11-24

I have several lists that can only contain the following values: 0, 0.5, 1, 1.5

I want to efficiently convert each of these lists into probability mass functions. So if a list is as follows: [0.5, 0.5, 1, 1.5], the PMF will look like this: [0, 0.5, 0.25, 0.25].

I need to do this many times (and with very large lists), so avoiding looping will be optimal, if at all possible. What's the most efficient way to make this happen?

Edit: Here's my current approach. It feels like a really inefficient and inelegant way to do it:

import numpy as np
from collections import Counter

def get_distribution(samplemodes1):
    
    n, bin_edges = np.histogram(samplemodes1, bins = 9)
    totalcount = np.sum(n)
    bin_probability = n / totalcount
    bins_per_point = np.fmin(np.digitize(samplemodes1, bin_edges), len(bin_edges)-1)
    probability_perpoint = [bin_probability[bins_per_point[i]-1] for i in range(len(samplemodes1))] 
    
    counts = Counter(samplemodes1)
    total = sum(counts.values())
    
    probability_mass = {k:v/total for k,v in counts.items()}
    #print(probability_mass)
    
    key_values = {}
    
    if(0 in probability_mass):
        key_values[0] = probability_mass.get(0)
    else:
        key_values[0] = 0
    if(0.5 in probability_mass):
        key_values[0.5] = probability_mass.get(0.5)
    else:
        key_values[0.5] = 0
    if(1 in probability_mass):
        key_values[1] = probability_mass.get(1)
    else:
        key_values[1] = 0
    if(1.5 in probability_mass):
        key_values[1.5] = probability_mass.get(1.5)  
    else:
        key_values[1.5] = 0
        
        
    distribution = list(key_values.values())
    return distribution

CodePudding user response:

Here are some solutions for you to benchmark:

Using collections.Counter

from collections import Counter

bins = [0, 0.5, 1, 1.5]
a = [0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5]
denom = len(a)
counts = Counter(a)
pmf = [counts[bin]/denom for bin in bins]

NumPy based solution

import numpy as np

bins = [0, 0.5, 1, 1.5]
a = np.array([0.5, 0.5, 1.0, 0.5, 1.0, 1.5, 0.5])
denom = len(a)
pmf = [(a == bin).sum()/denom for bin in bins]

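You can probably do better with np.bincount(), which counts occurrences of non-negative integers in a single vectorized pass. Because the allowed values are all multiples of 0.5, doubling them gives integer bin indices 0 to 3.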
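A minimal sketch of the np.bincount() idea, assuming every value is a non-negative multiple of 0.5 and the input is non-empty. The function name pmf_bincount and the step and n_bins parameters are made up for illustration and are not part of NumPy:

```python
import numpy as np

def pmf_bincount(values, step=0.5, n_bins=4):
    """Return the PMF over the bins 0, step, 2*step, ... (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    # Map each value to an integer bin index: 0 -> 0, 0.5 -> 1, 1 -> 2, 1.5 -> 3.
    # np.rint guards against tiny floating-point error before the integer cast.
    idx = np.rint(arr / step).astype(np.intp)
    # minlength pads with zeros so bins that never occur still appear in the result.
    counts = np.bincount(idx, minlength=n_bins)
    # Assumes a non-empty input; an empty list would divide by zero here.
    return counts / idx.size

print(pmf_bincount([0.5, 0.5, 1, 1.5]).tolist())  # [0.0, 0.5, 0.25, 0.25]
```

Note that a value outside the expected range (for example 2.0) would make the result longer than n_bins rather than raise an error, so validate the input first if that can happen.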

Further reading on this idea: https://thispointer.com/count-occurrences-of-a-value-in-numpy-array-in-python/
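To benchmark the Counter and NumPy comparison approaches on your own machine, a rough timing harness could look like the following. The function names, the random test data, and the default sizes are assumptions for illustration; actual timings depend on list size and hardware, so none are quoted here:

```python
import timeit
from collections import Counter

import numpy as np

BINS = [0, 0.5, 1, 1.5]

def pmf_counter(values):
    # Pure-Python approach: one pass over the list to count each value.
    counts = Counter(values)
    n = len(values)
    return [counts[b] / n for b in BINS]

def pmf_compare(arr):
    # NumPy approach: one vectorized equality test per bin.
    n = len(arr)
    return [(arr == b).sum() / n for b in BINS]

def time_approaches(size=100_000, repeats=5, seed=0):
    # Random sample drawn from the four allowed values (illustrative data only).
    rng = np.random.default_rng(seed)
    arr = rng.choice(BINS, size=size)
    as_list = arr.tolist()
    # Report the best of several runs, in seconds, for each approach.
    return {
        "counter": min(timeit.repeat(lambda: pmf_counter(as_list), number=1, repeat=repeats)),
        "numpy_compare": min(timeit.repeat(lambda: pmf_compare(arr), number=1, repeat=repeats)),
    }
```

Calling time_approaches() returns a dict of best-case timings. Time each function on the container it expects (a list for Counter, an array for NumPy), because converting between the two can dominate the cost for large inputs.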
