Finding quantiles from np.array of different length lists

Time:12-15

I am trying to efficiently calculate the interquartile range (IQR) of some variable-length histogram data. The data is stored as a list of lists, where each inner list is an individual histogram. Most of these histograms have a length of 100, but the length can vary between 50 and 150 ints.

Sample data:

list_of_hists = [
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28]
]

I am currently calculating the IQRs using a simple for loop:

import numpy as np

list_of_iqrs = []
for hist in list_of_hists:
    # Linear interpolation is np.quantile's default, so no extra keyword is needed.
    q1, q3 = np.quantile(hist, [0.25, 0.75])
    list_of_iqrs.append(q3 - q1)

Expected results for the above data:

list_of_iqrs = [10.0, 10.5, 11.5, 2.0]

Given that this list of hists is ~10**6 elements long, I am hoping to find a way to do this using an array calculation. Unfortunately, when I try to turn this into an array, I just get an array of lists:

array([
       list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42]),
       list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42]),
       list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43]),
       list([10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28])
])

and the quantile calculation doesn't work like I would expect.

How can I turn this list of hists into an array and find the IQR?

Edit: Another option seems to be to pad each hist so that they all have the same length, and then turn that list of hists into an array:

list_of_hists_ = [hist + [None] * (len(max(list_of_hists, key=len)) - len(hist)) for hist in list_of_hists]
np.array(list_of_hists_)

but this is very slow. Maybe I've already found the fastest way to do this?
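
A minimal sketch of the padding idea, assuming NaN rather than None as the fill value so that the array stays numeric and np.nanquantile can skip the padding. The helper name iqr_nan_padded is hypothetical, and no speed claim is made here; it only shows that the padded array reproduces the per-list results.

```python
import numpy as np

def iqr_nan_padded(list_of_hists):
    """IQR per histogram via a NaN-padded 2-D float array (illustrative helper)."""
    lengths = np.array([len(h) for h in list_of_hists])
    max_len = lengths.max()
    padded = np.full((len(list_of_hists), max_len), np.nan)
    # Boolean mask marking the valid (non-padding) cells of each row.
    mask = np.arange(max_len) < lengths[:, None]
    # Row-major boolean assignment matches the order of the concatenated values.
    padded[mask] = np.concatenate(
        [np.asarray(h, dtype=float) for h in list_of_hists]
    )
    # nanquantile ignores the NaN padding; result has shape (2, n_rows).
    q1, q3 = np.nanquantile(padded, [0.25, 0.75], axis=1)
    return q3 - q1
```

Calling iqr_nan_padded(list_of_hists) returns a 1-D float array with one IQR per histogram.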

Answer:

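Since each histogram is already sorted, you can exploit that to calculate the IQR more cheaply: split the list in half and take the difference between the median of the upper half and the median of the lower half. Note that this median-of-halves definition of the quartiles is not the same as the linear interpolation used by np.quantile, so the results can differ slightly (12.0 instead of 11.5 for the third histogram in the output below).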

import numpy as np
from time import time

list_of_hists = [
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
    [13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
    [10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],

]
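# The rows have different lengths, so an object array is required;
# recent NumPy versions raise an error for ragged input without dtype=object.
list_of_hists = np.array(list_of_hists, dtype=object)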

def sortedIQR(data):
    pivot = len(data)//2
    # First quartile (Q1)
    Q1 = np.median(data[:pivot])
    # Third quartile (Q3)
    Q3 = np.median(data[pivot:])
    # Interquartile range (IQR)
    IQR = Q3 - Q1
    return IQR

def simpleIQR(hist):
    # Linear interpolation is np.quantile's default, so no extra keyword is needed.
    iqr = np.quantile(hist, 0.75) - np.quantile(hist, 0.25)
    return iqr

start = time()
answers = []
for idx, item in enumerate(list_of_hists):
    answers.append(simpleIQR(item))
end = time()
print('Elapsed Time for Simple IQR: ', round(end-start, 5))
print(answers)
answers = []
start = time()
for idx, item in enumerate(list_of_hists):
    answers.append(sortedIQR(item))
end = time()
print('Elapsed Time for Sorted IQR: ', round(end-start, 5))
print(answers)

Output:

Elapsed Time for Simple IQR:  0.004
[10.0, 10.5, 11.5, 2.0, 10.0, 10.5, 11.5, 2.0, 10.0, 10.5, 11.5, 2.0]
Elapsed Time for Sorted IQR:  0.001
[10.0, 10.5, 12.0, 2.0, 10.0, 10.5, 12.0, 2.0, 10.0, 10.5, 12.0, 2.0]
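
If the results need to match np.quantile's linear interpolation exactly, the sortedness can still be used. The following is a sketch, not a benchmarked solution: it assumes every histogram is non-empty and already sorted in ascending order, flattens all histograms into one array, and looks up the two neighbouring elements around each fractional quantile position. The function name iqr_sorted_ragged is hypothetical.

```python
import numpy as np

def iqr_sorted_ragged(list_of_hists):
    """Linear-interpolated IQR for already-sorted, variable-length rows."""
    lengths = np.array([len(h) for h in list_of_hists])
    flat = np.concatenate([np.asarray(h, dtype=float) for h in list_of_hists])
    # Index in `flat` where each histogram starts.
    starts = np.cumsum(lengths) - lengths

    def quantile(q):
        pos = q * (lengths - 1)               # fractional position in each row
        lo = np.floor(pos).astype(int)        # element just below the position
        hi = np.minimum(lo + 1, lengths - 1)  # element just above, clipped to the row
        frac = pos - lo
        return flat[starts + lo] * (1 - frac) + flat[starts + hi] * frac

    return quantile(0.75) - quantile(0.25)
```

The only Python-level loops left are the ones that measure and concatenate the lists; the quantile lookup itself runs as array operations over all histograms at once.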