I am trying to efficiently calculate the interquartile range (IQR) of some variable-length histogram data. I have the data in a list of lists, where each inner list is an individual histogram. Most of these histograms have a length of 100, but the length can vary between 50 and 150 ints.
Sample data:
list_of_hists = [
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
[10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28]
]
I am currently calculating the IQRs using a simple for loop:
list_of_iqrs = []
for hist in list_of_hists:
    iqr = np.quantile(hist, 0.75, interpolation="linear") - np.quantile(
        hist, 0.25, interpolation="linear"
    )
    list_of_iqrs.append(iqr)
Expected results for the above data:
list_of_iqrs = [10.0, 10.5, 11.5, 2.0]
Given that this list of hists is ~10**6 elements long, I am hoping to find a way to do this using an array calculation. Unfortunately, when I try to turn this into an array, I just get an array of lists:
array([
list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42]),
list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42]),
list([13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43]),
list([10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28])
])
and the quantile calculation doesn't work like I would expect.
How can I turn this list of hists into an array and find the IQR?
Edit: Another solution seems to be to append onto each hist and make them all the same length, and then turn that list of hists into an array:
list_of_hists_ = [hist + [None]*(len(max(list_of_hists, key=len))-len(hist)) for hist in list_of_hists]
np.array(list_of_hists_)
but this is very slow. Maybe I've already found the fastest way to do this?
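For reference, here is a minimal sketch of the same padding idea done with a boolean mask instead of list concatenation. It assumes NaN is an acceptable padding value (the array becomes float) and uses np.nanquantile so the padding is ignored. The function name iqr_padded is made up for illustration, and this is not guaranteed to beat the loop, since np.nanquantile can itself be slow on large arrays:

```python
import numpy as np

def iqr_padded(list_of_hists):
    # Length of every histogram, and the width of the padded array.
    lengths = np.array([len(h) for h in list_of_hists])
    width = lengths.max()

    # Start with an all-NaN array, one row per histogram.
    padded = np.full((len(list_of_hists), width), np.nan)

    # mask[i, j] is True where row i has a real value in column j.
    mask = np.arange(width) < lengths[:, None]

    # Boolean assignment fills row by row, matching the concatenation order.
    padded[mask] = np.concatenate(list_of_hists)

    # NaN-aware quantiles along each row (default linear interpolation).
    q1, q3 = np.nanquantile(padded, [0.25, 0.75], axis=1)
    return q3 - q1
```

Calling iqr_padded(list_of_hists) returns one IQR per histogram as a float array.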
CodePudding user response:
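Since each histogram is already sorted, you can exploit that to calculate the IQR more cheaply: split the data at its midpoint and take the difference between the medians of the upper and lower halves. Note that this is a different quartile convention from np.quantile with linear interpolation, so the two can disagree slightly (compare the third value in the output below: 12.0 versus 11.5).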
import numpy as np
from time import time
list_of_hists = [
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
[10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
[10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 42],
[13, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 22, 22, 22, 23, 24, 24, 24, 24, 25, 25, 25, 25, 27, 28, 28, 29, 30, 30, 30, 31, 31, 31, 31, 32, 32, 32, 32, 33, 33, 34, 34, 35, 35, 35, 35, 35, 36, 36, 42, 43, 43, 43, 43],
[10, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 19, 25, 28],
]
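# The inner lists have different lengths, so an object array is required;
# recent NumPy versions raise a ValueError for ragged input without dtype=object.
list_of_hists = np.array(list_of_hists, dtype=object)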
def sortedIQR(data):
    pivot = len(data)//2
    # First quartile (Q1)
    Q1 = np.median(data[:pivot])
    # Third quartile (Q3)
    Q3 = np.median(data[pivot:])
    # Interquartile range (IQR)
    IQR = Q3 - Q1
    return IQR

def simpleIQR(hist):
    iqr = np.quantile(hist, 0.75, interpolation="linear") - np.quantile(
        hist, 0.25, interpolation="linear"
    )
    return iqr
start = time()
answers = []
for idx, item in enumerate(list_of_hists):
    answers.append(simpleIQR(item))
end = time()
print('Elapsed Time for Simple IQR: ', round(end-start, 5))
print(answers)

answers = []
start = time()
for idx, item in enumerate(list_of_hists):
    answers.append(sortedIQR(item))
end = time()
print('Elapsed Time for Sorted IQR: ', round(end-start, 5))
print(answers)
output:
Elapsed Time for Simple IQR: 0.004
[10.0, 10.5, 11.5, 2.0, 10.0, 10.5, 11.5, 2.0, 10.0, 10.5, 11.5, 2.0]
Elapsed Time for Sorted IQR: 0.001
[10.0, 10.5, 12.0, 2.0, 10.0, 10.5, 12.0, 2.0, 10.0, 10.5, 12.0, 2.0]
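If the results need to match np.quantile's linear interpolation exactly, the sortedness can still be used without a Python-level loop over quantile calls. For sorted data of length n, the q-quantile sits at fractional position q*(n-1), so it can be read off by indexing the two neighbouring elements and interpolating. The following is only a sketch under the assumption that every inner list is non-empty and already sorted ascending; the name iqr_sorted_ragged is hypothetical:

```python
import numpy as np

def iqr_sorted_ragged(list_of_hists):
    # Length of each histogram and its starting offset in the flat array.
    lengths = np.array([len(h) for h in list_of_hists], dtype=np.int64)
    flat = np.concatenate(list_of_hists).astype(np.float64)
    starts = np.cumsum(lengths) - lengths

    def quantile(q):
        # Fractional position of the quantile inside each sorted histogram.
        pos = q * (lengths - 1)
        lo = np.floor(pos).astype(np.int64)
        hi = np.minimum(lo + 1, lengths - 1)  # clamp at the last element
        frac = pos - lo
        # Linear interpolation between the two neighbouring values.
        return flat[starts + lo] * (1 - frac) + flat[starts + hi] * frac

    return quantile(0.75) - quantile(0.25)
```

This takes the original list of lists directly (no padding, no object array) and returns one float per histogram. No sorting is performed, so unsorted input would give wrong answers.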