How to estimate percentiles on streaming data. (Identifying equally sized bins of numbers in a stream)

Peer summary: HMGHaly wants to find the locations of equally spaced percentiles on a data stream. The bins HMGHaly is after should each contain roughly the same number of data points, so the distances between bin boundaries are not expected to be equal. "Size", as HMGHaly uses it, refers to the number of data points in a bin, not the width of the bin.

I have an iterable of numbers which I cannot fully load into memory, and I want to split these numbers into bins of equal size. In other words, if I were to sort all these numbers and split them into, for example, 10 groups/bins, what would the lowest and highest value of each bin be?

It is quite easy to identify the mean by counting and adding the numbers so far. It is also quite easy to get the minimum and maximum value so far, but this kind of splitting seems challenging.

I have a few ideas:

1. If I were not restricted by memory, I could load all the numbers into a list, sort the list, and then split it into equally sized smaller lists, easily identifying the boundary values of each small list. That is not applicable here.
2. I can try to sort the huge iterable somehow and then deal with it as a sorted list, but the issue is that I would have to do this for many different sets of values that I process simultaneously (the numbers under each column).
3. I can track the running mean and standard deviation, similar to this answer, and then define the bins as a number of standard deviations (or fractions of a standard deviation) around the mean. However, when I tried implementing this, subtracting the standard deviation from the mean gave a value below the minimum value. I think this is an issue with the data distribution, which may be skewed towards higher values, but at the end of the day using the standard deviation didn't help.
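The effect described in the third idea can be reproduced with a small sketch. It assumes Welford's online update for the running mean and variance (the linked answer may use a different formula), and the class name `RunningStats` is made up for illustration. On skewed data, the mean minus one standard deviation can fall below the smallest observed value, which is why standard-deviation bands are a poor stand-in for equal-count bins.

```python
import math

class RunningStats:
    """Running count, mean and variance using Welford's online update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation of the values seen so far.
        return math.sqrt(self.m2 / self.count) if self.count else 0.0

# Skewed toy data: many small values and one large value.
data = [0] * 9 + [100]
stats = RunningStats()
for x in data:
    stats.add(x)

# mean - std lands below the minimum, although no data point lies there.
print(stats.mean - stats.std, "<", min(data))
```

The standard deviation describes spread symmetrically around the mean, so it carries no information about where equal-count boundaries fall in an asymmetric distribution.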

So, the question is here as follows:

Given an iterable of tens of millions of numbers, and say that we want to split them into N bins (e.g. 10 bins) of equal size, how can we identify the upper-bound and lower-bound value of each bin without loading all these numbers into memory?

Edit: The bin-splitting process is as follows, for simple in-memory list sorting/splitting/binning:

import random
list1=[random.randint(0,20) for i in range(100)]
list1.sort()
print("full list:",list1)
n_intervals=10
interval_size=len(list1)//n_intervals
for i0 in range(n_intervals):
  small_list1=list1[interval_size*i0:interval_size*(i0+1)]
  bounds=(small_list1[0],small_list1[-1])
  print("small_list # %s"%i0, small_list1, "size:", len(small_list1), "- bounds:", bounds)

Output

full list: [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 16, 16, 16, 16, 16, 16, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20]
small_list # 0 [0, 0, 0, 1, 1, 1, 1, 2, 2, 2] size: 10 - bounds: (0, 2)
small_list # 1 [2, 2, 2, 2, 3, 3, 3, 3, 4, 4] size: 10 - bounds: (2, 4)
small_list # 2 [4, 5, 5, 5, 5, 5, 5, 5, 5, 6] size: 10 - bounds: (4, 6)
small_list # 3 [6, 6, 6, 6, 7, 7, 7, 7, 7, 7] size: 10 - bounds: (6, 7)
small_list # 4 [7, 8, 8, 8, 8, 8, 8, 8, 8, 9] size: 10 - bounds: (7, 9)
small_list # 5 [9, 9, 9, 10, 10, 10, 10, 11, 11, 11] size: 10 - bounds: (9, 11)
small_list # 6 [11, 12, 12, 12, 12, 12, 12, 13, 13, 14] size: 10 - bounds: (11, 14)
small_list # 7 [14, 14, 14, 14, 14, 15, 15, 16, 16, 16] size: 10 - bounds: (14, 16)
small_list # 8 [16, 16, 16, 16, 17, 17, 17, 18, 18, 18] size: 10 - bounds: (16, 18)
small_list # 9 [19, 19, 19, 19, 19, 19, 19, 20, 20, 20] size: 10 - bounds: (19, 20)

Further edit: To be fully clear, I need something like the following. It is very easy to get the mean, min and max, but the question now is how to define the boundary values that can split all the values into bins of equal size, while calculating them as a stream of running values, without having to store the running values in memory.

import random
random.seed(0)
count0=0
sum0=0
running_min0=None
running_max0=None

def get_bin_boundaries(n_bins=5): #The function I need, it can take any arguments
  return #and return a list of n_bins + 1 boundary values, e.g. [0,3,7,9,11,15]

for i in range(100000000):
  cur_number=random.randint(0,20)
  count0+=1
  sum0+=cur_number
  running_mean0=sum0/count0
  if running_min0 is None or running_min0>cur_number:running_min0=cur_number
  if running_max0 is None or running_max0<cur_number:running_max0=cur_number
  running_bin_boundaries=get_bin_boundaries() #This is what I need
  #print("cur_number",cur_number,"running_mean0",running_mean0,"running_min0",running_min0,"running_max0",running_max0)

CodePudding user response：

I think you will need to sort the stream. You can achieve this as follows (I am assuming here that you know the number of items in the stream and that your memory can hold at least two bins at a time):

1. Store each bin on disk as the stream arrives [bin_size = number_of_items_in_stream / number_of_bins].
2. After the end of the stream, load each bin into memory, sort it, and store it on disk again. Record the name of each bin together with its min and max values in a data structure.
3. In the data structure, sort the bin names according to their min value.
4. From step 3 you can identify which bins intersect with each other.
5. Loop over the data structure, load every two intersecting bins into memory, and exchange values between them so that the two bins no longer have any intersecting values.
6. After step 5, update the min and max values of the two bins in the data structure to match their new contents.
7. Your stream is now sorted.
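A disk-backed sort along these lines can be sketched as follows. Note that this is not the pairwise exchange of intersecting bins described in the steps: it swaps in a standard external merge sort (sorted runs on disk, then a k-way merge with `heapq.merge`), which reaches the same sorted order. The helper names `sorted_runs`, `read_run` and `bin_bounds_external` are hypothetical, values are read back as floats, and the total number of items is assumed to be known in advance, as the answer assumes.

```python
import heapq
import os
import tempfile

def sorted_runs(numbers, run_size):
    """Write sorted runs of at most run_size numbers to temp files."""
    paths, run = [], []

    def flush():
        run.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{x}\n" for x in run)
        paths.append(path)
        run.clear()

    for x in numbers:
        run.append(x)
        if len(run) >= run_size:
            flush()
    if run:
        flush()
    return paths

def read_run(path):
    """Lazily yield the numbers stored in one sorted run file."""
    with open(path) as f:
        for line in f:
            yield float(line)

def bin_bounds_external(numbers, total, n_bins, run_size=100_000):
    """Return a (low, high) pair per bin; total is the known stream length."""
    bin_size = total // n_bins
    if bin_size == 0:
        raise ValueError("need at least n_bins items")
    paths = sorted_runs(numbers, run_size)
    bounds, low = [], None
    try:
        # heapq.merge streams the runs in sorted order, one item per run in memory.
        merged = heapq.merge(*(read_run(p) for p in paths))
        for rank, x in enumerate(merged):
            if rank >= bin_size * n_bins:
                continue  # leftovers beyond the last full bin are ignored
            if rank % bin_size == 0:
                low = x
            if rank % bin_size == bin_size - 1:
                bounds.append((low, x))
    finally:
        for p in paths:
            os.remove(p)
    return bounds
```

Only one run of `run_size` numbers is held in memory while writing, and only one number per run while merging, so memory use is controlled by `run_size` rather than by the stream length.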

CodePudding user response：

If you know the expected length of input beforehand, it would be pretty easy if I understand you correctly:

import random
random.seed(0)
count0=0
sum0=0
running_min0=None
running_max0=None
n_items=100000000 #expected length of the input (named n_items rather than len, to avoid shadowing the built-in)

def get_bin_boundaries(n_bins=5): #The function I need, it can take any arguments
  res = []
  i = 0
  while i < n_items:
    res.append(i)
    i += n_items//n_bins
  res.append(n_items-1)
  return res #a list of n_bins + 1 boundary positions, e.g. [0,3,7,9,11,15]

for i in range(n_items):
  cur_number=random.randint(0,20)
  count0+=1
  sum0+=cur_number
  running_mean0=sum0/count0
  if running_min0 is None or running_min0>cur_number:running_min0=cur_number
  if running_max0 is None or running_max0<cur_number:running_max0=cur_number
  running_bin_boundaries=get_bin_boundaries() #This is what I need
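Note that the list returned by this function holds positions in the sorted order, not the values found at those positions. When the stream takes only a few distinct values, as with `random.randint(0,20)` in the example, a frequency table is enough to turn positions into values exactly. The following is a minimal sketch under that assumption; `bin_bounds_from_counts` is a hypothetical helper name, and the approach does not help when almost every value is distinct.

```python
from collections import Counter
import random

def bin_bounds_from_counts(counts, n_bins):
    """Return a (low, high) pair per bin from a value -> frequency mapping."""
    total = sum(counts.values())
    bin_size = total // n_bins
    if bin_size == 0:
        raise ValueError("need at least n_bins items")
    # 0-based ranks of the first and last item of every bin, in increasing order.
    wanted = []
    for b in range(n_bins):
        wanted += [b * bin_size, (b + 1) * bin_size - 1]
    items = iter(sorted(counts.items()))
    value, freq = next(items)
    seen = 0  # number of items strictly before the current value
    found = []
    for rank in wanted:
        # Advance through the sorted values until one covers this rank.
        while rank >= seen + freq:
            seen += freq
            value, freq = next(items)
        found.append(value)
    return list(zip(found[0::2], found[1::2]))

# Usage: memory grows with the number of distinct values, not the stream length.
random.seed(0)
counts = Counter()
for _ in range(100_000):
    counts[random.randint(0, 20)] += 1
print(bin_bounds_from_counts(counts, n_bins=5))
```

Because the counts can be updated one number at a time, the boundaries can be recomputed at any point in the stream without knowing its length beforehand.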

CodePudding user response：

You should use Python with Apache Spark. Doing this operation in plain Python will take a lot of time and will not be efficient.

https://spark.apache.org/

Another option, if you need to work with Python only, is to try pandas: https://pandas.pydata.org/
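If staying in plain Python with a fixed memory budget, one approximate option is to keep a uniform random sample of the stream and read the bin boundaries from the sorted sample. The sketch below assumes reservoir sampling (Algorithm R). The class name `ReservoirQuantiles` and its parameters are made up for illustration, and the boundaries it returns are estimates whose accuracy depends on the sample capacity.

```python
import random

class ReservoirQuantiles:
    """Keep a uniform random sample of a stream and estimate bin boundaries."""

    def __init__(self, capacity=10_000, seed=None):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, x):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            # Algorithm R: the new item replaces a random slot with
            # probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def bin_boundaries(self, n_bins=5):
        """Return n_bins + 1 approximate boundary values."""
        s = sorted(self.sample)
        if not s:
            return []
        last = len(s) - 1
        return [s[round(i * last / n_bins)] for i in range(n_bins + 1)]

# Usage: the stream length does not need to be known in advance.
rq = ReservoirQuantiles(capacity=1_000, seed=0)
for _ in range(100_000):
    rq.add(random.randint(0, 20))
print(rq.bin_boundaries(n_bins=5))
```

Sorting the sample on every call costs O(capacity log capacity), so in a loop like the one in the question it would make sense to ask for boundaries only occasionally rather than after every number.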