How do you divide up a list into chunks which vary according to a normal distribution

Time:02-01

I want to take a list of thousands of items and group them into 12 chunks, where the number of items in each chunk follows a normal distribution (bell curve).

I am looking for output like this:

[
    { 0: ['6355ab76f70c5c59749f2018', '6355c797f70c5c5974a1cb15', '6355d256f70c5c5974a36a6c' ] },
    { 1: ['6355d270f70c5c5974a37356',
 '6355d29bf70c5c5974a3810a',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d371f70c5c5974a3c236',
 '6355d389f70c5c5974a3c828'] },
    ...
]

I want the list I pass in to be divided so that most of the items land in the middle chunks (numbers 4-8, roughly) and fewer items are grouped together toward the "edges" of the resulting list (numbers 0-3 and 9-11). The whole input list must be used up, so every item ends up in exactly one chunk.

I tried to tackle this with numpy but so far I have not been able to get the output I want.

My current code (two different functions):

import numpy as np

def divide_list_normal(lst):
    normal_dist = np.random.normal(size=len(lst)) # Generate a normal distribution of numbers
    sorted_list = [x for _,x in sorted(zip(normal_dist,lst))] # Sort the list according to the normal distribution
    chunk_size = int(len(lst)/len(normal_dist)) # Divide the list into chunks
    chunks = [sorted_list[i:i + chunk_size] for i in range(0, len(sorted_list), chunk_size)]
    return chunks 

def divide_list_normal_define_chunk_size(lst, n):
    normal_dist = np.random.normal(size=len(lst)) # Generate a normal distribution of numbers
    sorted_list = [x for _,x in sorted(zip(normal_dist,lst))] # Sort the list according to the normal distribution
    chunk_size = int(len(lst)/len(normal_dist)) # Divide the list into chunks
    chunks = [sorted_list[i:i + chunk_size] for i in range(0, n, chunk_size)]
    return chunks

The output for the first comes out like so:

[['63a8d83336756fd65d455c77'],
 ['6355f7c6f70c5c5974adfbce'],
 ['635629c6f70c5c5974bbab53'],
 ['6355fa8bf70c5c5974aeb70f'],
 ['6355dcd7f70c5c5974a6355c'],
 ['63a96dae36756fd65d549333'],
 ['639245927eeb4e9fd025e397'],
 ['63562463f70c5c5974ba3b5c'],
 ['63a8e04736756fd65d4635cf'],
 ['635629a5f70c5c5974bba1c1'],
 ['6355f74ef70c5c5974addd2c'],...]

The output for the second comes out like so:

[['63aa1a9d36756fd65d7566cf'],
 ['6355ed78f70c5c5974ab6840'],
 ['63a94e1836756fd65d500d5d'],
 ['63a8e23e36756fd65d4667ec'],
 ['63a96c6536756fd65d5463db'],
 ['63d39021d34efb9c0983d64a'],
 ['635627a9f70c5c5974bb1573'],
 ['63b3a4c236756fd65d33750a'],
 ['63562320f70c5c5974b9e50b'],
 ['63aa1aec36756fd65d758676'],
 ['63a9551636756fd65d5111fb'],
 ['63562443f70c5c5974ba31ed']]

Is there a way to divide up a list into chunks which vary according to a normal distribution? If you know how, please share it. Thank you!

CodePudding user response:

This works, although it may be slow depending on your requirements

import numpy as np
from itertools import islice


testList = ['6355d29bf70c5c5974a3810a',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
  '6355d36cf70c5c5974a3c103',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d371f70c5c5974a3c236',
 '6355d389f70c5c5974a3c828']

normal_dist = np.random.normal(size=len(testList),loc=10,scale=4) 
sorted_list = [list(islice(testList, int(x))) for x in normal_dist] 

One thing to watch out for: since these are slices of a list, the sampled values must stay in bounds, i.e. 0 < loc - scale < len(testList). In particular, a negative sample makes islice raise a ValueError.
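
Note that in the snippet above every islice call starts again from the beginning of testList, so the resulting slices overlap. If non-overlapping chunks are wanted, one option is to share a single iterator between the calls. The following is only a minimal sketch of that idea: take_chunks is a hypothetical helper (not part of the answer), and clamping negative sizes to zero is an assumption made here to avoid the ValueError.

```python
from itertools import islice


def take_chunks(items, sizes):
    """Split items into consecutive chunks of the given sizes (hypothetical helper).

    Returns (chunks, leftover), where leftover holds any items not consumed.
    """
    it = iter(items)  # one shared iterator: each islice continues where the previous one stopped
    # Assumption: sizes are truncated to int and negative sizes are clamped to 0
    chunks = [list(islice(it, max(0, int(size)))) for size in sizes]
    leftover = list(it)  # whatever the requested sizes did not cover
    return chunks, leftover
```

The sizes could come from np.random.normal as in the answer, but randomly sampled sizes will not in general add up to len(testList), which is why the leftover items are returned separately.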

CodePudding user response:

For each index i, take the CDF at i + 0.5 and subtract the CDF at i - 0.5. That difference is the fraction of the list that belongs in chunk i. For the first index, use just the CDF at i + 0.5 without subtracting the CDF at i - 0.5; for the last index, subtract the CDF at i - 0.5 from 1 instead of from the CDF at i + 0.5. Set the mean to the middle of your indices, and choose the standard deviation according to how spread out you want the chunks to be (about one fourth of the number of indices is a reasonable starting point, but it's up to you).
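
The CDF approach described above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: the function names, the default of 12 chunks, and the use of the standard library's statistics.NormalDist (Python 3.8+) are assumptions made for illustration. Rounding the cumulative cut points, rather than each chunk size separately, is a choice made here so that the sizes always add up to the length of the list.

```python
from statistics import NormalDist


def normal_chunk_sizes(n_items, n_chunks=12, sigma=None):
    """Return n_chunks integer sizes that sum to n_items and follow a bell curve."""
    mu = (n_chunks - 1) / 2          # middle of the indices 0 .. n_chunks - 1
    if sigma is None:
        sigma = n_chunks / 4         # assumption: the "one fourth" starting point
    dist = NormalDist(mu, sigma)
    # Cumulative fraction at each chunk boundary; the outermost boundaries are
    # fixed at 0 and 1 so the first and last chunks absorb the tails.
    edges = [0.0] + [dist.cdf(i + 0.5) for i in range(n_chunks - 1)] + [1.0]
    # Round the cumulative cut points so the sizes sum to exactly n_items.
    cuts = [round(edge * n_items) for edge in edges]
    return [b - a for a, b in zip(cuts, cuts[1:])]


def divide_list_normal(lst, n_chunks=12, sigma=None):
    """Split lst into n_chunks consecutive chunks, returned as [{index: chunk}, ...]."""
    sizes = normal_chunk_sizes(len(lst), n_chunks, sigma)
    chunks, start = [], 0
    for i, size in enumerate(sizes):
        chunks.append({i: lst[start:start + size]})
        start += size
    return chunks
```

Calling divide_list_normal(ids) on the list of ids would give a list of single-key dicts in the shape asked for in the question; a smaller sigma concentrates more items in the middle chunks, and very short lists may leave the edge chunks empty.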
