Splitting an array of arbitrary size elements-CodePudding

I have a list of sentences which I read off an article, and I wish to group them in evenly sized (as much as possible) buckets of max N characters, with the extra complication of keeping the buckets roughly the same size, and the sentences unsplit.

I am using python and the naive approach I have in mind is to:

Iterate over the sentences and fill each bucket until it's full.
Go over the buckets iteratively and try to even them.

The second part is what I'm afraid isn't going to be as straightforward, and I wonder if there's a quick/smart way to achieve that?

Full example data would be too large here, but in a simpler form: [[Sentence 1 with 30 words], [Sentence 1 with 3words][[Sentence 1 with 15 words][[Sentence 1 with 10 words], ...]

A bucket would be [sent1, sent 2...] of total character length the sum of the sentences.

And I would like to group into X buckets, were total sentences length in each bucket will be as much as evenly spread as possible, and must not pass a certain thredhold.

CodePudding user response：

So say you have to fit a large number of items (let's say 327 items) into a max bucket size (let's say 12 items per bucket) right? You want every bucket size to be the same length after filling but bucket length not to exceed max bucket size

Get the greatest common denominator of the max bucket size and the number of items: Then that's how many items you can fit in each bucket:

from math import gcd
from random import shuffle

no_items = 327
max_bucket_size = 12

items_per_bucket = gcd(len(items), max_bucket_size)

list(range(327))
shuffle(items)

print(list((zip(*[iter(items)]*items_per_bucket))))

[(38, 273, 165),
 (212, 259, 197),
 (4, 181, 43),
 (162, 137, 23),
 (42, 146, 315),
 (99, 240, 292),
 (96, 182, 89),
 (102, 164, 73),
 (153, 248, 5),
 (84, 48, 81),
 (148, 214, 321),
 (130, 44, 40),
 (210, 312, 202),
 (127, 126, 31),
 (83, 9, 7),
 (284, 57, 194),
 (303, 109, 95),
 (223, 238, 227),
 (104, 10, 94),
 (78, 155, 224),
 (125, 271, 279),
 (123, 189, 135),
 (249, 290, 243),
 (141, 139, 119),
 (59, 311, 299),
 (186, 289, 306),
 (174, 241, 278),
 (75, 11, 286),
 (62, 228, 250),
 (169, 16, 314),
 (281, 63, 263),
 (265, 132, 108),
 (287, 129, 138),
 (177, 103, 142),
 (22, 230, 173),
 (203, 319, 239),
 (72, 180, 64),
 (47, 205, 0),
 (178, 255, 234),
 (274, 71, 117),
 (305, 316, 6),
 (301, 80, 118),
 (200, 86, 45),
 (163, 258, 147),
 (27, 232, 209),
 (183, 17, 157),
 (56, 302, 121),
 (156, 225, 3),
 (26, 282, 2),
 (150, 300, 53),
 (76, 207, 215),
 (152, 206, 74),
 (198, 28, 247),
 (51, 87, 12),
 (98, 19, 20),
 (172, 116, 298),
 (270, 195, 41),
 (188, 18, 256),
 (233, 226, 187),
 (134, 262, 309),
 (179, 190, 237),
 (317, 113, 69),
 (313, 34, 199),
 (35, 253, 159),
 (269, 308, 213),
 (295, 168, 285),
 (176, 54, 46),
 (242, 140, 100),
 (288, 191, 244),
 (219, 280, 193),
 (260, 307, 235),
 (320, 196, 264),
 (115, 277, 60),
 (33, 110, 158),
 (55, 24, 166),
 (236, 204, 293),
 (88, 120, 101),
 (32, 322, 149),
 (296, 97, 68),
 (79, 52, 14),
 (15, 29, 222),
 (58, 8, 201),
 (128, 272, 275),
 (112, 208, 283),
 (246, 220, 254),
 (111, 325, 324),
 (114, 304, 124),
 (131, 92, 39),
 (21, 1, 67),
 (323, 184, 245),
 (160, 171, 151),
 (77, 192, 268),
 (266, 136, 133),
 (291, 65, 30),
 (252, 276, 37),
 (49, 70, 50),
 (267, 13, 217),
 (170, 82, 211),
 (221, 216, 144),
 (310, 93, 25),
 (326, 106, 218),
 (318, 66, 294),
 (167, 154, 36),
 (185, 161, 107),
 (143, 145, 261),
 (61, 231, 297),
 (85, 257, 229),
 (175, 105, 90),
 (122, 251, 91)]
>

CodePudding user response：

I havent't considered the space-separated input & bucket contain sentences .. You can replicate the same approach on you given example .. you can't evenly distribute a word without division eg: sentence :"ab here" in this case "here" must be divided into two buckets.

import math
def getDivisors(n) :
    factors = []
    i = 1
    while i <= math.sqrt(n):
        if (n % i == 0) :
            if (n / i == i) :
                factors.append(i)
            else :
                factors.append(i)
                factors.append(n/i)
        i = i   1
    return factors
    
def binarySearch(nums, target , size):
    left, right = 0, len(nums) - 1
    prev = 0
    while left <= right:
        mid = (left   right) // 2
        if target == math.ceil(size/nums[mid]):
            return mid
        elif target < math.ceil(size/nums[mid]):
            prev = mid
            right = mid - 1
        else:
            prev = mid
            left = mid   1
    return prev
maxCharInBucket = 4
myStr = "abcheroher"
size = len(myStr)
allFactors = getDivisors(size)
val = binarySearch(allFactors, maxCharInBucket  , size)
pnt = 1
if(val<0):
    val = 0
pnt = max(pnt,allFactors[val])
buckets = []
i = 0
while i < size:
    tmp = []
    j = i
    for j in range(i,i pnt):
        tmp.append(myStr[j])
    
    i = i pnt
    buckets.append(tmp)
print(buckets)

Output: [['a', 'b'], ['c', 'h'], ['e', 'r'], ['o', 'h'], ['e', 'r']]