Recreate List based on statistics-CodePudding

I am given the following statistics of an array:

length
Minimum
Maximum
Average
Median
Quartiles

I am supposed to recreate a list with more or less the same statistics. I know that the list for which the statistics were calculated is not normally distributed.

My first idea was to just brute-force it by creating a list of random numbers in the given range and hope that one would fit. The benefit of this method is it works. While the downside obviously is the efficiency.

So I'm looking for a more efficient way to solve this problem. Hope that someone can help...

P.S. Currently I only use numpy but I'm not limited to it.

Edit 1: As an example input and output was requested: A input could look as follows:

statistics = {
'length' : 200,
'minimum_value' : 5, 
'maximum_vlaue': 132, 
'mean': 30, 
'median' : 22,
'Q1': 13, 
'Q3': 68
}

The desired output would than look like this:

similar_list = function_to_create_similar_list(statistics)
len(similar_list) # should be roughly 200
min(similar_list) # should be roughly 5
max(similar_list) # should be roughly 132
np.mean(similar_list) # should be roughly 30
np.median(similar_list) # should be roughly 22
np.quantile(similar_list, 0.25) # should be roughly 13
np.quantile(similar_list, 0.75) # should be roughly 68

function_to_create_similar_list is the function I want to define

Edit 2.

My first edit was not enough I'm sorry for that. Here is my current code:


def get_statistics(input_list): 
    output = {}
    output['length'] = len(input_list) 
    output['minimum_value'] = min(input_list) 
    output['maximum_value'] = max(input_list) 
    output['mean'] = np.mean(input_list) 
    output['median'] = np.median(input_list) 
    output['q1'] = np.quantile(input_list, 0.25)
    output['q3'] = np.quantile(input_list, 0.75)
    return output

def recreate_similar_list(statistics, maximum_deviation = 0.1 ): 
    sufficient_list_was_found = False
    while True: 
        candidate_list = [random.uniform(statistics['minimum_value'],statistics['maximum_value']) for _ in range(statistics['length'])]
        candidate_statistics = get_statistics(candidate_list)
        sufficient_list_was_found = True
        for key in statistics.keys(): 
            if np.abs(statistics[key] - candidate_statistics[key]) / statistics[key] > maximum_deviation:
                sufficient_list_was_found = False
                break
        if(sufficient_list_was_found): 
            return candidate_list


example_input_list_1 = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,10]
recreated_list_1 = recreate_similar_list(get_statistics(example_input_list_1),0.3)
print(recreated_list_1)
print(get_statistics(recreated_list_1))

example_input_list_2 = [1,1,1,1,3,3,4,4,4,4,4,5,18,19,32,35,35,42,49,68]
recreated_list_2 = recreate_similar_list(get_statistics(example_input_list_2),0.3)
print(recreated_list_2)
print(get_statistics(recreated_list_2))

The first example can find a solution. That was no surprise to me. The second one does not (or not in sufficient time). That also did not surprise me as the lists generated in the recreate_similar_list function are uniformly distributed. Though both examples represent the real task. (Keep in mind that I only get the statistics not the list)

I hope this is now a sufficient example

CodePudding user response：

Your existing solution is interesting, but effectively a bogo-solution. There are direct solutions possible that do not need to rely on random-and-check.

The easy-ish part is to create the array of a correct length, and place all five min/max/quartiles in their appropriate positions (this only works for a somewhat simple interpretation of the problem and has limitations).

The trickier part is to choose "fill values" between the quartiles. These fill values can be identical within one interquartile section, because the only things that matter are the sum and bounds. One fairly straightforward way is linear programming, via Scipy's scipy.optimize.linprog. It's typically used for bounded linear algebra problems and this is one. For parameters we use:

Zeros for c, the minimization coefficients, because we don't care about minimization
For A_eq, the equality constraint matrix, we pass a matrix of element counts. This is a length-4 matrix because there are four interquartile sections, each potentially with a slightly different element count. In your example these will each be close to 50.
For B_eq, the equality constraint right-hand side vector, we calculate the desired sum of all interquartile sections based on the desired mean.
For bounds we pass the bounds of each interquartile section.

One tricky aspect is that this assumes easily-divided sections, and a quantile calculation using the lower method. But there are at least thirteen methods! Some will be more difficult to target with an algorithm than others. Also, lower introduces statistical bias. I leave solving these edge cases as an exercise to the reader. But the example works:

import numpy as np
from scipy.optimize import linprog


def solve(length: int, mean: float,
          minimum_value: float, q1: float, median: float, q3: float,
          maximum_value: float) -> np.ndarray:
    sections = (np.arange(5)*(length - 1))//4
    sizes = np.diff(sections) - 1
    quartiles = np.array((minimum_value, q1, median, q3, maximum_value))

    # (quartiles   sizes@x)/length = mean
    # sizes@x = mean*length - quartiles
    result = linprog(c=np.zeros_like(sizes),
                     A_eq=sizes[np.newaxis, :],
                     b_eq=np.array((mean*length - quartiles.sum(),)),
                     bounds=np.stack((quartiles[:-1], quartiles[1:]), axis=1),
                     method='highs')
    if not result.success:
        raise ValueError(result.message)

    x = np.empty(length)
    x[sections] = quartiles
    for i, inner in enumerate(result.x):
        i0, i1 = sections[i: i 2]
        x[i0 1: i1] = inner
    return x


def summarise(x: np.ndarray) -> dict[str, float]:
    q0, q1, q2, q3, q4 = np.quantile(
        a=x, q=np.linspace(0, 1, num=5), method='lower')
    return {'length': len(x), 'mean': x.mean(),
            'minimum_value': q0, 'q1': q1, 'median': q2, 'q3': q3, 'maximum_value': q4}


def test() -> None:
    statistics = {'length': 200, 'mean': 30,  # 27.7 - 58.7 are solvable
                  'minimum_value': 5, 'q1': 13, 'median': 22, 'q3': 68, 'maximum_value': 132}
    x = solve(**statistics)
    for k, v in summarise(x).items():
        assert np.isclose(v, statistics[k])


if __name__ == '__main__':
    test()