I am given the following statistics of an array:
- length
- Minimum
- Maximum
- Average
- Median
- Quartiles
I am supposed to recreate a list with more or less the same statistics. I know that the list for which the statistics were calculated is not normally distributed.
My first idea was to just brute-force it by creating a list of random numbers in the given range and hope that one would fit. The benefit of this method is it works. While the downside obviously is the efficiency.
So I'm looking for a more efficient way to solve this problem. Hope that someone can help...
P.S. Currently I only use numpy but I'm not limited to it.
Edit 1: As an example input and output was requested: A input could look as follows:
statistics = {
'length' : 200,
'minimum_value' : 5,
'maximum_vlaue': 132,
'mean': 30,
'median' : 22,
'Q1': 13,
'Q3': 68
}
The desired output would than look like this:
similar_list = function_to_create_similar_list(statistics)
len(similar_list) # should be roughly 200
min(similar_list) # should be roughly 5
max(similar_list) # should be roughly 132
np.mean(similar_list) # should be roughly 30
np.median(similar_list) # should be roughly 22
np.quantile(similar_list, 0.25) # should be roughly 13
np.quantile(similar_list, 0.75) # should be roughly 68
function_to_create_similar_list is the function I want to define
Edit 2.
My first edit was not enough I'm sorry for that. Here is my current code:
def get_statistics(input_list):
output = {}
output['length'] = len(input_list)
output['minimum_value'] = min(input_list)
output['maximum_value'] = max(input_list)
output['mean'] = np.mean(input_list)
output['median'] = np.median(input_list)
output['q1'] = np.quantile(input_list, 0.25)
output['q3'] = np.quantile(input_list, 0.75)
return output
def recreate_similar_list(statistics, maximum_deviation = 0.1 ):
sufficient_list_was_found = False
while True:
candidate_list = [random.uniform(statistics['minimum_value'],statistics['maximum_value']) for _ in range(statistics['length'])]
candidate_statistics = get_statistics(candidate_list)
sufficient_list_was_found = True
for key in statistics.keys():
if np.abs(statistics[key] - candidate_statistics[key]) / statistics[key] > maximum_deviation:
sufficient_list_was_found = False
break
if(sufficient_list_was_found):
return candidate_list
example_input_list_1 = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,10]
recreated_list_1 = recreate_similar_list(get_statistics(example_input_list_1),0.3)
print(recreated_list_1)
print(get_statistics(recreated_list_1))
example_input_list_2 = [1,1,1,1,3,3,4,4,4,4,4,5,18,19,32,35,35,42,49,68]
recreated_list_2 = recreate_similar_list(get_statistics(example_input_list_2),0.3)
print(recreated_list_2)
print(get_statistics(recreated_list_2))
The first example can find a solution. That was no surprise to me. The second one does not (or not in sufficient time). That also did not surprise me as the lists generated in the recreate_similar_list
function are uniformly distributed. Though both examples represent the real task. (Keep in mind that I only get the statistics not the list)
I hope this is now a sufficient example
CodePudding user response:
Your existing solution is interesting, but effectively a bogo-solution. There are direct solutions possible that do not need to rely on random-and-check.
The easy-ish part is to create the array of a correct length, and place all five min/max/quartiles in their appropriate positions (this only works for a somewhat simple interpretation of the problem and has limitations).
The trickier part is to choose "fill values" between the quartiles. These fill values can be identical within one interquartile section, because the only things that matter are the sum and bounds. One fairly straightforward way is linear programming, via Scipy's scipy.optimize.linprog. It's typically used for bounded linear algebra problems and this is one. For parameters we use:
- Zeros for
c
, the minimization coefficients, because we don't care about minimization - For
A_eq
, the equality constraint matrix, we pass a matrix of element counts. This is a length-4 matrix because there are four interquartile sections, each potentially with a slightly different element count. In your example these will each be close to 50. - For
B_eq
, the equality constraint right-hand side vector, we calculate the desired sum of all interquartile sections based on the desired mean. - For
bounds
we pass the bounds of each interquartile section.
One tricky aspect is that this assumes easily-divided sections, and a quantile calculation using the lower
method. But there are at least thirteen methods! Some will be more difficult to target with an algorithm than others. Also, lower
introduces statistical bias. I leave solving these edge cases as an exercise to the reader. But the example works:
import numpy as np
from scipy.optimize import linprog
def solve(length: int, mean: float,
minimum_value: float, q1: float, median: float, q3: float,
maximum_value: float) -> np.ndarray:
sections = (np.arange(5)*(length - 1))//4
sizes = np.diff(sections) - 1
quartiles = np.array((minimum_value, q1, median, q3, maximum_value))
# (quartiles sizes@x)/length = mean
# sizes@x = mean*length - quartiles
result = linprog(c=np.zeros_like(sizes),
A_eq=sizes[np.newaxis, :],
b_eq=np.array((mean*length - quartiles.sum(),)),
bounds=np.stack((quartiles[:-1], quartiles[1:]), axis=1),
method='highs')
if not result.success:
raise ValueError(result.message)
x = np.empty(length)
x[sections] = quartiles
for i, inner in enumerate(result.x):
i0, i1 = sections[i: i 2]
x[i0 1: i1] = inner
return x
def summarise(x: np.ndarray) -> dict[str, float]:
q0, q1, q2, q3, q4 = np.quantile(
a=x, q=np.linspace(0, 1, num=5), method='lower')
return {'length': len(x), 'mean': x.mean(),
'minimum_value': q0, 'q1': q1, 'median': q2, 'q3': q3, 'maximum_value': q4}
def test() -> None:
statistics = {'length': 200, 'mean': 30, # 27.7 - 58.7 are solvable
'minimum_value': 5, 'q1': 13, 'median': 22, 'q3': 68, 'maximum_value': 132}
x = solve(**statistics)
for k, v in summarise(x).items():
assert np.isclose(v, statistics[k])
if __name__ == '__main__':
test()