What is the most efficient way to bootstrap the mean of a list of numbers?

Time:10-19

I have a list of numbers (floats) and would like to estimate its mean, along with the variability of that estimate. My goal is to resample the list 100 times; the output would be an array of length 100, each element being the mean of one resampled list.

Here is a minimal working example of what I would like to achieve:

import numpy as np
data = np.linspace(0, 4, 5)
ndata, boot = len(data), 100
output = np.mean(np.array([data[k] for k in np.random.uniform(high=ndata, size=boot*ndata).astype(int)]).reshape((boot, ndata)), axis=1)

This is, however, quite slow when I have to repeat it for many lists, each with a large number of elements. The method also seems clunky and un-Pythonic. What would be a better way to achieve my goal?

P.S. I am aware of scipy.stats.bootstrap, but I am having trouble upgrading scipy to 1.7.1 in Anaconda in order to import it.

CodePudding user response:

Use np.random.choice:

import numpy as np

data = np.linspace(0, 4, 5)
ndata, boot = len(data), 100
output = np.mean(
    np.random.choice(data, size=(boot, ndata)),
    axis=1)

If I understood correctly, this expression (in your question's code):

np.array([data[k] for k in np.random.uniform(high=ndata, size=boot*ndata).astype(int)]).reshape((boot, ndata))

performs sampling with replacement, which is exactly what np.random.choice does by default.
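
To make the equivalence concrete, here is a minimal sketch of the same resampling written as explicit index draws. It assumes NumPy 1.17 or later for np.random.default_rng; the seed and the names rng, idx and resamples are illustrative only and do not come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

data = np.linspace(0, 4, 5)
ndata, boot = len(data), 100

# Draw integer indices in [0, ndata). Repeats are allowed, so this is
# sampling with replacement.
idx = rng.integers(0, ndata, size=(boot, ndata))

# Fancy indexing builds the (boot, ndata) array of resamples in one step,
# without a Python-level loop.
resamples = data[idx]

# One mean per resampled list.
output = resamples.mean(axis=1)
```

Each row of resamples is one bootstrap replicate of data, so output has length boot.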

Here are some timings for reference:

%timeit np.mean(np.array([data[k] for k in np.random.uniform(high=ndata, size=boot*ndata).astype(int)]).reshape((boot, ndata)), axis=1)
133 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.mean(np.random.choice(data, size=(boot, ndata)),axis=1)
41.1 µs ± 538 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As can be seen, np.random.choice is roughly 3x faster (41.1 µs vs. 133 µs per loop).
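
Since the question mentions repeating this for many lists and estimating the variability of the mean, the following is one possible sketch rather than a definitive implementation. The helper name bootstrap_means, the example lists and the seed are hypothetical. The standard deviation of the bootstrap means is used here as an estimate of the standard error of the mean.

```python
import numpy as np


def bootstrap_means(data, boot=100, rng=None):
    """Return `boot` means of resamples (with replacement) of `data`.

    Hypothetical helper, not part of NumPy or SciPy.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    # Each row is one resample of the same length as the input.
    samples = rng.choice(data, size=(boot, data.size), replace=True)
    return samples.mean(axis=1)


rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Illustrative inputs; substitute the real lists here.
lists = [np.linspace(0, 4, 5), np.linspace(0, 1, 1000)]

for values in lists:
    means = bootstrap_means(values, boot=100, rng=rng)
    # The spread of the bootstrap means estimates the standard error.
    print(means.mean(), means.std(ddof=1))
```

Note that the intermediate array has boot * len(data) elements, so for very long lists it may be worth generating the resamples in smaller batches.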
