Home > database >  Define a random variable
Define a random variable

Time:11-22

I'm new with Python. I have to create a new rv U(t). Assuming that Z has a standard normal distribution c = 1.57, I have that:

U(t) = 0 if Z(t) <= c
U(t) = Φ(Z(t)) Z(t) > c

Where Φ(·) is the cdf of the standard normal distribution N(0, 1).

I start sampling random numbers from the normal distribution and I create an array of zeros:

z = np.random.normal(0, 1, 100)
u = np.zeros([1, 100])

Then, I write the following loop:

for i in list(range(1, 101, 1)):
    if z[:i] < c:
        u[:i] = 0
    else z[:i] > c:
        u[:i] = norm.cdf(z[:i], loc=0, scale=1) 

However, there is something wrong. I got this error:

File "<ipython-input-837-1cbdd0641a75>", line 4
    else z[:i] > c:
         ^
SyntaxError: invalid syntax

Can someone please help me find out where the error is? Or suggest another way to deal with the problem?

Thank you!

CodePudding user response:

Use the power of numpy!

z = np.random.normal(0, 1, 100)
u = np.zeros(z.shape)

Since you initialized u to zeros, you don't need to do anything for the z <= c cases. For the others, you can use numpy's logical indexing to only set the elements that fulfill the condition

# Get only the elements of z where z > c
z_filt = z[z > c]

# Calculate the norm.cdf at these values of z, 
norm_vals = norm.cdf(z_filt, loc=0, scale=1)

# Assign those to the elements of u where z > c
u[z > c] = norm_vals

Of course, you can condense these three lines down to a single line:

u[z > c] = norm.cdf(z[z > c], loc=0, scale=1)

This approach will be significantly faster than iterating over the arrays and setting individual elements.


If you're curious why your code didn't work and how to fix it,

  • You don't need to list out a range to iterate on it. Just for i in range(100) is good enough
  • z[:i] < c will give an array containing i boolean values. Putting an if condition on that will give you an error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). I suspect you meant to do z[i] < c
  • u[:i] = 0 will set all elements of u from the start to the ith index to zero. You probably only wanted u[i] = 0. This is really not even necessary since you already initialized u to be zeros
  • else <condition> doesn't work. You want elif <condition>. Although you don't really need that here, since you could just do if z[i] > c and that's the only condition you need.
for i in range(100):
    if z[i] > c:
        u[i] = norm.cdf(z[i], loc=0, scale=1) 

Comparing the runtimes of these two approaches:

import timeit
import numpy as np
from scipy.stats import norm
from matplotlib import pyplot as plt

def f_numpy(size):
    z = np.random.normal(0, 1, size)
    u = np.zeros(z.shape)
    u[z > c] = norm.cdf(z[z > c], loc=0, scale=1)
    return u

def f_loopy(size):
    z = np.random.normal(0, 1, size)
    u = np.zeros(z.shape)
    for i in range(size):
        if z[i] > c:
            u[i] = norm.cdf(z[i], loc=0, scale=1)    
    return u

c = 0

sizes = [10, 100, 1000, 10_000, 100_000]
times = np.zeros((len(sizes), 2))
for i, s in enumerate(sizes):
    times[i, 0] = timeit.timeit('f_numpy(s)', globals=globals(), number=100) / 100
    print(">")
    times[i, 1] = timeit.timeit('f_loopy(s)', globals=globals(), number=10) / 10
    print(".")
    
fig, ax = plt.subplots()
ax.plot(sizes, times[:, 0], label="numpy")
ax.plot(sizes, times[:, 1], label="loopy")
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Array size')
ax.set_ylabel('Time per function call (s)')
ax.legend()
fig.tight_layout()

we get the following plot, which shows that the numpy approach is orders of magnitude faster than the loopy approach. Runtime comparison -- numpy vs. loopy

  • Related