res_M = minimize(L_M, x0=x_M, args=(data, w_vector),
                 method='L-BFGS-B', bounds=[(0.001, 1), (0.001, 1), (0.001, 1)])

def L_M(x, data, w_vector):
    sum = 0
    for i in range(len(data)):
        sum += w_vector[i]*(data[i][0]*np.log(x[0]) + data[i][1]*np.log(x[1]) + data[i][2]*np.log(x[2]))
    return -1*sum
As part of an Expectation-Maximization (EM) algorithm, I am calling SciPy's optimize.minimize function in the M-step. x_M holds three values between 0 and 1, initially all 0.5. The w_vector values are calculated in the E-step and form a 1D NumPy array with the same length as the data set, containing floats between 0 and 1. Each row of the data set consists of three integer feature values between 0 and 3, for example [1 0 2].
The for loop in the objective function is slowing things down. I want to optimize it using vectorized calculations instead. I have tried the following, but it changes the result:
def L_M(x, data, w_vector):
    length = len(data)
    a_i = data[np.arange(length)][0].sum()
    f_i = data[np.arange(length)][1].sum()
    l_i = data[np.arange(length)][2].sum()
    sum = (w_vector[np.arange(length)].sum())*(a_i*np.log(x[0]) + f_i*np.log(x[1]) + l_i*np.log(x[2]))
    return -1*sum
The minimize function is called many times, and I hope to test it on some very large data sets, so any ideas on how to rewrite it would be much appreciated.
CodePudding user response:
You should convert all your arrays into NumPy arrays and then this can be achieved as follows:
import numpy as np

data = np.array([[1, 0, 2], [2, 1, 0]])
w_vector = np.array([0, 1])

def L_M(x: np.ndarray, data: np.ndarray, w_vector: np.ndarray):
    result = np.sum(w_vector * np.sum(data*np.log(x), axis=1))
    return -result
This part of the code, (data[i][0]*np.log(x[0]) + data[i][1]*np.log(x[1]) + data[i][2]*np.log(x[2])), where you multiply each element of data at the i-th position by the log of the corresponding element of x and sum the three products, is replaced by np.sum(data*np.log(x), axis=1). Here np.log(x) is broadcast across the rows of data, the multiplication is element-wise (both are np.array objects), and the sum is taken row-wise, so the result is a 1D array with one value per row.

Afterward, this array is multiplied by w_vector (both are np.array objects of the same length, so element-wise multiplication is possible).

Finally, the sum of the resulting array is taken and saved into result. For optimization, pass x_M also as a NumPy array:
from scipy.optimize import minimize
x_M = np.array([0.5, 0.5, 0.5])
res_M = minimize(L_M, x0=x_M, args=(data, w_vector),
                 method='L-BFGS-B', bounds=[(0.001, 1), (0.001, 1), (0.001, 1)])
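Going one step further, the double sum can be rearranged so that data is only touched once per M-step: the sum over i of w_vector[i] times the sum over j of data[i, j]*log(x[j]) equals (w_vector @ data) @ log(x). The following is a minimal sketch of that idea, not part of the answer above. The names counts, L_M_counts and L_M_counts_grad are hypothetical, and the weights are illustrative values. It also supplies the analytic gradient through the jac argument of minimize, which is optional.

```python
import numpy as np
from scipy.optimize import minimize

data = np.array([[1, 0, 2], [2, 1, 0]])
w_vector = np.array([0.25, 0.75])  # illustrative E-step weights

# Weighted column totals: counts[j] = sum_i w_vector[i] * data[i, j].
# They do not depend on x, so they can be computed once per M-step.
counts = w_vector @ data

def L_M_counts(x, counts):
    # Same value as summing w_i * (data_i . log(x)) over all rows, negated.
    return -np.dot(counts, np.log(x))

def L_M_counts_grad(x, counts):
    # Analytic gradient of L_M_counts with respect to x.
    return -counts / x

x_M = np.array([0.5, 0.5, 0.5])
res_M = minimize(L_M_counts, x0=x_M, args=(counts,), jac=L_M_counts_grad,
                 method='L-BFGS-B', bounds=[(0.001, 1)] * 3)
```

With this form, each objective evaluation works on three numbers instead of the whole data set, which should help when minimize calls the function many times.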
P.S.: Avoid using variable names like sum, as it is already a Python built-in function, and shadowing it is not good practice IMHO.
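As for why the attempt in the question changes the result, two things appear to be going on. First, data[np.arange(length)] is just a copy of data, so indexing it with [0] selects the first row and not the first column. Second, multiplying the sum of the weights by the sum of the features is not the same as summing the weighted features row by row. Below is a small sanity check, using made-up values for data, w_vector and x, that compares the loop version with the vectorized one; the function names L_M_loop and L_M_vec are only for illustration.

```python
import numpy as np

# Made-up example values, only for checking the two implementations.
data = np.array([[1, 0, 2], [2, 1, 0], [0, 3, 1]])
w_vector = np.array([0.2, 0.5, 0.9])
x = np.array([0.5, 0.3, 0.8])

def L_M_loop(x, data, w_vector):
    # Row-by-row accumulation, as in the question.
    total = 0.0
    for i in range(len(data)):
        total += w_vector[i] * (data[i][0] * np.log(x[0])
                                + data[i][1] * np.log(x[1])
                                + data[i][2] * np.log(x[2]))
    return -total

def L_M_vec(x, data, w_vector):
    # Vectorized form: row-wise sums, weighted, then summed.
    return -np.sum(w_vector * np.sum(data * np.log(x), axis=1))

n = len(data)
first_row = data[np.arange(n)][0]  # picks row 0, not column 0
first_col = data[:, 0]             # column 0 needs this indexing instead

loop_value = L_M_loop(x, data, w_vector)
vec_value = L_M_vec(x, data, w_vector)
print(np.isclose(loop_value, vec_value))
```

Running a check like this on a small slice of the real data before switching implementations is a cheap way to confirm that the rewrite did not alter the objective.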