I want to implement gradient descent with numpy for linear regression, but I get an error in this code:
import numpy as np
# Code Example
rng = np.random.RandomState(10)
X = 10*rng.rand(1000, 5) # feature matrix
y = 0.9 + np.dot(X, [2.2, 4, -4, 1, 2]) # target vector
# GD implementation for linear regression
def GD(X, y, eta=0.1, n_iter=20):
    theta = np.zeros((X.shape[0], X.shape[1]))
    for i in range(n_iter):
        grad = 2 * np.mean((np.dot(theta.T, X) - y) * X)
        theta = theta - eta * grad
    return theta
# SGD implementation for linear regression
def SGD(X, y, eta=0.1, n_iter=20):
    theta = np.zeros(1, X.shape[1])
    for i in range(n_iter):
        for j in range(X.shape[0]):
            grad = 2 * np.mean((np.dot(theta.T, X[j,:]) - y[j]) * X[j,:])
            theta = theta - eta * grad
    return theta
# MSE loss for linear regression with numpy
def MSE(X, y, theta):
    return np.mean((X.dot(theta.T) - y)**2)
# linear regression with GD and MSE with numpy
theta_gd = GD(X, y)
theta_sgd = SGD(X, y)
print('MSE with GD: ', MSE(X, y, theta_gd))
print('MSE with SGD: ', MSE(X, y, theta_sgd))
The error is
grad = 2 * np.mean((np.dot(theta.T, X) - y) * X)
ValueError: operands could not be broadcast together with shapes (5,5) (1000,)
and I can't solve it.
CodePudding user response:
Minor changes to your code that resolve the dimensionality issues during matrix multiplication make it run successfully. In particular, note that a linear regression on a design matrix X of dimension N x k has a parameter vector theta of size k.
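For reference, with those shapes the full-batch MSE gradient can be written in a single vectorized expression. This is only an illustrative sketch of the shape bookkeeping (the helper name mse_gradient is made up for this example), not the code used below:
# Illustrative sketch: vectorized MSE gradient for theta of shape (k,)
def mse_gradient(X, y, theta):
    residuals = X @ theta - y               # shape (N,)
    return 2 * X.T @ residuals / len(X)     # shape (k,)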
In addition, I'd suggest some changes to SGD() that make it a proper stochastic gradient descent: evaluate the gradient over random subsets of the data, realized by randomly shuffling the index set of the training data with np.random.shuffle() and looping through it in chunks. The batch_size determines the size of each subset after which the parameter estimate is updated, and the seed argument ensures reproducibility.
# GD implementation for linear regression
def GD(X, y, eta=0.001, n_iter=100):
    theta = np.zeros(X.shape[1])
    for i in range(n_iter):
        for j in range(X.shape[0]):
            grad = (2 * np.mean(X[j,:] @ theta - y[j]) * X[j,:]) # changed line
            theta -= eta * grad
    return theta
# SGD implementation for linear regression
def SGD(X, y, eta=0.001, n_iter=1000, batch_size=25, seed=7678):
    theta = np.zeros(X.shape[1])
    indexSet = list(range(len(X)))
    np.random.seed(seed)
    for i in range(n_iter):
        np.random.shuffle(indexSet) # random shuffle of index set
        for j in range(round(len(X) / batch_size) + 1):
            X_sub = X[indexSet[j*batch_size:(j+1)*batch_size],:]
            y_sub = y[indexSet[j*batch_size:(j+1)*batch_size]]
            if(len(X_sub) > 0):
                grad = (2 * np.mean(X_sub @ theta - y_sub) * X_sub) # changed line
                theta -= eta * np.mean(grad, axis=0)
    return theta
Running the code, I get
print('MSE with GD : ', MSE(X, y, theta_gd))
print('MSE with SGD: ', MSE(X, y, theta_sgd))
> MSE with GD : 0.07602
MSE with SGD: 0.05762
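For comparison, one could also fit the closed-form least-squares solution with np.linalg.lstsq, appending a column of ones to recover the intercept of 0.9; this check is a sketch added here, not part of the original answer:
# Sanity-check sketch: closed-form least squares for comparison
X_design = np.hstack([X, np.ones((len(X), 1))])   # ones column for the intercept
theta_exact, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print('exact weights  :', theta_exact[:-1])
print('exact intercept:', theta_exact[-1])
print('MSE, exact fit :', np.mean((X_design @ theta_exact - y)**2))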
CodePudding user response:
Each observation has 5 features, and X contains 1000 observations:
X = rng.rand(1000, 5) * 10 # X.shape == (1000, 5)
Create y which is perfectly linearly correlated with X (with no distortions):
real_weights = np.array([2.2, 4, -4, 1, 2]).reshape(-1, 1)
real_bias = 0.9
y = X @ real_weights + real_bias # y.shape == (1000, 1)
G.D. implementation for linear regression:
Note: w (weights) is your theta variable. I have also added the calculation of b (bias).
def GD(X, y, eta=0.1, n_iter=20):
    # Initialize weights and a bias (all zeros):
    w = np.zeros((X.shape[1], 1)) # w.shape == (5, 1)
    b = 0
    # Gradient descent
    for i in range(n_iter):
        errors = X @ w + b - y # errors.shape == (1000, 1)
        dw = 2 * np.mean(errors * X, axis=0).reshape(5, 1)
        db = 2 * np.mean(errors)
        w -= eta * dw
        b -= eta * db
    return w, b
Testing:
w, b = GD(X, y, eta=0.003, n_iter=5000)
print(w, b)
[[ 2.20464905]
[ 4.00510139]
[-3.99569374]
[ 1.00444026]
[ 2.00407476]] 0.7805448262466914
Notes:
- Your function SGD also contains some errors.
- I'm using the @ operator because it's just my preference over np.dot.
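For completeness, here is a minimal mini-batch SGD sketch written in the same style (weights w plus bias b). The shuffling scheme, the batch_size default, and the rng_local name are assumptions made for this sketch, not code from the original post:
# Sketch only: mini-batch SGD with weights and bias (assumed structure)
def SGD(X, y, eta=0.003, n_iter=200, batch_size=32, seed=0):
    rng_local = np.random.RandomState(seed)   # local RNG for reproducibility
    w = np.zeros((X.shape[1], 1))             # w.shape == (5, 1)
    b = 0.0
    n = len(X)
    for i in range(n_iter):
        idx = rng_local.permutation(n)        # reshuffle rows each pass
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            errors = X[batch] @ w + b - y[batch]   # shape (batch_size, 1)
            dw = 2 * np.mean(errors * X[batch], axis=0).reshape(-1, 1)
            db = 2 * np.mean(errors)
            w -= eta * dw
            b -= eta * db
    return w, b
Calling w, b = SGD(X, y) should then approach the same weights and bias as the GD version above.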