I am writing a program that uses NumPy to calculate the classification accuracy between testing and training points, but I am not sure how to use vectorized functions instead of the for loops I have used in my code.
Here is my code. (Is there a way to simplify it so that I do not need any loops?)
# import the NumPy package
import numpy as np
iris_train = np.genfromtxt("iris-training-data.csv", delimiter=',', usecols=(0, 1, 2, 3), dtype=float)
iris_test = np.genfromtxt("iris-testing-data.csv", delimiter=',', usecols=(0, 1, 2, 3), dtype=float)
train_cat = np.genfromtxt("iris-training-data.csv", delimiter=',', usecols=(4), dtype=str)
test_cat = np.genfromtxt("iris-testing-data.csv", delimiter=',', usecols=(4), dtype=str)
correct = 0
for i in range(len(iris_test)):
    n = 0
    old_distance = float('inf')
    while n < len(iris_train):
        # finding the difference between test and train point
        iris_diff = (abs(iris_test[i] - iris_train[n])**2)
        # summing up the calculated differences
        iris_sum = sum(iris_diff)
        new_distance = float(np.sqrt(iris_sum))
        # if statement to update distance
        if new_distance < old_distance:
            index = n
            old_distance = new_distance
        n += 1
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
accuracy = correct / float(len(iris_test)) * 100
print(f"Accuracy: {accuracy:.2f}%")
CodePudding user response:
The trick with computing the distances is to insert extra dimensions using numpy.newaxis and use broadcasting to compute a matrix with the distance from every testing sample to every training sample in one vectorized operation. Using numpy's broadcasting rules, diff has shape (num_test_samples, num_train_samples, num_features), and distance has shape (num_test_samples, num_train_samples), since we summed along the last axis in the call to numpy.sum.
Then you can use numpy.argmin to find the index of the closest training sample for every testing sample.
Finally, you can use this index to compute correct in a vectorized fashion by summing the number of True elements in a boolean array.
# Compute the distance from every training sample to every testing sample
# Note that `np.sqrt` is not necessary since sqrt is a monotonically
# increasing function -- removing it doesn't change the answer
diff = iris_test[:, np.newaxis] - iris_train[np.newaxis, :]
distance = np.sqrt(np.sum(np.square(diff), axis=-1))
# Compute the index of the closest training sample to the testing sample
index = np.argmin(distance, axis=-1)
# Check if class of the closest training sample matches the class
# of the testing sample
correct = (test_cat == train_cat[index]).sum()
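To reproduce the per-sample output and final accuracy of the original program, only a display loop is left (the computation itself stays vectorized). A minimal sketch, assuming the variables defined above (test_cat, train_cat, index, correct, iris_test):
# Print "row number, true class, predicted class" for every testing sample
for i, (true_label, predicted_label) in enumerate(zip(test_cat, train_cat[index]), start=1):
    print(i, true_label, predicted_label)

# Accuracy is the fraction of correct predictions, as a percentage
accuracy = correct / float(len(iris_test)) * 100
print(f"Accuracy: {accuracy:.2f}%")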
CodePudding user response:
If I understand correctly what you are doing (but I don't really need to, to answer the question), for each vector of iris_test you are searching for the closest one in iris_train, closest being here in the sense of Euclidean distance.
So you have three nested loops (pseudo-Python):
for u in iris_test:
    for v in iris_train:
        s = 0
        for i in range(dimensionOfVectors):
            s += (u[i] - v[i])**2
        dist = sqrt(s)
You are right to try to get rid of Python loops. And the most important one to get rid of is the inner one. You already got rid of this one, since the inner loop of my pseudo-code is, in your code, implicitly in:
iris_diff = (abs(iris_test[i] - iris_train[n])**2)
and
iris_sum = sum(iris_diff)
Both of those lines iterate through all dimensions of your vectors, but they do it not in Python but in internal NumPy code, so it is fast.
One may object that you don't really need abs after a **2, and that you could have called the np.linalg.norm function, which does all those operations in one call:
new_distance = np.linalg.norm(iris_test[i] - iris_train[n])
which is faster than your code. But at least, in your code, that loop over all components of the vectors is already vectorized.
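As a quick sanity check, here is a minimal sketch with two made-up 4-component points, showing that the one-call norm matches the manual abs/square/sum/sqrt formulation:
import numpy as np

u = np.array([5.1, 3.5, 1.4, 0.2])  # hypothetical testing point
v = np.array([4.9, 3.0, 1.4, 0.2])  # hypothetical training point

manual = float(np.sqrt(sum(abs(u - v)**2)))  # the question's formulation
one_call = float(np.linalg.norm(u - v))      # the same distance in one call
assert np.isclose(manual, one_call)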
The next stage is to vectorize the middle loop. That also can be accomplished. Instead of computing, one by one,
new_distance = np.linalg.norm(iris_test[i] - iris_train[n])
you could compute in one call all the len(iris_train) distances between iris_test[i] and all the iris_train[n]:
new_distances = np.linalg.norm(iris_test[i] - iris_train, axis=1)
The trick here lies in NumPy broadcasting and the axis parameter:
- Broadcasting means that you can compute the difference between a 1D, length-W vector and a 2D n×W array (iris_test[0] is a 1D vector, and iris_train is a 2D array whose number of columns is the same as the length of iris_test[0]). In such a case, NumPy broadcasts the first operand and returns a 2D n×W array as a result, whose row k is iris_test[0] - iris_train[k].
- Calling np.linalg.norm on that n×W 2D matrix would return a single float (the norm of the whole matrix), unless you restrict the norm to the 2nd axis (axis=1), in which case it returns n floats, each of them being the norm of one row.
In other words, after the previous line of code, new_distances[k] is the distance between iris_test[i] and iris_train[k].
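A minimal sketch of those shapes, using small made-up stand-ins for iris_test and iris_train (the sizes are arbitrary, just to make the broadcasting visible):
import numpy as np

# Hypothetical data, only the shapes matter here:
test_points = np.zeros((3, 4))   # 3 testing samples, 4 features (W = 4)
train_points = np.ones((5, 4))   # 5 training samples (n = 5)

diff = test_points[0] - train_points       # (4,) broadcast against (5, 4) -> (5, 4)
print(diff.shape)                          # (5, 4): row k is test_points[0] - train_points[k]
print(np.linalg.norm(diff))                # a single float: the norm of the whole matrix
print(np.linalg.norm(diff, axis=1).shape)  # (5,): one distance per training sample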
Once that is done, you can easily find the k for which this distance is the smallest, using np.argmin: np.argmin(new_distances) is the index of the smallest of the distances.
So, all together, your code could be rewritten as:
correct = 0
for i in range(len(iris_test)):
    new_distances = np.linalg.norm(iris_test[i] - iris_train, axis=1)
    index = np.argmin(new_distances)
    # printing out classifications
    print(i + 1, test_cat[i], train_cat[index])
    if test_cat[i] == train_cat[index]:
        correct += 1
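And to finish, the accuracy can be computed after the loop exactly as in the question's original code (this assumes the loop above has run, so correct holds the number of matches):
accuracy = correct / float(len(iris_test)) * 100
print(f"Accuracy: {accuracy:.2f}%")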