I need to create a function that compares each sample from a list (unknown_dataset) to two other datasets (red_dataset and blue_dataset), finds the shortest distance, and then classifies the sample. So far I have a function to calculate the distance:
#function to calculate all distances
def calculate_distance(p1, p2):
    d = 0
    for i in range(len(p1)):
        d = (p2[i] - p1[i]) * (p2[i] - p1[i])
    d = d**0.5
    return d
and a function to find the shortest distance:
#function to find the shortest distance
def calculate_shortest_distance(sample, data_list):
    min_dist = []
    for list_sample in data_list:
        dist = calculate_distance(sample, list_sample)
        min_dist.append(dist)
    return min(min_dist)
Now I need to use them to classify the samples. My output needs to be [(0.67, 0.95, blue), (xxx, yyy, color), ...]. I am completely unable to find a way to add the color to the list. My code so far:
def calculate_membership(unknown_dataset, red_dataset, blue_dataset):
    membership = unknown_dataset.copy()
    for sample in unknown_dataset:
        red_min = calculate_shortest_distance(sample, red_dataset)
        blue_min = calculate_shortest_distance(sample, blue_dataset)
        if red_min > blue_min:
            membership.append("blue")
        else:
            membership.append("red")
    return membership
Thank you for all your help.
EDIT: I need to write an algorithm for the following:
• Read 3 files for the red, blue, and unknown data sets
• For each unknown sample in the unknown data set:
    Calculate distances from the unknown sample to all red data samples
    Find min_1 (the minimum of the above distances to red samples)
    Calculate distances from the unknown sample to all blue data samples
    Find min_2 (the minimum of the above distances to blue samples)
    Compare min_1 and min_2 and assign a class label to the unknown sample
• Output all unknown samples and their class labels to screen
• Output all unknown samples and their class labels to file
I am stuck on the "Compare min_1 and min_2 and assign class label to the unknown sample" step.
CodePudding user response:
Your approach seems to have some needless complications. This:
def calculate_distance(p1, p2):
    d = 0
    for i in range(len(p1)):
        d = (p2[i] - p1[i]) * (p2[i] - p1[i])
    d = d**0.5
    return d
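As posted, d is overwritten on every pass through the loop (d = rather than d +=), so this returns only the absolute difference of the last elements of p1 and p2, which appear to be vectors / n-tuples. Presumably the intent was to add up the absolute differences over all elements. This would do that: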
def dist(p1, p2):
    return sum(abs(v1 - v2) for v1, v2 in zip(p1, p2))
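If straight-line (Euclidean) distance is what you are after, which the squaring and the **0.5 in calculate_distance suggest, a minimal sketch could look like the following. The name euclidean_dist is only illustrative, and math.dist needs Python 3.8 or later.

```python
import math

def euclidean_dist(p1, p2):
    # Sum the squared per-coordinate differences, then take a single square root
    return sum((v1 - v2) ** 2 for v1, v2 in zip(p1, p2)) ** 0.5

# math.dist (Python 3.8+) computes the same quantity
print(euclidean_dist((0, 0), (3, 4)))  # 5.0
print(math.dist((0, 0), (3, 4)))       # 5.0
```

Either distance can be plugged into the rest of the code unchanged, since only the comparison between the two minimum distances matters.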
The second part:
def calculate_shortest_distance(sample, data_list):
    min_dist = []
    for list_sample in data_list:
        dist = calculate_distance(sample, list_sample)
        min_dist.append(dist)
    return min(min_dist)
This seems to take sample, a single one of these vectors / n-tuples, and tries to find the minimum value of calculate_distance from it to each entry of data_list, a list of similar vectors / n-tuples. This would do the same:
def min_dist(sample, data_list):
    return min(dist(sample, p) for p in data_list)
However, your code would work the same.
The part that is apparently giving you trouble is generating the required output from a function that takes a list of such sample vectors / n-tuples, called unknown_dataset, compares each of its values to two similar lists (red_dataset and blue_dataset), and classifies each element based on the shortest distance from that vector / n-tuple to any of the elements in the red or blue dataset.
Using the functions you wrote, or rather the replacements above:
def membership(unknown, red, blue):
    return [(*p, 'red' if min_dist(p, red) < min_dist(p, blue) else 'blue') for p in unknown]
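If the one-liner is hard to read, the same idea can be written as an explicit loop. The key point is to start from an empty list and append one new tuple per sample, (*sample, label), rather than copying unknown_dataset and appending bare strings to the copy. This is only a sketch: membership_loop is an illustrative name, and the two helpers are repeated so the block runs on its own.

```python
def dist(p1, p2):
    return sum(abs(v1 - v2) for v1, v2 in zip(p1, p2))

def min_dist(sample, data_list):
    return min(dist(sample, p) for p in data_list)

def membership_loop(unknown, red, blue):
    result = []  # start empty; do not copy `unknown`
    for sample in unknown:
        red_min = min_dist(sample, red)
        blue_min = min_dist(sample, blue)
        # Ties go to 'blue', matching the one-liner version
        label = 'red' if red_min < blue_min else 'blue'
        # Build a new tuple: the sample's coordinates followed by the label
        result.append((*sample, label))
    return result

print(membership_loop([(0.1, 0.3)], [(0.1, 0.2)], [(0.5, 0.3)]))
# [(0.1, 0.3, 'red')]
```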
Putting it all together with some example data:
def dist(p1, p2):
    return sum(abs(v1 - v2) for v1, v2 in zip(p1, p2))

def min_dist(sample, data_list):
    return min(dist(sample, p) for p in data_list)

def membership(unknown, red, blue):
    return [(*p, 'red' if min_dist(p, red) < min_dist(p, blue) else 'blue') for p in unknown]

example_red = [(0.1, 0.2), (0.4, 0.6), (0.2, 0.7)]
example_blue = [(0.5, 0.3), (0.7, 0.1), (0.9, 0.9)]
example_unknown = [(0.1, 0.3), (0.2, 0.8), (0.6, 0.2)]
print(membership(example_unknown, example_red, example_blue))
Output
[(0.1, 0.3, 'red'), (0.2, 0.8, 'red'), (0.6, 0.2, 'blue')]
Or, using your own implementation of calculate_distance and calculate_shortest_distance:
# your two functions here

def calculate_membership(unknown_dataset, red_dataset, blue_dataset):
    return [(*p,
             'red' if (calculate_shortest_distance(p, red_dataset) <
                       calculate_shortest_distance(p, blue_dataset))
             else 'blue') for p in unknown_dataset]

example_red = [(0.1, 0.2), (0.4, 0.6), (0.2, 0.7)]
example_blue = [(0.5, 0.3), (0.7, 0.1), (0.9, 0.9)]
example_unknown = [(0.1, 0.3), (0.2, 0.8), (0.6, 0.2)]
print(calculate_membership(example_unknown, example_red, example_blue))
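This gives the same output, provided calculate_distance accumulates the per-element differences (d += rather than d =). As posted, it compares only the last element of each sample, so the labels can differ.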
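For the file steps in your edit (reading the data sets, then writing the labelled samples to screen and to a file), a rough sketch follows. It assumes each file holds one sample per line as comma-separated numbers, which your question does not actually specify. read_samples, write_results, classify_files, and the file names in the final comment are all hypothetical.

```python
import csv

def read_samples(path):
    # Assumed format: one sample per line, comma-separated numbers, e.g. "0.67,0.95"
    with open(path, newline='') as f:
        return [tuple(float(v) for v in row) for row in csv.reader(f) if row]

def write_results(path, results):
    # Writes one labelled sample per line, e.g. "0.67,0.95,blue"
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(results)

def dist(p1, p2):
    return sum(abs(v1 - v2) for v1, v2 in zip(p1, p2))

def min_dist(sample, data_list):
    return min(dist(sample, p) for p in data_list)

def membership(unknown, red, blue):
    return [(*p, 'red' if min_dist(p, red) < min_dist(p, blue) else 'blue') for p in unknown]

def classify_files(red_path, blue_path, unknown_path, out_path):
    red = read_samples(red_path)
    blue = read_samples(blue_path)
    unknown = read_samples(unknown_path)
    results = membership(unknown, red, blue)
    for row in results:
        print(row)                   # output to screen
    write_results(out_path, results)  # output to file
    return results

# Example call (hypothetical file names):
# classify_files('red.txt', 'blue.txt', 'unknown.txt', 'classified.txt')
```

If your files use a different separator (spaces or tabs, say), only read_samples needs to change.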