Assign class to a sample in Python

I need to create a function that compares each sample from a list (unknown_dataset) to two other datasets (red_dataset and blue_dataset), finds the shortest distance, and then classifies the sample. So far I have a function to calculate distance

#function to calculate all distances
def calculate_distance(p1, p2):
    d = 0 
    for i in range(len(p1)): 
        d += (p2[i] - p1[i]) * (p2[i] - p1[i])
    d = d**0.5 
    return d

and a function to find the shortest distance

#function to find the shortest distance
def calculate_shortest_distance(sample, data_list):
    min_dist = []
    for list_sample in data_list:
        dist = calculate_distance(sample, list_sample)
        min_dist.append(dist)
    return min(min_dist)

Now I need to use these to classify each sample. My output needs to be [(0.67, 0.95, blue), (xxx, yyy, color),...]. I have been unable to find a way to add the color to each entry of the list. My code so far:

def calculate_membership(unknown_dataset, red_dataset, blue_dataset):
    membership = unknown_dataset.copy()
    for sample in unknown_dataset:
        red_min = calculate_shortest_distance(sample, red_dataset)
        blue_min = calculate_shortest_distance(sample, blue_dataset)
        if red_min > blue_min:
            membership.append("blue")
        else:
            membership.append("red")
    return membership

Thank you for all your help.

EDIT: I need to write an algorithm for the following:

• Read 3 files for the red, blue, and unknown data sets
• For each unknown sample in the unknown data set:
    • Calculate distances from the unknown sample to all red data samples
    • Find min_1 (the minimum of the above distances to red samples)
    • Calculate distances from the unknown sample to all blue data samples
    • Find min_2 (the minimum of the above distances to blue samples)
    • Compare min_1 and min_2 and assign a class label to the unknown sample
• Output all unknown samples and their class labels to screen
• Output all unknown samples and their class labels to file

I am stuck on the "Compare min_1 and min_2 and assign class label to the unknown sample" step.

CodePudding user response:

Your approach seems to have some needless complications. This:

def calculate_distance(p1, p2):
    d = 0 
    for i in range(len(p1)): 
        d += (p2[i] - p1[i]) * (p2[i] - p1[i])
    d = d**0.5 
    return d

Computes the Euclidean distance between p1 and p2, which appear to be vectors / n-tuples: the square root of the sum of the squared differences of their components.

This would do the same:

def dist(p1, p2):
    return sum((v1 - v2) ** 2 for v1, v2 in zip(p1, p2)) ** 0.5

The second part:

def calculate_shortest_distance(sample, data_list):
    min_dist = []
    for list_sample in data_list:
        dist = calculate_distance(sample, list_sample)
        min_dist.append(dist)
    return min(min_dist)

Takes sample, a single one of these vectors / n-tuples, and finds the minimum value of calculate_distance over data_list, a list of similar vectors / n-tuples. This would do the same:

def min_dist(sample, data_list):
    return min(dist(sample, p) for p in data_list)

However, your code would work the same.
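
As a quick sanity check, the loop-based function can be compared against math.dist from the standard library (available from Python 3.8). This is only a sketch, and it assumes the accumulation line in calculate_distance is meant to be `d += ...`, which makes it a Euclidean distance:

```python
import math


def calculate_distance(p1, p2):
    # Loop-based version from the question, assuming the accumulation is `+=`.
    d = 0
    for i in range(len(p1)):
        d += (p2[i] - p1[i]) * (p2[i] - p1[i])
    return d ** 0.5


a, b = (0.0, 0.0), (3.0, 4.0)
print(calculate_distance(a, b))  # 5.0
print(math.dist(a, b))           # 5.0
```

If both lines print the same value for a few of your own samples, the hand-written loop is doing what you expect.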

The part that is giving you trouble is producing the required output. The function takes a list of sample vectors / n-tuples, unknown_dataset, compares each one to two similar lists (red_dataset and blue_dataset), and classifies it by whichever dataset contains the nearest element. Your version copies unknown_dataset and then appends bare color strings to the end of that copy, so the colors are never attached to their samples; each sample needs to be combined with its color into a new tuple instead.

Using the functions you wrote, or rather the replacements above:

def membership(unknown, red, blue):
    return [(*p, 'red' if min_dist(p, red) < min_dist(p, blue) else 'blue') for p in unknown]
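
If the one-liner is hard to read, the same idea can be written as an explicit loop that is closer in shape to your original function. This is a minimal sketch: membership_loop is a hypothetical name, min_dist is repeated here (using Euclidean distance) so the block runs on its own, and ties go to blue as in the one-liner above, whereas the comparison in your question sends ties to red.

```python
def min_dist(sample, data_list):
    # Smallest Euclidean distance from sample to any point in data_list.
    return min(
        sum((a - b) ** 2 for a, b in zip(sample, p)) ** 0.5
        for p in data_list
    )


def membership_loop(unknown, red, blue):
    # Build a new list instead of copying the input and appending to it.
    labelled = []
    for sample in unknown:
        color = 'red' if min_dist(sample, red) < min_dist(sample, blue) else 'blue'
        # New tuple: the sample's coordinates followed by its label.
        labelled.append((*sample, color))
    return labelled


print(membership_loop([(0.1, 0.3), (0.6, 0.2)],
                      [(0.1, 0.2), (0.4, 0.6)],
                      [(0.5, 0.3), (0.7, 0.1)]))
# [(0.1, 0.3, 'red'), (0.6, 0.2, 'blue')]
```

The key line is `labelled.append((*sample, color))`: it creates one tuple per sample with the color as the last element, which is the output shape you asked for.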

Putting it all together with some example data:

def dist(p1, p2):
    return sum((v1 - v2) ** 2 for v1, v2 in zip(p1, p2)) ** 0.5


def min_dist(sample, data_list):
    return min(dist(sample, p) for p in data_list)


def membership(unknown, red, blue):
    return [(*p, 'red' if min_dist(p, red) < min_dist(p, blue) else 'blue') for p in unknown]


example_red = [(0.1, 0.2), (0.4, 0.6), (0.2, 0.7)]
example_blue = [(0.5, 0.3), (0.7, 0.1), (0.9, 0.9)]
example_unknown = [(0.1, 0.3), (0.2, 0.8), (0.6, 0.2)]

print(membership(example_unknown, example_red, example_blue))

Output

[(0.1, 0.3, 'red'), (0.2, 0.8, 'red'), (0.6, 0.2, 'blue')]

Or, using your own implementation of calculate_distance and calculate_shortest_distance:

# your two functions here

def calculate_membership(unknown_dataset, red_dataset, blue_dataset):
    return [(*p,
             'red' if (calculate_shortest_distance(p, red_dataset) <
                       calculate_shortest_distance(p, blue_dataset))
             else 'blue') for p in unknown_dataset]


example_red = [(0.1, 0.2), (0.4, 0.6), (0.2, 0.7)]
example_blue = [(0.5, 0.3), (0.7, 0.1), (0.9, 0.9)]
example_unknown = [(0.1, 0.3), (0.2, 0.8), (0.6, 0.2)]

print(calculate_membership(example_unknown, example_red, example_blue))

Same output.
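
The remaining steps from the EDIT in the question (reading the files, printing the result, and writing it to a file) might look roughly like the sketch below. The question does not show the file format, so this assumes one sample per line with comma-separated numbers. The file names and the helpers read_samples, nearest, classify and write_labelled are all hypothetical, and the demo writes throwaway files to a temporary directory so that it runs on its own.

```python
import csv
import os
import tempfile


def read_samples(path):
    # Assumed format: one sample per line, comma-separated numbers.
    with open(path, newline='') as f:
        return [tuple(float(v) for v in row) for row in csv.reader(f) if row]


def nearest(sample, data):
    # Smallest Euclidean distance from sample to any point in data.
    return min(sum((a - b) ** 2 for a, b in zip(sample, p)) ** 0.5 for p in data)


def classify(unknown, red, blue):
    # Label each sample by the dataset holding its nearest point; ties go to blue.
    return [(*p, 'red' if nearest(p, red) < nearest(p, blue) else 'blue')
            for p in unknown]


def write_labelled(path, labelled):
    # One labelled sample per line, e.g. "0.6,0.2,blue".
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(labelled)


# Demo with throwaway files; replace the paths with your real data files.
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        'red.txt': '0.1,0.2\n0.4,0.6\n0.2,0.7\n',
        'blue.txt': '0.5,0.3\n0.7,0.1\n0.9,0.9\n',
        'unknown.txt': '0.1,0.3\n0.2,0.8\n0.6,0.2\n',
    }
    for name, text in contents.items():
        with open(os.path.join(tmp, name), 'w') as f:
            f.write(text)

    red = read_samples(os.path.join(tmp, 'red.txt'))
    blue = read_samples(os.path.join(tmp, 'blue.txt'))
    unknown = read_samples(os.path.join(tmp, 'unknown.txt'))

    labelled = classify(unknown, red, blue)
    print(labelled)  # output to screen
    write_labelled(os.path.join(tmp, 'labelled.txt'), labelled)  # output to file
    with open(os.path.join(tmp, 'labelled.txt')) as f:
        written = f.read()
```

If your files use whitespace instead of commas, splitting each line with str.split() in read_samples would be the only change needed there.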