Working with the Iris dataset via k-clustering

I am working with the Iris data set and have made a features dataframe to work with the measurements. I also have three made-up points that I am using as centroids, stored as three separate lists.

What I am trying to do is find the distance between each row of measurements and a centroid (one centroid at a time) for all 150 or so rows in the dataframe. For example:

centroid1=[5.1,3.4,1.2,0.2]

#first row of df_features:
5.1, 3.5, 1.4, 0.2

I am wondering how I can iterate over every row of the features dataframe to measure the Euclidean distance between the values in the row and the respective values of my centroid.

Do I make them both into numpy arrays? Do I make the centroid into a pandas dataframe? Do I make the array into a list?

My distance function is already defined as:

def dis(x,y):
    distance=0
    for i in range(len(x)):
        distance = distance + (y[i]-x[i])**2
    return distance**.5

Should I be using a different function? I am kind of lost on how to proceed here.

Also, I am trying to do this with simple code, without importing any libraries other than numpy and pandas, because I am trying to understand how to actually code this. Thanks.

CodePudding user response：

With pandas.DataFrame() and a small modification, it could be:

import pandas as pd

# a dictionary is used here, but any container would work
centroids = {"centroid1":[5.1,3.4,1.2,0.2],
             "centroid2":[3.0,3.0,3.0,3.0],
             "centroid3":[2.0,2.0,2.0,2.0]}

def dis(x, y):
    distance = 0
    for i in range(len(x)):
        distance = distance + (y[i]-x[i])**2
    return distance**.5

iris = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]])

# loop over the centroids and apply your function to every row of the data
for cent in centroids:
    print(iris.apply(lambda row: dis(row, centroids[cent]), axis=1).values)
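
On the question of whether to convert to numpy arrays: one possible sketch, not the only way, is to let NumPy broadcasting compute every row-to-centroid distance at once. The centroids are the ones from the answer above; the three-row `df_features` is only a stand-in for the full dataframe, and the names `X`, `diff`, `distances`, and `df_dist` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Centroids from the answer above; the three feature rows are a small
# stand-in for the full 150-row df_features (an assumption for illustration).
centroids = np.array([[5.1, 3.4, 1.2, 0.2],
                      [3.0, 3.0, 3.0, 3.0],
                      [2.0, 2.0, 2.0, 2.0]])
df_features = pd.DataFrame([[5.1, 3.5, 1.4, 0.2],
                            [4.9, 3.0, 1.4, 0.2],
                            [4.7, 3.2, 1.3, 0.2]])

X = df_features.to_numpy()  # shape (n_rows, 4)

# Broadcasting: (n_rows, 1, 4) - (1, 3, 4) -> (n_rows, 3, 4)
diff = X[:, None, :] - centroids[None, :, :]

# Sum the squared differences over the feature axis, then take the square root.
distances = np.sqrt((diff ** 2).sum(axis=2))  # shape (n_rows, 3)

# One column of distances per centroid, one row per measurement row.
df_dist = pd.DataFrame(distances,
                       columns=["centroid1", "centroid2", "centroid3"],
                       index=df_features.index)
print(df_dist)
```

Each column of `df_dist` should match what the `dis` function returns row by row for that centroid, so the loop version remains a handy cross-check.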

CodePudding user response：

#First, I make each column of the Iris dataframe into a list.
sl_list=df_features['sepal length'].to_list()
sw_list=df_features['sepal width'].to_list()
pl_list=df_features['petal length'].to_list()
pw_list=df_features['petal width'].to_list()

#Then, I make each column of the centroid dataframe into a list as well
centroid_sl_list=df_cents['sepal length'].to_list()
centroid_sw_list=df_cents['sepal width'].to_list()
centroid_pl_list=df_cents['petal length'].to_list()
centroid_pw_list=df_cents['petal width'].to_list()

#I first fleshed out the below function as much as I could, but then I realized
#I would have to make empty lists to hold the data, then realized I would have
#to make lists for each centroid.
sl_dist_cent1=[]
sl_dist_cent2=[]
sl_dist_cent3=[]

sw_dist_cent1=[]
sw_dist_cent2=[]
sw_dist_cent3=[]

pl_dist_cent1=[]
pl_dist_cent2=[]
pl_dist_cent3=[]

pw_dist_cent1=[]
pw_dist_cent2=[]
pw_dist_cent3=[]

#Then I really began working out my function. I have several lists because I
#wanted to output to the lists related to the iris measurement without having
#to make separate functions for each list.
def dis(cent_list, feat_list, dist_list1, dist_list2, dist_list3):
  #I set up a count to be able to delineate which list to append the distance to
  #I also set up a distance variable to hold the distance to be appended.
  count=0
  distance=0
  #Now I set up a for loop to iterate over each value in the features dataframe
  #for each value in the centroid dataframe
  for i in cent_list:
    for j in feat_list:
      #input my distance equation to work on the values (per-feature absolute difference)
      distance = (((i-j)**2)**.5)
      #I increased the count here because I like it when my counts match the
      #real-world row value, not the index. I then set up if statements to
      #select the list to append to (this assumes 150 rows per centroid).
      count += 1
      if count<=150:
        dist_list1.append(distance)
      elif count<=300:
        dist_list2.append(distance)
      else:
        dist_list3.append(distance)
      #I reset the distance here to prep for the next iteration
      distance=0

dis(centroid_sl_list,sl_list, sl_dist_cent1, sl_dist_cent2, sl_dist_cent3)

#Checking to make sure that each list has the proper length.
#print(len(sl_dist_cent1))
#print(len(sl_dist_cent2))
#print(len(sl_dist_cent3))

#Now I have to repeat for the other columns.

dis(centroid_sw_list,sw_list, sw_dist_cent1, sw_dist_cent2, sw_dist_cent3)
dis(centroid_pl_list,pl_list, pl_dist_cent1, pl_dist_cent2, pl_dist_cent3)
dis(centroid_pw_list,pw_list, pw_dist_cent1, pw_dist_cent2, pw_dist_cent3)

#Now to make new dataframes with the columns:

centroid1_dist=pd.DataFrame({'sepal length distance': sl_dist_cent1, 'sepal width distance':sw_dist_cent1, 'petal length distance':pl_dist_cent1, 'petal width distance':pw_dist_cent1})
centroid2_dist=pd.DataFrame({'sepal length distance': sl_dist_cent2, 'sepal width distance':sw_dist_cent2, 'petal length distance':pl_dist_cent2, 'petal width distance':pw_dist_cent2})
centroid3_dist=pd.DataFrame({'sepal length distance': sl_dist_cent3, 'sepal width distance':sw_dist_cent3, 'petal length distance':pl_dist_cent3, 'petal width distance':pw_dist_cent3})
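
The dataframes above hold per-feature absolute differences rather than one Euclidean distance per row. As a rough sketch, the same per-feature values can be obtained by subtracting one centroid row from the whole features dataframe, and squaring and summing across the columns then gives the single distance the question asked about. This assumes `df_features` and `df_cents` share the same four column names; the small frames below are stand-ins, and `per_feature` and `euclidean` are hypothetical names.

```python
import pandas as pd

cols = ["sepal length", "sepal width", "petal length", "petal width"]

# Stand-in data (assumption): three feature rows and three made-up centroids.
df_features = pd.DataFrame([[5.1, 3.5, 1.4, 0.2],
                            [4.9, 3.0, 1.4, 0.2],
                            [4.7, 3.2, 1.3, 0.2]], columns=cols)
df_cents = pd.DataFrame([[5.1, 3.4, 1.2, 0.2],
                         [3.0, 3.0, 3.0, 3.0],
                         [2.0, 2.0, 2.0, 2.0]], columns=cols)

per_feature = {}  # centroid number -> dataframe of |feature - centroid| per column
euclidean = {}    # centroid number -> Series with one distance per row

for k in range(len(df_cents)):
    cent = df_cents.iloc[k]        # one centroid as a Series
    diff = df_features - cent      # aligned on the column names
    per_feature[k + 1] = diff.abs().add_suffix(" distance")
    euclidean[k + 1] = (diff ** 2).sum(axis=1) ** 0.5

print(per_feature[1])
print(euclidean[1])
```

Here `per_feature[1]` plays the role of `centroid1_dist`, without the hard-coded row count of 150, and `euclidean[1]` collapses it to one number per row.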