np.mean giving wrong value? PCA-Machine learning

I have a dataset whose first row is [10, 9, 1, 7, 5, 3], and the dataset has 14 rows in total.

In my lesson today, the teacher computed a mean value for each row of the dataset (shown in the first two slides below) for the PCA standardization, and subtracted each row's mean from the data in that row. So I declared c = data - data.mean(), expecting the first row's mean to be (10 + 9 + 1 + 7 + 5 + 3) / 6 ≈ 5.83 and c for the first entry (10) to be 10 - 5.83 ≈ 4.17. But when I printed c, the value for 10 came out as 3.01190476, and when I printed the mean it wasn't one value per row at all, just a single number: 6.988095238095238.

My code:

import numpy as np
import pandas as pd

dataset = pd.read_csv("cands_dataset.csv")
x = dataset.iloc[:, 1:].values

m = x.mean(dtype=np.float64)
print("x:", x)
print("-------------------")
print("mean:", m)
print("-------------------")
c = x - m
print("c:", c)

My output:

x: [[10  9  1  7  5  3]
 [10  8 10  8  8 10]
 [ 4  6  8  9  8  9]
 [ 9  9  6  6  8  7]
 [ 4  9 10  9  5  5]
 [ 7  6  9  9 10 10]
 [10  9  6  6  8  7]
 [ 5  4  2 10  9  1]
 [ 4 10  4  9  2  6]
 [ 8  7  7  7  7  9]
 [ 5  5  6  6  9  1]
 [ 9 10  1  9  5  6]
 [ 7  6  6  6 10  4]
 [10  9  9  7  9  4]]
-------------------
mean: 6.988095238095238
-------------------
c: [[ 3.01190476  2.01190476 -5.98809524  0.01190476 -1.98809524 -3.98809524]
 [ 3.01190476  1.01190476  3.01190476  1.01190476  1.01190476  3.01190476]
 [-2.98809524 -0.98809524  1.01190476  2.01190476  1.01190476  2.01190476]
 [ 2.01190476  2.01190476 -0.98809524 -0.98809524  1.01190476  0.01190476]
 [-2.98809524  2.01190476  3.01190476  2.01190476 -1.98809524 -1.98809524]
 [ 0.01190476 -0.98809524  2.01190476  2.01190476  3.01190476  3.01190476]
 [ 3.01190476  2.01190476 -0.98809524 -0.98809524  1.01190476  0.01190476]
 [-1.98809524 -2.98809524 -4.98809524  3.01190476  2.01190476 -5.98809524]
 [-2.98809524  3.01190476 -2.98809524  2.01190476 -4.98809524 -0.98809524]
 [ 1.01190476  0.01190476  0.01190476  0.01190476  0.01190476  2.01190476]
 [-1.98809524 -1.98809524 -0.98809524 -0.98809524  2.01190476 -5.98809524]
 [ 2.01190476  3.01190476 -5.98809524  2.01190476 -1.98809524 -0.98809524]
 [ 0.01190476 -0.98809524 -0.98809524 -0.98809524  3.01190476 -2.98809524]
 [ 3.01190476  2.01190476  2.01190476  0.01190476  2.01190476 -2.98809524]]

Teacher's slide on how he did it: [slide image]

Second slide: [slide image]

My dataset (just in case it's needed): [image of the dataset]

Did I misunderstand the slides, or what's going on?

CodePudding user response:

I think the problem is that when you use .iloc to subset the original dataset, you selected all the rows and all the columns from column 1 onward, so x is the whole 2-D array, and x.mean() with no axis argument averages every element in it, which is why you get a single number.
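For reference, here is a minimal sketch of how np.mean treats the axis argument on a 2-D array; the small array below is just the first two rows of the data from the question:

import numpy as np

a = np.array([[10, 9, 1, 7, 5, 3],
              [10, 8, 10, 8, 8, 10]])

print(a.mean())        # 7.4166... -> one number, the mean of all 12 values
print(a.mean(axis=0))  # [10.  8.5  5.5  7.5  6.5  6.5] -> one mean per column
print(a.mean(axis=1))  # [5.8333...  9.] -> one mean per row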

Please try this code if you want each row's mean and the difference row by row (here practice is the DataFrame read from your CSV):

import numpy as np
import pandas as pd

practice = pd.read_csv("cands_dataset.csv")

# loop over the rows so each row gets its own mean
for i in range(len(practice)):
    x = practice.iloc[i, 1:].values

    m = x.mean(dtype=np.float64)
    print("x:", x)
    print("-------------------")
    print("mean:", m)
    print("-------------------")
    c = x - m
    print("c:", c)
    print('')
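For the first row [10 9 1 7 5 3] this should print a mean of about 5.8333 and c of about [4.1667 3.1667 -4.8333 1.1667 -0.8333 -2.8333], which matches the hand calculation in the question.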

CodePudding user response:

Fixed it by changing m = x.mean() to m = x.mean(axis=1).
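As a side note, here is a minimal sketch of the vectorised version, assuming the goal is still to subtract each row's mean from that row. With axis=1 alone the mean has shape (14,), which does not broadcast against the (14, 6) array, so keepdims=True (or a reshape to a column vector) is needed for x - m to work row by row:

import numpy as np
import pandas as pd

dataset = pd.read_csv("cands_dataset.csv")
x = dataset.iloc[:, 1:].values        # shape (14, 6)

m = x.mean(axis=1, keepdims=True)     # shape (14, 1): one mean per row
c = x - m                             # broadcasts each row's mean across its columns

print(m[0])   # [5.83333333] for the first row [10 9 1 7 5 3]
print(c[0])   # [ 4.16666667  3.16666667 -4.83333333  1.16666667 -0.83333333 -2.83333333]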
