Confused about StandardScaler().fit_transform() and pca.fit_transform() for PCA without training data


I've read through all of the previously posted questions about these libraries, but they all seem to either involve training/test data or only ask about PCA() or SS() on its own. I'm confused about the difference between the two calls and really want to make sure I'm processing my data correctly.

I have a dataset of neuron signals over time. I want each time-point as a point on the scatterplot. So I have each neuron in a column, and time as the index. Here is basically what I did:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler as ss

def get_pca(df: pd.DataFrame):

    # Scale so that mean = 0 and st.deviation = 1.
    df = ss().fit_transform(df)

    # Need to fit_transform again? 
    pca = PCA()
    df = pca.fit_transform(df)

    # Pull out percentage of variance explained
    variance = np.round(
        pca.explained_variance_ratio_ * 100, decimals=1)
    labels = ['PC' + str(x) for x in range(1, len(variance) + 1)]
    
    df = pd.DataFrame(df, columns=labels)

    return df, variance, labels

# Make dummy dataframe as reproducible example
neurons = list('ABCD')

df = pd.DataFrame(
    np.random.randint(0,100,size=(15, 4)),
    columns=neurons)

df, loading_scores, components = get_pca(df)

I can't figure out what I'm doing to this data, why I'm scaling with StandardScaler().fit_transform and then transforming again with PCA().fit_transform, and whether this is the proper method for what I'm trying to achieve. Can anyone give me some insight into this process?

CodePudding user response:

StandardScaler scales the data by converting each value to its z-score, i.e., it subtracts the column mean and divides by the column standard deviation.

This scaling/standardization brings features measured on different scales onto the same scale, which in later stages also helps gradient-descent-based algorithms reach an optimum efficiently. Since you are using random data here, it won't cause any problem either way.
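
For concreteness, here is a small sketch of my own (with a made-up two-column frame, not your neuron data) showing that StandardScaler's output is just the per-column z-score:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [100.0, 220.0, 310.0, 450.0]})

# StandardScaler: z = (x - mean) / std, computed column by column
scaled = StandardScaler().fit_transform(df)

# The same thing by hand (StandardScaler uses the population std, ddof=0)
manual = (df - df.mean()) / df.std(ddof=0)

print(np.allclose(scaled, manual))  # True
print(scaled.mean(axis=0))          # ~0 per column
print(scaled.std(axis=0))           # 1 per column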

You can read up on the basics of this process here.

PCA, on the other hand, helps you reduce dimensions, i.e., it brings data scattered across many dimensions down into fewer dimensions.

Check out this interactive tool by TensorFlow to visually understand what PCA and other dimensionality-reduction techniques do to data.

Your code is simply a demonstration of these two things in practice.
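
If it helps to see scaling and PCA as one unit, here is one way to write it with scikit-learn's make_pipeline (my own sketch, not from your code; n_components=2 is just an illustrative choice):

import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Dummy data with the same shape as in the question
df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)),
                  columns=list('ABCD'))

# Scale, then project onto the first two principal components, in one object
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipe.fit_transform(df)  # shape (15, 2)

# The fitted PCA step is still available for its attributes
print(pipe.named_steps['pca'].explained_variance_ratio_)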

CodePudding user response:

As the name suggests, PCA is an analysis of the principal components of your dataset. PCA transforms your data so that its first component (PC_1 in your case) captures the most important direction of variation across your dataset. Similarly, the second one is the second most important component of your data.

The main application of PCA is dimensionality reduction, i.e., projecting M-dimensional data points down to N dimensions (where N < M) in a way that keeps the most critical information across the dataset (the N principal components). In other words, it linearly reduces dimensionality using an SVD (Singular Value Decomposition) of the data to project it onto a lower-dimensional space.
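
To make the SVD connection concrete, here is a rough sketch of my own (random standardized data, not yours); note that sklearn may flip the sign of individual components, so the comparison uses absolute values:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(15, 4)))

# sklearn's PCA scores
scores_pca = PCA().fit_transform(X)

# The same projection by hand: SVD of the (already centered) data,
# with scores = U * S
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores_svd = U * S

# Components can differ by sign only
print(np.allclose(np.abs(scores_pca), np.abs(scores_svd)))  # True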

In your case M = N, which means you are not reducing dimensionality; still, your PC_1 explains more variance than the rest. You can see the contribution of each component in explained_variance_ratio_, which is the loading_scores variable in your code.

First you have to standardize the data, as you are doing with StandardScaler, and then pass it to PCA. This puts every feature on the same scale, so no single feature dominates the components (and therefore explained_variance_ratio_) just because it is measured in larger units; check this to see what happens if you don't normalize. Also, fit_transform is just a convenience method that fits the defined transformer on the data and applies the transform in one call.
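
For example (a small sketch of my own, not from your code), fit learns the parameters (the column means and standard deviations for StandardScaler) and transform applies them; fit_transform does both on the same data:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)),
                  columns=list('ABCD'))

# Two-step version: learn the per-column mean/std, then apply them
scaler = StandardScaler()
scaler.fit(df)                  # stores scaler.mean_ and scaler.scale_
two_step = scaler.transform(df)

# One-step convenience: identical result on the same data
one_step = StandardScaler().fit_transform(df)

print(np.allclose(two_step, one_step))  # True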

Conclusion:

If you want to visualize your data, you should pass n_components=3 (pca = PCA(n_components=3)) for a 3D visualization or n_components=2 for 2D. PCA will then project the data onto 2 or 3 plottable components. You can also just plot the first 2 or 3 of the four components you currently compute, but that visualization will only contain the explained_variance_ratio_ of those first components (you can see that amount with print(loading_scores[:3])).
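
For instance, a 2D version could look like this (my own sketch with the same dummy data as in the question; matplotlib is assumed, and the axis labels just show the explained-variance percentages for context):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)),
                  columns=list('ABCD'))

# Scale, then keep only the first two principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(df))
var = np.round(pca.explained_variance_ratio_ * 100, 1)

# One point per time-point, as in the question
plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel(f'PC1 ({var[0]}%)')
plt.ylabel(f'PC2 ({var[1]}%)')
plt.title('Time-points in PC space')
plt.show()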
