I have a dataset with around 76,000 columns. Since I can't inspect each column by hand, I'm trying to remove the unnecessary ones. One approach I chose is the low-variance filter. However, variance depends on the range of the data, so I would need to normalize it first (I noticed that some columns have a high variance because their values are in the millions, while columns with small decimal values have a low variance).
However, after using the scale function in R on all my columns, I noticed that every column now has a variance of 1. I'm really confused about how to implement the low-variance filter now. I'm using this website to do the low-variance filter (but I need to translate the Python code to R).
P.S. I need to reduce the dimensionality of the data because it has around 76 thousand columns and I am unable to run linear regression or any other test on it.
CodePudding user response:
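You need to turn off the scaling step in scale() by passing scale = FALSE, so that the columns are only centered and not divided by their standard deviation, i.e.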
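df <- iris[1:50, -5]
# Default scale(): center and divide by the standard deviation, so every variance is 1
sapply(data.frame(scale(df)), var)
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#           1            1            1            1
# scale = FALSE: center only, so each column keeps its original variance
sapply(data.frame(scale(df, scale = FALSE)), var)
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#  0.12424898   0.14368980   0.03015918   0.01110612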
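Since centering alone leaves each column on its original scale, one possible way to compare variances across columns is to rescale each column to [0, 1] (min-max) before computing the variance. The following is only a sketch under that assumption, not something from the linked website: the helper name low_variance_filter and the threshold of 0.01 are hypothetical, and it assumes all columns are numeric.

```r
# Hypothetical helper: keep columns whose variance, computed after
# min-max rescaling to [0, 1], is above `threshold` (an arbitrary cutoff).
low_variance_filter <- function(df, threshold = 0.01) {
  rescaled_var <- vapply(df, function(x) {
    rng <- range(x, na.rm = TRUE)
    if (rng[1] == rng[2]) return(0)  # constant column: no variance
    var((x - rng[1]) / (rng[2] - rng[1]), na.rm = TRUE)
  }, numeric(1))
  df[, rescaled_var > threshold, drop = FALSE]
}

df <- iris[1:50, -5]
df$constant <- 1  # illustrative column with zero variance
filtered <- low_variance_filter(df, threshold = 0.01)
names(filtered)   # the constant column is no longer present
```

The threshold would need to be tuned for the real data; a min-max rescaling is also sensitive to outliers, so a single extreme value can make a column look low-variance.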