I have a very large dataset and need to reduce embeddings from 768 dimensions to 128 dimensions with t-SNE. Since I have more than 1 million rows, it would take weeks to run the dimensionality reduction on the whole dataset, so I thought maybe I could split the dataset into parts and process each part separately. I do not have a GPU, only a CPU.
from sklearn.manifold import TSNE
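# note: the default method='barnes_hut' only supports n_components <= 3, so 'exact' is needed here, but exact t-SNE scales roughly quadratically with the number of rows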
tsne = TSNE(n_components=128, init='pca', random_state=1001, perplexity=30, method='exact', n_iter=250, verbose=1)
X_tsne = tsne.fit_transform(df_dataset[:1000000]) # this will either fail or take a while (most likely overnight)
I am wondering whether my approach is reasonable. The code above does not use any splitting yet; it just loads the whole dataset. I just want to confirm whether splitting the data into multiple batches and then running fit_transform on each batch separately is the right way to do this.
I also came across the link below about whitening sentence representations, but I am not sure whether it would work with my approach above if I replaced t-SNE with whitening: https://deep-ch.medium.com/dimension-reduction-by-whitening-bert-roberta-5e103093f782
CodePudding user response:
It probably depends on what you're trying to do, but I suspect the answer is that it is the wrong thing to do.
Between different batches it would be difficult to guarantee that the reduced-dimension representations are comparable, since each batch would have been optimised independently, not on the same data. So you could end up with points that look similar in the low-dimensional representation even though they are not similar in the original representation.
It seems like PCA might be better suited to your case, since it is very fast. Or UMAP, since it is also fast and additionally has ways of working with batched data; a sketch of an incremental PCA approach is shown below.
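For example, scikit-learn's IncrementalPCA can learn the projection from fixed-size chunks while still producing a single shared 128-dimensional mapping for every row, which avoids the comparability problem above. This is only a minimal sketch, and it assumes your 768-dimensional embeddings live in df_dataset as in your snippet:

from sklearn.decomposition import IncrementalPCA
import numpy as np

X = df_dataset.to_numpy(dtype=np.float32)  # roughly 3 GB for 1M x 768 at float32

ipca = IncrementalPCA(n_components=128)
for start in range(0, X.shape[0], 10_000):
    ipca.partial_fit(X[start:start + 10_000])  # learn the projection incrementally, chunk by chunk

X_reduced = ipca.transform(X)  # apply the same learned projection to every row

Because every row goes through the same learned projection, points that end up close in the 128-dimensional space really are close in the original space, which is not guaranteed when you run t-SNE separately per batch.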