How does `sklearn.neighbors.KernelDensity` deal with overflow on high-dimensional data? And how can I

Time:05-25

I'm trying to reproduce the Gaussian kernel density computation of sklearn.neighbors.KernelDensity in TensorFlow. The computation should be faster under tf.function if all operations can be converted to a graph. Here is my partially successful code:

import tensorflow as tf 
import numpy as np
from sklearn.neighbors import KernelDensity

tf2pi = tf.constant(2*np.pi,dtype=tf.float64)
def log_gauss_norm(h,d):
    return -0.5*d*tf.math.log(tf2pi)-d*tf.math.log(h)
def gauss(x,d,h):
    y = log_gauss_norm(h,d)-0.5*tf.reduce_sum(x**2,axis=-1)
    return tf.math.exp(y)
@tf.function
def my_kde(x,data_array,bandwidth=2.):
    n_features = tf.cast(float(data_array.shape[-1]),tf.float64)
    bandwidth = tf.cast(bandwidth,tf.float64)
    assert len(x.shape)==2
    x = x[:,tf.newaxis,:]
    y = gauss((x-data_array)/bandwidth,d=n_features,h=bandwidth)
    y = tf.reduce_mean(y,axis=-1)
    return tf.math.log(y)
# Low-dimensional case (40 features): the result matches sklearn
np.random.seed(0)
basic = np.array(np.random.normal(0,1.0,size=[10000,40]),dtype=np.float64)
kde = KernelDensity(kernel='gaussian',bandwidth=2).fit(basic)
y1 = my_kde(basic[0:2],basic)
tf.print(y1) # [-73.09079498452077 -71.975842500329691]
y2 = kde.score_samples(basic[0:2])
print(y2)  # [-73.09079498 -71.9758425 ]
assert all(np.isclose(y1-y2,0.0)) 

# High-dimensional case (800 features): my_kde returns -inf
np.random.seed(0)
basic = np.array(np.random.normal(0,1.0,size=[10000,800]),dtype=np.float64)
kde = KernelDensity(kernel='gaussian',bandwidth=2).fit(basic)
y1 = my_kde(basic[0:2],basic)
tf.print(y1) # [-inf -inf]
y2 = kde.score_samples(basic[0:2])
print(y2) # [-1298.87891138 -1298.87891138]

The problem is that with high-dimensional data, such as the 800 features in the code above, my version returns -inf, while sklearn.neighbors.KernelDensity still works and returns a meaningful lower bound. I want to reproduce this lower-bound behavior. I dug into the source code and found that the critical code is in the _kde_single_breadthfirst() function in sklearn\neighbors\_binary_tree.pxi, but I cannot understand this function, so I am asking for help here.
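
A quick arithmetic check may help locate where the two numbers come from. It is a sketch under one assumption: the queried rows are themselves part of the fitted data (as basic[0:2] is here), so each query has one kernel term equal to exp(0) = 1 while all other terms are negligible in 800 dimensions. Under that assumption the log-density should sit near log_norm - log(n_samples), which comes out close to the -1298.87891138 printed by sklearn. The variable names below are illustrative, not taken from sklearn.

```python
import math

n_samples = 10000    # rows of `basic` in the 800-feature run
n_features = 800
bandwidth = 2.0

# Log of the Gaussian normalisation constant, same formula as log_gauss_norm.
log_norm = (-0.5 * n_features * math.log(2.0 * math.pi)
            - n_features * math.log(bandwidth))

# Assumption: only the query point's own kernel term (exp(0) = 1) survives,
# so the mean over the data is 1 / n_samples times the normalisation constant.
floor = log_norm - math.log(n_samples)
print(floor)

# exp(log_norm) is far below the smallest positive float64, so it rounds
# to 0.0; taking the log of a mean of zeros then gives -inf.
print(math.exp(log_norm))
```

If this reading is right, the -inf in my_kde is an underflow inside tf.math.exp rather than an overflow, and the "lower bound" is simply the value the sum takes when one term dominates.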

CodePudding user response:

I'm sorry; because I lack basic computer science knowledge, I did not understand at first why the data is stored in a tree structure when estimating density. Now I can frame this problem as how to implement a kd-tree or ball tree in TensorFlow, and then how to search it, compute the density, and handle the boundary cases.
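
A tree may not be needed just to avoid the -inf. The following is a minimal brute-force sketch, assuming the -inf comes from tf.math.exp underflowing before the final log (this has not been checked against the sklearn source): it keeps every term in log space and reduces with tf.reduce_logsumexp. The function name log_kde is hypothetical, and the (queries, samples, features) intermediate tensor can be large, so queries may need to be processed in batches.

```python
import numpy as np
import tensorflow as tf

LOG_2PI = tf.constant(np.log(2.0 * np.pi), dtype=tf.float64)

@tf.function
def log_kde(x, data, bandwidth=2.0):
    """Log of a Gaussian KDE at each row of x, computed in log space."""
    x = tf.convert_to_tensor(x, dtype=tf.float64)
    data = tf.convert_to_tensor(data, dtype=tf.float64)
    h = tf.cast(bandwidth, tf.float64)
    n = tf.cast(tf.shape(data)[0], tf.float64)  # number of fitted samples
    d = tf.cast(tf.shape(data)[1], tf.float64)  # number of features

    # Scaled differences, shape (n_queries, n_samples, n_features).
    diff = (x[:, tf.newaxis, :] - data[tf.newaxis, :, :]) / h
    # Log of each unnormalised kernel term; exp() is never applied directly.
    log_k = -0.5 * tf.reduce_sum(diff ** 2, axis=-1)

    log_norm = -0.5 * d * LOG_2PI - d * tf.math.log(h)
    # log(mean(exp(log_k))) = logsumexp(log_k) - log(n), evaluated stably.
    return log_norm - tf.math.log(n) + tf.reduce_logsumexp(log_k, axis=-1)

# Small usage example: 800 features, queries taken from the data itself.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 800))
out = log_kde(data[:2], data).numpy()
print(out)  # finite values instead of -inf
```

tf.reduce_logsumexp subtracts the largest term before exponentiating, so at least one term is exp(0) = 1 and the sum cannot collapse to zero. A tree would still matter for speed on large datasets, but that is a separate concern from the numerical stability addressed here.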
