I'm trying to reproduce the Gaussian kernel density computation of sklearn.neighbors.KernelDensity
in TensorFlow. The computation should be faster under tf.function,
provided all operations can be converted to a graph. Here is my code, which works for low-dimensional data:
import tensorflow as tf
import numpy as np
from sklearn.neighbors import KernelDensity
tf2pi = tf.constant(2*np.pi,dtype=tf.float64)
def log_gauss_norm(h,d):
    return -0.5*d*tf.math.log(tf2pi)-d*tf.math.log(h)
def gauss(x,d,h):
    y = log_gauss_norm(h,d)-0.5*tf.reduce_sum(x**2,axis=-1)
    return tf.math.exp(y)
@tf.function
def my_kde(x,data_array,bandwidth=2.):
    n_features = tf.cast(float(data_array.shape[-1]),tf.float64)
    bandwidth = tf.cast(bandwidth,tf.float64)
    assert len(x.shape)==2
    x = x[:,tf.newaxis,:]
    y = gauss((x-data_array)/bandwidth,d=n_features,h=bandwidth)
    y = tf.reduce_mean(y,axis=-1)
    return tf.math.log(y)
# succeeds (40 features)
np.random.seed(0)
basic = np.array(np.random.normal(0,1.0,size=[10000,40]),dtype=np.float64)
kde = KernelDensity(kernel='gaussian',bandwidth=2).fit(basic)
y1 = my_kde(basic[0:2],basic)
tf.print(y1) # [-73.09079498452077 -71.975842500329691]
y2 = kde.score_samples(basic[0:2])
print(y2) # [-73.09079498 -71.9758425 ]
assert all(np.isclose(y1-y2,0.0))
# fails (800 features): result is -inf
np.random.seed(0)
basic = np.array(np.random.normal(0,1.0,size=[10000,800]),dtype=np.float64)
kde = KernelDensity(kernel='gaussian',bandwidth=2).fit(basic)
y1 = my_kde(basic[0:2],basic)
tf.print(y1) # [-inf -inf]
y2 = kde.score_samples(basic[0:2])
print(y2) # [-1298.87891138 -1298.87891138]
The problem is that with high-dimensional data, such as the 800 features in the code above, my version returns -inf, while sklearn.neighbors.KernelDensity still works and returns a meaningful lower bound. I want to reproduce that lower-bound behaviour. I tried digging through the source code and found that the critical code is written in the _kde_single_breadthfirst() function in sklearn\neighbors\_binary_tree.pxi, but I cannot understand this function. So I am asking here for help.
CodePudding user response:
I'm sorry: because of my lack of basic computer science knowledge, I did not understand at first why the data is stored in a tree structure when estimating density. I can now classify this problem as how to mimic a kd-tree or ball tree in TensorFlow, and then how to do the searching, the calculation, and the handling of the bounds.
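For what it is worth, the value sklearn prints for the 800-feature case, -1298.87891138, agrees with log_gauss_norm(h=2, d=800) - log(10000) to the printed precision. That is the contribution of the query point's own kernel alone (both query points are part of the fitted data). This suggests, although I have not checked it against the sklearn source, that the -inf comes from tf.math.exp underflowing to 0 before the mean is taken, rather than from anything specific to the tree. Below is a minimal sketch under that assumption. It stays in log space using tf.reduce_logsumexp. The name my_kde_logsumexp is my own, and this is a substituted log-sum-exp approach, not a reimplementation of _kde_single_breadthfirst().

```python
import numpy as np
import tensorflow as tf

LOG_2PI = tf.constant(np.log(2.0 * np.pi), dtype=tf.float64)

@tf.function
def my_kde_logsumexp(x, data_array, bandwidth=2.0):
    """Log-density of a Gaussian KDE, computed without leaving log space.

    Hypothetical helper for illustration; not part of sklearn or TensorFlow.
    """
    x = tf.cast(x, tf.float64)
    data_array = tf.cast(data_array, tf.float64)
    h = tf.cast(bandwidth, tf.float64)
    d = tf.cast(tf.shape(data_array)[-1], tf.float64)
    n = tf.cast(tf.shape(data_array)[0], tf.float64)
    # Log of the Gaussian normalisation constant (same formula as log_gauss_norm).
    log_norm = -0.5 * d * LOG_2PI - d * tf.math.log(h)
    # Scaled pairwise differences, shape [n_query, n_data, n_features].
    diff = (x[:, tf.newaxis, :] - data_array) / h
    # Log of each kernel value, shape [n_query, n_data].
    log_kernel = log_norm - 0.5 * tf.reduce_sum(diff ** 2, axis=-1)
    # log(mean(exp(log_kernel))), evaluated stably by reduce_logsumexp.
    return tf.reduce_logsumexp(log_kernel, axis=-1) - tf.math.log(n)

# Usage on a smaller high-dimensional sample (1000 points, 800 features).
rng = np.random.default_rng(0)
basic = rng.normal(0.0, 1.0, size=(1000, 800))
y1 = my_kde_logsumexp(basic[0:2], basic)
tf.print(y1)
```

Note that the intermediate tensor has shape [n_query, n_data, n_features], so for large query batches it may be necessary to evaluate in chunks. A tree would mainly help with speed on large data sets; for numerical stability alone, the log-space reduction may already be enough.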