ValueError: setting an array element with a sequence while running NearestNeighbor

I have a PySpark DataFrame like this:

+------+---------------------------------------------------------------------+
|id    |features                                                             |
+------+---------------------------------------------------------------------+
|2484  |[0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562]  |
|2504  |[0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804]|
|2904  |[0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453] |
|3084  |[0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207]     |
|3164  |[0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606]       |
|377784|[0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296]      |
|425114|[0.14556459, 0.25762737, 0.09011796, -0.27128243, 0.011280057]       |
|455074|[0.13579306, 0.3266111, 0.016416805, -0.31139722, -0.054227617]      |
|532624|[0.22281846, 0.1575731, 0.14126688, -0.29887098, -0.09433056]        |
|781654|[0.1381407, 0.14674455, 0.06877328, -0.13415968, -0.06589967]        |
+------+---------------------------------------------------------------------+

Now I need to find the nearest neighbors for these features, so here are my steps:

import numpy as np
from sklearn.neighbors import NearestNeighbors

df_collect = df.toPandas()
# convert each list in the column to a NumPy array
df_collect['features'] = df_collect['features'].apply(lambda x: np.array(x))
features = df_collect['features'].to_numpy()

knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)

At this point I get the following error:

TypeError                                 Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_6511/1498389666.py in <module>
----> 1 knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/neighbors/_unsupervised.py in fit(self, X, y)
    164             The fitted nearest neighbors estimator.
    165         """
--> 166         return self._fit(X)

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
    433         else:
    434             if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
--> 435                 X = self._validate_data(X, accept_sparse="csr")
    436 
    437         self._check_algorithm_metric()

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    564             raise ValueError("Validation should be done on X, y or both.")
    565         elif not no_val_X and no_val_y:
--> 566             X = check_array(X, **check_params)
    567             out = X
    568         elif no_val_X and not no_val_y:

~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    744                     array = array.astype(dtype, casting="unsafe", copy=False)
    745                 else:
--> 746                     array = np.asarray(array, order=order, dtype=dtype)
    747             except ComplexWarning as complex_warning:
    748                 raise ValueError(

ValueError: setting an array element with a sequence.

I have checked the sizes of all the subarrays and they are all the same, as are the data types. Can someone please point out what could be wrong here?

Output of features:

array([array([ 0.01691085,  0.02598964,  0.00253213, -0.02223251, -0.00701562]),
       array([ 0.01501954,  0.02484422,  0.00292799, -0.02077107, -0.00611118]),
       array([ 0.01410413,  0.02474243,  0.00117077, -0.02167515, -0.00508685]),
       ...,
       array([ 0.01896316,  0.03188267,  0.00258667, -0.02800867, -0.00646481]),
       array([ 0.03538242,  0.07453772,  0.00816828, -0.02914227, -0.0942148 ]),
       array([ 0.02470775,  0.02561068,  0.00401011, -0.02863882, -0.00419102])],
      dtype=object)

CodePudding user response：

I had to modify the table string to be able to convert it into a pandas DataFrame. Then this code works fine:

from sklearn.neighbors import NearestNeighbors
from io import StringIO
import numpy as np
import pandas as pd

df_str = """2484, 0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562 
2504, 0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804
2904, 0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453
3084, 0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207
3164, 0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606
377784, 0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296
425114, 0.14556459, 0.25762737, 0.09011796, -0.27128243, 0.011280057
455074, 0.13579306, 0.3266111, 0.016416805, -0.31139722, -0.054227617
532624, 0.22281846, 0.1575731, 0.14126688, -0.29887098, -0.09433056
781654, 0.1381407, 0.14674455, 0.06877328, -0.13415968, -0.06589967"""

# convert to a pandas DataFrame
data = StringIO(df_str)
df = pd.read_csv(data, sep=",", names=['id'] + ['feat_{}'.format(i) for i in range(1, 6)])

# drop the id column and convert the feature columns to a 2D array
features = df.drop(columns=['id']).to_numpy()

# fit kNN
knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)

# output
knnobj.get_params()
> {'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'radius': 1.0}

Given the cryptic error message, my guess is that the conversion of df_collect produces a data format that throws off the kNN.
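One way to check that guess is to inspect the array before passing it to fit. The sketch below rebuilds a three-row stand-in for df_collect by hand (the frame is an assumption made for illustration, not the asker's real data), applies the same conversion as in the question, and prints the resulting shape and dtype.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for df.toPandas(): one Python list per row.
df_collect = pd.DataFrame({
    "id": [2484, 2504, 2904],
    "features": [
        [0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562],
        [0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804],
        [0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453],
    ],
})

# Same conversion as in the question.
features = df_collect["features"].apply(lambda x: np.array(x)).to_numpy()

print(features.shape)     # one entry per row, no second axis
print(features.dtype)     # object: each entry is a separate array
print(features[0].shape)  # each inner array holds the five feature values
```

A shape with a single axis and dtype=object means the estimator receives a list-like of arrays rather than a numeric matrix.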

CodePudding user response：

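df.toPandas() returns the features as a column of lists, while NearestNeighbors expects a 2D array of shape (n_samples, n_features). When you do df_collect['features'].apply(lambda x: np.array(x)).to_numpy() you get a 1D object array whose elements are themselves arrays, which is not the same as a 2D array, so NumPy cannot cast it to float. You need to convert the column of lists to a 2D array instead: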

df_collect = df.toPandas()
features = np.array(df_collect.features.to_list())
knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)
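The difference can be reproduced without Spark. This minimal sketch fills a 1D object array by hand to mimic the array of arrays, shows that casting it to float raises a ValueError, and then builds the 2D array from the nested list. np.stack is not part of the answer above; it is shown only as an option that should give the same result.

```python
import numpy as np

rows = [
    [0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562],
    [0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804],
    [0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453],
]

# Mimic the array of arrays: a 1D object array with one ndarray per slot.
array_of_arrays = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    array_of_arrays[i] = np.array(row)

# Casting to float fails, which is what scikit-learn's input check runs into.
try:
    np.asarray(array_of_arrays, dtype=float)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("ValueError:", exc)

# Building from the nested list gives a real 2D float array.
features_2d = np.array(rows)

# np.stack joins the inner arrays along a new first axis.
stacked = np.stack(array_of_arrays)

print(array_of_arrays.shape, features_2d.shape, stacked.shape)
```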

As an alternative, you can directly pass the nested list to NearestNeighbors:

knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(df_collect.features.to_list())
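Once the fit succeeds, the neighbors themselves come from kneighbors. The following is a small sketch using the first six feature rows from the question; n_neighbors=3 is chosen here only because the sample is small, and the variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# First six feature rows from the question, as a nested list.
rows = [
    [0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562],
    [0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804],
    [0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453],
    [0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207],
    [0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606],
    [0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296],
]
features = np.array(rows)  # 2D array, shape (6, 5)

knnobj = NearestNeighbors(n_neighbors=3, algorithm='auto').fit(features)

# Query with the first row. It is part of the fitted data, so it comes
# back as its own nearest neighbor, followed by the next closest rows.
distances, indices = knnobj.kneighbors(features[:1])

print(indices)    # row positions of the neighbors, nearest first
print(distances)  # matching distances, in ascending order
```

The indices are row positions in the fitted array, so they need to be mapped back through df_collect['id'] to recover the original ids.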