I have a PySpark dataframe like this:
------ ---------------------------------------------------------------------
|id |features |
------ ---------------------------------------------------------------------
|2484 |[0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562] |
|2504 |[0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804]|
|2904 |[0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453] |
|3084 |[0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207] |
|3164 |[0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606] |
|377784|[0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296] |
|425114|[0.14556459, 0.25762737, 0.09011796, -0.27128243, 0.011280057] |
|455074|[0.13579306, 0.3266111, 0.016416805, -0.31139722, -0.054227617] |
|532624|[0.22281846, 0.1575731, 0.14126688, -0.29887098, -0.09433056] |
|781654|[0.1381407, 0.14674455, 0.06877328, -0.13415968, -0.06589967] |
------ ---------------------------------------------------------------------
Now I have to find the nearest neighbors for these features, so here are my steps:
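import numpy as np
from sklearn.neighbors import NearestNeighbors
df_collect = df.toPandas()
# converting each list in the column to a NumPy array
df_collect['features'] = df_collect['features'].apply(lambda x: np.array(x))
# this yields a 1-D object array whose elements are arrays
features = df_collect['features'].to_numpy()
knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)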
At this point I get the following error:
TypeError Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
/tmp/ipykernel_6511/1498389666.py in <module>
----> 1 knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)
~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/neighbors/_unsupervised.py in fit(self, X, y)
164 The fitted nearest neighbors estimator.
165 """
--> 166 return self._fit(X)
~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
433 else:
434 if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
--> 435 X = self._validate_data(X, accept_sparse="csr")
436
437 self._check_algorithm_metric()
~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
564 raise ValueError("Validation should be done on X, y or both.")
565 elif not no_val_X and no_val_y:
--> 566 X = check_array(X, **check_params)
567 out = X
568 elif no_val_X and not no_val_y:
~/miniconda3/envs/dev_env_37/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
744 array = array.astype(dtype, casting="unsafe", copy=False)
745 else:
--> 746 array = np.asarray(array, order=order, dtype=dtype)
747 except ComplexWarning as complex_warning:
748 raise ValueError(
ValueError: setting an array element with a sequence.
I have checked the sizes of all the subarrays and they are the same, and so are the data types. Can someone please point out what is wrong here?
Output of features:
array([array([ 0.01691085, 0.02598964, 0.00253213, -0.02223251, -0.00701562]),
array([ 0.01501954, 0.02484422, 0.00292799, -0.02077107, -0.00611118]),
array([ 0.01410413, 0.02474243, 0.00117077, -0.02167515, -0.00508685]),
...,
array([ 0.01896316, 0.03188267, 0.00258667, -0.02800867, -0.00646481]),
array([ 0.03538242, 0.07453772, 0.00816828, -0.02914227, -0.0942148 ]),
array([ 0.02470775, 0.02561068, 0.00401011, -0.02863882, -0.00419102])],
dtype=object)
CodePudding user response:
I had to modify the table string to be able to convert it into a Pandas dataframe. Then this code works fine:
from sklearn.neighbors import NearestNeighbors
from io import StringIO
import numpy as np
import pandas as pd
df_str = """2484, 0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562
2504, 0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804
2904, 0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453
3084, 0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207
3164, 0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606
377784, 0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296
425114, 0.14556459, 0.25762737, 0.09011796, -0.27128243, 0.011280057
455074, 0.13579306, 0.3266111, 0.016416805, -0.31139722, -0.054227617
532624, 0.22281846, 0.1575731, 0.14126688, -0.29887098, -0.09433056
781654, 0.1381407, 0.14674455, 0.06877328, -0.13415968, -0.06589967"""
# convert to pandas frame
data = StringIO(df_str)
df = pd.read_csv(data, sep=",", names=['id'] + ['feat_{}'.format(i) for i in range(1, 6)])
# drop the id column and convert the feature columns to a 2D array
features = df.drop(columns=['id']).to_numpy()
# fit kNN
knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)
# output
knnobj.get_params()
> {'algorithm': 'auto',
'leaf_size': 30,
'metric': 'minkowski',
'metric_params': None,
'n_jobs': None,
'n_neighbors': 5,
'p': 2,
'radius': 1.0}
Given the cryptic error message, my guess is that the conversion of df_collect
introduces an erroneous data format that throws off the kNN.
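One way to check that guess is to look at the dtype and shape of the array before fitting. The following is only a minimal sketch: the hand-built object array is a hypothetical stand-in for the result of df_collect['features'].to_numpy() in the question, using three rows from the posted data.

```python
import numpy as np

# Hypothetical stand-in for df_collect['features'].to_numpy():
# a 1-D object array whose elements are themselves float arrays.
rows = [
    [0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562],
    [0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804],
    [0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453],
]
features = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    features[i] = np.array(row)

# An object dtype and a 1-D shape indicate an "array of arrays",
# not the (n_samples, n_features) matrix that scikit-learn expects.
print(features.dtype, features.shape)

# scikit-learn's validation asks NumPy for a numeric array; on this
# input the cast fails with a ValueError.
try:
    np.asarray(features, dtype=np.float64)
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print("conversion failed:", exc)
```

If the dtype is object and the shape has only one dimension, the data is most likely nested rather than truly two-dimensional.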
CodePudding user response:
df.toPandas() returns a column of lists. You need to convert this column of lists to a 2D array. When you do df_collect['features'].apply(lambda x: np.array(x)).to_numpy(), you get an array of arrays, which is not the same as a 2D array. So you need:
df_collect = df.toPandas()
features = np.array(df_collect.features.to_list())
knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(features)
As an alternative, you can pass the nested list directly to NearestNeighbors:
knnobj = NearestNeighbors(n_neighbors=5, algorithm='auto').fit(df_collect.features.to_list())
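To illustrate the difference between an array of arrays and a 2D array, here is a small self-contained sketch. It assumes six rows taken from the question, builds the object array by hand as a stand-in for the Pandas column, and uses np.stack as one more way to get the 2D array (this is not from the original post). n_neighbors=3 is chosen only because the sample is small.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Six feature rows copied from the question's dataframe.
rows = [
    [0.016910851, 0.025989642, 0.0025321299, -0.022232508, -0.00701562],
    [0.015019539, 0.024844216, 0.0029279909, -0.020771071, -0.0061111804],
    [0.014104126, 0.02474243, 0.0011707658, -0.021675153, -0.0050868453],
    [0.110674664, 0.17139696, 0.059836507, -0.1926481, -0.060425207],
    [0.17688861, 0.2159168, 0.10567094, -0.17365277, -0.016458606],
    [0.18425785, 0.34397766, 0.022859085, -0.35151178, -0.07897296],
]

# Stand-in for the problematic input: 1-D object array of arrays.
features_obj = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    features_obj[i] = np.array(row)

# Stack the inner arrays into a real (n_samples, n_features) matrix.
features_2d = np.stack(features_obj)
print(features_obj.shape, features_2d.shape)

# Fitting works on the 2D array; query the training points themselves.
knn = NearestNeighbors(n_neighbors=3, algorithm='auto').fit(features_2d)
distances, indices = knn.kneighbors(features_2d)
print(indices)
```

Since the query points are the training points, the first neighbor of each row is the row itself at distance zero, which is a quick sanity check that the fit used the intended matrix.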