How would you train a model with a dataset that has 4 matrices per row? Below is a minimal reproducible example with a (2 rows, 4 matrices, 3 x 6 matrix) dataset to train on.
import numpy as np
import xgboost as xgb
# from scipy.sparse import csr_matrix
x = [np.array([[[ 985. , 935. , 396. , 258.5, 268. , 333. ],
[ 968. , 1000. , 1048. , 237.5, 308.5, 359.5],
[ 350. , 336. , 422. , 182.5, 264.5, 291.5]],
[[ 867. , 863. , 512. , 511. , 485.5, 525. ],
[ 917. , 914. , 739. , 450. , 524.5, 571. ],
[ 663. , 656. , 768. , 352.5, 460. , 439. ]],
[[ 569. , 554. , 269. , 240. , 240. , 263.5],
[ 597. , 592. , 560. , 222. , 244.5, 290. ],
[ 390. , 377. , 457. , 154.5, 289.5, 272. ]],
[[2002. , 2305. , 3246. , 3586.5, 3421.5, 3410. ],
[2378. , 2374. , 1722. , 3351.5, 3524. , 3456. ],
[3590. , 3457. , 3984. , 2620. , 2736.5, 2290. ]]]),
np.array([[[ 412. , 521. , 642. , 735. , 847.5, 358.5],
[ 471. , 737. , 558. , 331.5, 324. , 317.5],
[ 985. , 935. , 396. , 258.5, 268. , 333. ]],
[[ 603. , 674. , 786. , 966. , 1048. , 605.5],
[ 657. , 810. , 789. , 582. , 573. , 569.5],
[ 867. , 863. , 512. , 511. , 485.5, 525. ]],
[[ 325. , 426. , 544. , 730.5, 804.5, 366.5],
[ 396. , 543. , 486. , 339.5, 334. , 331. ],
[ 569. , 554. , 269. , 240. , 240. , 263.5]],
[[3133. , 3808. , 3617. , 4194.5, 4098. , 3802. ],
[3479. , 3488. , 3854. , 3860. , 3778.5, 3643. ],
[2002. , 2305. , 3246. , 3586.5, 3421.5, 3410. ]]])]
y = [np.array(6), np.array(10)]
Below is my attempt to convert the data into a DMatrix, which results in an error. I've tried other approaches too, such as using a csr_matrix.
One possible solution could be to turn this:
(2 rows, 4 matrices, 3 x 6 matrix)
into
(2 rows, ~10 features)
by applying dimensionality reduction to the matrices and reshaping them. I'm unsure whether this is the best solution?
# X = csr_matrix(x)
dtrain_xgb = xgb.DMatrix(x, label=y)  # this line raises an error
params = {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4, 'subsample': 0.8}
model = xgb.train(params=params, dtrain=dtrain_xgb, num_boost_round=200)
CodePudding user response:
Reshape the arrays into a DataFrame by flattening each row's four 3 x 6 matrices into a single row of 4 * 3 * 6 = 72 columns:
import pandas as pd
# flatten each (4, 3, 6) sample into one row of 72 columns, then stack the rows
data = pd.concat([pd.DataFrame(x[i].reshape(1, -1)) for i in range(len(x))], ignore_index=True)
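Since every sample has the same (4, 3, 6) shape, the same (2, 72) layout can also be built directly in NumPy; this is just an equivalent sketch (X_flat is an illustrative name, not from the original code):
X_flat = np.stack(x).reshape(len(x), -1)  # (2, 4, 3, 6) -> (2, 72)
data = pd.DataFrame(X_flat)  # same result as the concat above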
Then reduce the DataFrame with PCA (e.g. from ~100K columns down to 100 on a full-sized dataset; note that n_components cannot exceed min(n_samples, n_features), so the two-row example above would need a much smaller value):
from sklearn.decomposition import PCA
pca = PCA(n_components=100)
principalComponents = pca.fit_transform(data.fillna(0))  # 'data' is the flattened DataFrame from above
principalDf = pd.DataFrame(data=principalComponents)
principalDf
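To actually train on the reduced features, build a DMatrix from the resulting 2-D DataFrame and the labels, then call xgb.train as in the question. A minimal sketch (y_train and dtrain are illustrative names; with the two-row toy data the model will fit but won't be meaningful):
y_train = np.array([6, 10])  # labels from the question
dtrain = xgb.DMatrix(principalDf, label=y_train)  # DMatrix accepts a 2-D DataFrame or array
params = {'max_depth': 3, 'learning_rate': 0.05, 'min_child_weight': 4, 'subsample': 0.8}
model = xgb.train(params=params, dtrain=dtrain, num_boost_round=200)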