I have extracted word embeddings for two different texts (title and description) and want to train an XGBoost
model on both of them. Each embedding is 200-dimensional.
I was able to train the model on a single embedding column, and it worked perfectly, like this:
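import numpy as np
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

x = df['FastText']  # training features: one embedding vector per row
y = df['Category']  # target variable
# Define the model
model = XGBClassifier(objective='multi:softprob')
# Evaluation metrics
score = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
# Train with 5-fold cross-validation; np.vstack turns the column of vectors into a 2-D array
scores = cross_validate(model, np.vstack(x), y, cv=5, scoring=score)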
Now I want to use both features for training, but I get an error if I pass two columns of df like this:
x=df[['FastText_Title','FastText']]
One solution I tried is adding the two embeddings element-wise (x1 + x2), but that decreases accuracy significantly. How do I use both features in the cross_validate
function? Kindly help.
CodePudding user response:
In the past for multiple inputs, I've done this:
features = ['FastText_Title', 'FastText']
x = df[features]
y = df['Category']
This creates a single feature set containing both columns. I usually also need to scale the data with MinMaxScaler once the new array has been made.
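If each cell of the two columns holds a whole embedding vector, as in the question, selecting both columns gives a two-column table of vectors rather than one numeric matrix. A minimal sketch of one way to join them side by side is below. The helper name stack_embedding_columns and the toy 4-dimensional data are assumptions made for illustration; the real columns would hold 200-dimensional vectors.

```python
import numpy as np
import pandas as pd


def stack_embedding_columns(df, columns):
    """Concatenate per-row embedding vectors from several columns into one 2-D array."""
    # np.vstack turns a column of 1-D vectors into an (n_rows, dim) matrix;
    # np.hstack then joins those matrices side by side, column after column.
    blocks = [np.vstack(df[col].tolist()) for col in columns]
    return np.hstack(blocks)


# Toy data standing in for the real DataFrame (4-dim vectors instead of 200).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "FastText_Title": [rng.normal(size=4) for _ in range(6)],
    "FastText": [rng.normal(size=4) for _ in range(6)],
})

X = stack_embedding_columns(toy, ["FastText_Title", "FastText"])
print(X.shape)  # (6, 8): 6 rows, 4 + 4 features
```

With the real data, the resulting matrix would have 400 columns and could be passed to cross_validate in place of np.vstack(x). Unlike adding the two vectors, concatenation keeps the title and description features separate, so the model can weight them independently.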
CodePudding user response:
Judging by the error you are getting, something seems to be wrong with the column types. Try this; it converts your features to numeric, and it should work:
df['FastText'] = pd.to_numeric(df['FastText'])
df['FastText_Title'] = pd.to_numeric(df['FastText_Title'])
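Before converting anything, it may help to check what each cell of a feature column actually holds. The sketch below uses a hypothetical helper, describe_column, and toy data; it only reports the column dtype and the type and shape of the first cell. pd.to_numeric is meant for columns of scalar values, so if a cell turns out to be a whole vector, the conversion above may not apply as written.

```python
import numpy as np
import pandas as pd


def describe_column(series):
    """Report the column dtype plus the type and shape of its first cell."""
    first = series.iloc[0]
    return {
        "dtype": str(series.dtype),
        "cell_type": type(first).__name__,
        # Scalars have no .shape attribute, so fall back to None for them.
        "cell_shape": getattr(first, "shape", None),
    }


# Toy frame: each FastText cell is a small vector, as in the question.
toy = pd.DataFrame({
    "FastText": [np.zeros(4), np.ones(4)],
    "Category": ["a", "b"],
})

info = describe_column(toy["FastText"])
print(info)
```

A dtype of object together with an ndarray cell type indicates a column of vectors, which needs stacking into a 2-D array rather than a scalar type conversion.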