Creating training.csv and test.csv files after splitting a dataset using sklearn

Time:06-25

I am working on the iris dataset. I was able to split it into a training set and a test set:

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = .3, random_state = 50)

Now I want to export two separate CSV files: one for the training set and one for the test set.

training_set.csv will contain X_train and Y_train.

test_set.csv will contain X_test and Y_test.

I tried this:

training_set = pd.DataFrame(X_train, Y_train)

Which returned:

            sepal.width  petal.length  petal.width
variety
Setosa              NaN           NaN          NaN
Setosa              NaN           NaN          NaN
Setosa              NaN           NaN          NaN
Virginica           NaN           NaN          NaN
Virginica           NaN           NaN          NaN
...                 ...           ...          ...
Versicolor          NaN           NaN          NaN
Virginica           NaN           NaN          NaN
Setosa              NaN           NaN          NaN
Virginica           NaN           NaN          NaN
Virginica           NaN           NaN          NaN

105 rows × 3 columns

How should I proceed?

Thank you.

CodePudding user response:

From my answer here, load the dataset and convert it to a dataframe:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Combine features and target into one frame; add a readable species column
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target']).astype({'target': int}) \
       .assign(species=lambda x: x['target'].map(dict(enumerate(iris['target_names']))))

X_train, X_test, y_train, y_test = \
    train_test_split(df.iloc[:, :4], df['species'], test_size=.3, random_state=50)

training_set = pd.concat([X_train, y_train], axis=1)
test_set = pd.concat([X_test, y_test], axis=1)

training_set.to_csv('training.csv')
test_set.to_csv('test.csv')

Note: you can use either the target (int) or the species (str) column as the y vector.
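
For context on why the attempt in the question produced only NaN: the second positional argument of pd.DataFrame is the index, so pd.DataFrame(X_train, Y_train) reindexes the rows of X_train by the labels in Y_train instead of adding a column. A minimal sketch of that behaviour, using small made-up frames as stand-ins for the real split (the values and row labels below are assumptions for illustration, not the actual iris data):

```python
import pandas as pd

# Hypothetical stand-ins for X_train / Y_train after a shuffled split
X_train = pd.DataFrame(
    {"sepal.width": [3.5, 3.0, 2.8], "petal.length": [1.4, 5.1, 4.5]},
    index=[7, 2, 9],
)
Y_train = pd.Series(
    ["Setosa", "Virginica", "Versicolor"], index=[7, 2, 9], name="variety"
)

# Y_train is taken as the new index: X_train has no rows labelled
# "Setosa", "Virginica", ... so every cell becomes NaN
broken = pd.DataFrame(X_train, Y_train)

# Concatenating column-wise aligns on the shared row labels instead
fixed = pd.concat([X_train, Y_train], axis=1)

print(broken)
print(fixed)
```

Because pd.concat aligns on the row labels that train_test_split preserves, each feature row stays paired with its own label.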

CodePudding user response:

IIUC, you are trying to save the training and test datasets to CSV files. Is that correct?

Did you try this, and did it not work?

pd.DataFrame(X_train, Y_train).to_csv('training.csv')

OR

training_set.to_csv('training.csv')
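
One detail worth checking when saving: to_csv writes the row index as an extra first column by default. A small sketch of the options, assuming a made-up training_set and a temporary directory rather than the real files (both are illustrative choices, not part of the original code):

```python
import os
import tempfile

import pandas as pd

# Hypothetical training_set with the shuffled row labels a split leaves behind
training_set = pd.DataFrame(
    {"sepal.width": [3.5, 3.0, 2.8], "variety": ["Setosa", "Virginica", "Versicolor"]},
    index=[7, 2, 9],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training.csv")

    # Default: the index is written as an unnamed first column
    training_set.to_csv(path)
    with_index = pd.read_csv(path)               # index shows up as "Unnamed: 0"
    restored = pd.read_csv(path, index_col=0)    # index read back as the index

    # index=False drops the original row labels from the file
    training_set.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Whether to keep the index depends on whether the original row positions are needed later; index=False gives a file with only the feature and label columns.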