How to split dataframe for scikit

Time:12-06

I have a big DataFrame. How can I divide it into 80% for training and 20% for testing? Thanks!

I tried split, but it didn't work.

CodePudding user response:

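from sklearn.model_selection import train_test_split

# df is your DataFrame; "target" is a placeholder for the name of your label column
X = df.drop(columns=["target"])  # feature columns
y = df["target"]                 # target column

# Hold out 20% of the rows for testing; the remaining 80% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)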

X_train and y_train then contain 80% of the data, and X_test and y_test contain the remaining 20%.
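To check the sizes, here is a minimal sketch on a made-up DataFrame. The column names feature_a, feature_b and target are assumptions for illustration only, so swap in your own:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two feature columns and a "target" column
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "target": [0, 1] * 5,
})

X = df.drop(columns=["target"])  # features
y = df["target"]                 # labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_test))  # 8 2
```

The rows are shuffled before splitting, and X and y are split with the same shuffle, so X_train and y_train keep matching index labels.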

CodePudding user response:

To split a DataFrame into a training set and a test set, you can use the sklearn.model_selection.train_test_split() function from the scikit-learn library.

Here is an example of how you can use this function to split a DataFrame into an 80% training set and a 20% test set:

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.2)

You can also specify the random_state parameter to fix the seed used to shuffle the rows before splitting. This makes the split reproducible. Here is an example:

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

In this code, random_state is set to 42, so the shuffle is seeded with 42 and the same rows end up in each set every time the code is run. Without random_state, each run produces a different split.
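A quick way to convince yourself is to split twice with the same seed and compare the results. This is only a sketch on a made-up one-column DataFrame (the column name value is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"value": range(20)})  # hypothetical toy data

# Two independent calls with the same seed
train_a, test_a = train_test_split(df, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=42)

# The same rows, in the same order, land in the test set both times
same_split = test_a.index.equals(test_b.index)
print(same_split)  # True
```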

Finally, you can also specify the stratify parameter to ensure that the training and test sets have the same proportions of target classes. This is useful for classification problems where you want to ensure that the training and test sets are representative of the overall distribution of target classes in the data. Here is an example:

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(df, test_size=0.2, stratify=df['target'])

In this code, the stratify parameter is set to the target column of the DataFrame, so each class appears in the training and test sets in the same proportion (up to rounding) as in the original DataFrame. This matters most when the classes are imbalanced, where a purely random split could leave very few examples of a rare class in the test set.
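As a rough illustration, assume a made-up imbalanced DataFrame with 80 rows of class 0 and 20 rows of class 1 (the column names are hypothetical). With stratify, both parts keep the 80/20 class ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of class 0, 20 rows of class 1
df = pd.DataFrame({
    "feature": range(100),
    "target": [0] * 80 + [1] * 20,
})

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

# Count the rows of each class in both parts
train_counts = train_df["target"].value_counts().to_dict()
test_counts = test_df["target"].value_counts().to_dict()

print(train_counts[0], train_counts[1])  # 64 16
print(test_counts[0], test_counts[1])    # 16 4
```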
