I am using this code to split a data frame:
df_80_split = df.sample(frac=0.8,random_state=200)
What I need is to get the remaining entries of the original data frame into a new data frame, something like:
df_20_split = df - df_80_split
What would be a good way to code that?
CodePudding user response:
Using sklearn's train_test_split() is a really good method, especially for large data sets.
# Import sklearn's helper for splitting data
from sklearn.model_selection import train_test_split
# Using your variable names
df_80_split, df_20_split = train_test_split(df, test_size=0.2, random_state=200)
If you pass the target variable as well, the features and targets are split with the same row assignment, which gives you matching training and test sets:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=200
)
There is a lot more detail in the scikit-learn docs for train_test_split.
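As a rough, self-contained sketch of the features/target split described above: the column names x1, x2 and label and the sample data are made up for illustration, so substitute your own columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample data: two feature columns and one target column
df = pd.DataFrame({
    "x1": range(100),
    "x2": range(100, 200),
    "label": [0, 1] * 50,
})

features = df[["x1", "x2"]]
target = df["label"]

# Both inputs are shuffled with the same permutation, so rows stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=200
)

print(X_train.shape, X_test.shape)
print(X_train.index.equals(y_train.index))
```

The returned pieces keep the original index labels, so you can still trace each row back to df.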
CodePudding user response:
Assuming the index values of your dataframe are all unique, a pure-Pandas solution would be:
df_20_split = df[~df.index.isin(df_80_split.index)]
Full code:
import pandas as pd

# Just sample data
df = pd.DataFrame({'a':[*'abcdefg']*1000}).sort_values('a').reset_index(drop=True)
# Split the data
df_80_split = df.sample(frac=0.8, random_state=200)
# Get the remainder
df_20_split = df[~df.index.isin(df_80_split.index)]
Output:
>>> df_80_split.shape
(5600, 1)
>>> df_20_split.shape
(1400, 1)
>>> 5600 + 1400
7000
>>> df.shape
(7000, 1)
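Two variations on this, as a sketch rather than a definitive recipe. df.drop() with the sampled index should give the same remainder as the isin mask, under the same unique-index assumption. If the index contains duplicate labels, one option is to split by position with a boolean mask built in NumPy. The df_dup frame and the names dup_80 and dup_20 are made up for illustration, and the NumPy generator will not pick the same rows as df.sample(random_state=200).

```python
import numpy as np
import pandas as pd

# Same kind of sample data as above, with a unique default index
df = pd.DataFrame({"a": list("abcdefg") * 1000})

# Variation 1: drop the sampled labels (requires unique index labels)
df_80_split = df.sample(frac=0.8, random_state=200)
df_20_split = df.drop(df_80_split.index)

# Variation 2: hypothetical frame whose index labels repeat,
# so label-based selection cannot separate the rows
df_dup = pd.DataFrame({"a": list("abcdefg") * 1000}, index=[0, 1] * 3500)

# Pick 80% of the row positions without replacement
rng = np.random.default_rng(200)
n_sample = int(round(0.8 * len(df_dup)))
picked = rng.choice(len(df_dup), size=n_sample, replace=False)

# Boolean mask over positions: True = in the 80% part
mask = np.zeros(len(df_dup), dtype=bool)
mask[picked] = True

dup_80 = df_dup[mask]
dup_20 = df_dup[~mask]

print(df_20_split.shape, dup_80.shape, dup_20.shape)
```

Note that the mask version keeps the rows in their original order, whereas df.sample() returns them shuffled.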