Pandas: Split a dataframe and get the remainder of the rows

I am using this code to split a data frame:

df_80_split = df.sample(frac=0.8,random_state=200)

What I need is to get the remaining rows of the original data frame into a new data frame, something like:

df_20_split = df - df_80_split

What would be a good way to code that?

CodePudding user response：

scikit-learn's train_test_split() is a good way to do this, especially for large data sets. It returns both parts in a single call:

# Import scikit-learn's helper for splitting data
from sklearn.model_selection import train_test_split

# Using your variable names
df_80_split, df_20_split = train_test_split(df, test_size=0.2, random_state=200)

If you pass the target variable as well, the same call splits both features and targets into matching training and validation sets:

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=200)

The scikit-learn documentation covers the remaining options in detail.
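
To make the features/target call concrete, here is a minimal sketch. The sample data and the column names x1, x2 and label are made up for illustration; only train_test_split() and its test_size and random_state arguments come from the answer above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two feature columns and one target column named "label"
df = pd.DataFrame({
    "x1": range(10),
    "x2": range(10, 20),
    "label": [0, 1] * 5,
})

# Separate the features from the target before splitting
features = df.drop(columns="label")
target = df["label"]

# One call splits both objects with the same row assignment
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=200
)

print(X_train.shape, X_test.shape)
```

The train and test pieces keep their original index labels, so X_train and y_train stay aligned row for row.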

CodePudding user response：

Assuming the index values of your dataframe are all unique, a pure-Pandas solution would be:

df_20_split = df[~df.index.isin(df_80_split.index)]

Full code:

import pandas as pd

# Just sample data
df = pd.DataFrame({'a': [*'abcdefg'] * 1000}).sort_values('a').reset_index(drop=True)

# Split the data
df_80_split = df.sample(frac=0.8, random_state=200)

# Get the remainder
df_20_split = df[~df.index.isin(df_80_split.index)]

Output:

>>> df_80_split.shape
(5600, 1)

>>> df_20_split.shape
(1400, 1)

>>> 5600 + 1400
7000

>>> df.shape
(7000, 1)
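
The isin() filter relies on the index being unique. As a rough sketch of two related options: with a unique index, DataFrame.drop() by label should give the same remainder, and when index labels repeat, one possible fallback is to split by row position with a boolean mask. The sample data, the NumPy random generator and the names df_unique, picked and mask are assumptions for illustration and are not part of the original answer. Note that the positional variant does not pick the same rows as df.sample() and keeps rows in their original order.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data whose index labels repeat (0..4 appears twice)
df = pd.DataFrame({"a": list("abcdefghij")}, index=[0, 1, 2, 3, 4] * 2)

# Case 1: unique index. drop() by label matches the isin() filter.
df_unique = df.reset_index(drop=True)
sampled = df_unique.sample(frac=0.8, random_state=200)
rest_by_drop = df_unique.drop(sampled.index)
rest_by_isin = df_unique[~df_unique.index.isin(sampled.index)]
same = rest_by_drop.equals(rest_by_isin)

# Case 2: repeated labels. Pick 80% of the row positions without replacement.
rng = np.random.default_rng(200)
n_sample = int(round(0.8 * len(df)))
picked = rng.choice(len(df), size=n_sample, replace=False)

# Boolean mask over positions: True marks a sampled row
mask = np.zeros(len(df), dtype=bool)
mask[picked] = True

df_80_split = df[mask]    # sampled rows
df_20_split = df[~mask]   # everything else, regardless of index labels

print(same, df_80_split.shape, df_20_split.shape)
```

If the original labels do not matter, calling reset_index(drop=True) first and then using the isin() one-liner is the simpler route.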