I'm working on a multi-class classification problem in which I have my data categorized into 8 classes.
What I want to do is extract all the instances belonging to one class from my training dataset and include them in my testing dataset.
What I have done so far is this:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Generate some data
df = pd.DataFrame({
'x1': np.random.normal(0, 1, 100),
'x2': np.random.normal(2, 3, 100),
'x3': np.random.normal(4, 5, 100),
'y': np.random.choice([0, 1, 2, 3, 4, 5, 6, 7], 100)})
df.head(10)
# Output is as follows
# x1 x2 x3 y
# 0 -0.742347 -2.064889 2.979338 6
# 1 0.182298 6.366811 7.435432 7 <-- Instance no. 1 will be stored in (filtered_df) in the next step
# 2 -1.015937 -3.214670 8.544494 4
# 3 0.688138 1.938480 4.028213 6
# 4 0.397756 0.064590 9.186234 5
# 5 0.095368 -3.255433 1.010394 1
# 6 0.609087 6.783653 4.390600 6
# 7 -0.017803 -1.571393 6.539134 5
# 8 0.814820 4.535381 2.175285 0
# 9 -0.573918 -0.672416 0.826967 6
# Take the instances labeled as class 7 out of the dataset
filtered_df = df[df['y']==7]
df.drop(df[df['y']==7].index, inplace=True)
df.head(10)
# Output is as follows
# x1 x2 x3 y
# 0 -0.742347 -2.064889 2.979338 6
# 2 -1.015937 -3.214670 8.544494 4 <-- Instance no. 1 is stored in (filtered_df) now
# 3 0.688138 1.938480 4.028213 6
# 4 0.397756 0.064590 9.186234 5
# 5 0.095368 -3.255433 1.010394 1
# 6 0.609087 6.783653 4.390600 6
# 7 -0.017803 -1.571393 6.539134 5
# 8 0.814820 4.535381 2.175285 0
# 9 -0.573918 -0.672416 0.826967 6
# 11 0.044094 2.581373 1.368575 5
# Extract the features and target
X = df.iloc[:, 0:3]
y = df.iloc[:, 3]
# Splitting the dataset into train, test and validate for binary classification
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
## Not sure how to add (filtered_df) to X_test and y_test now ?
I'm not sure how to continue from here. How can I add the instances stored in filtered_df to X_test and y_test?
CodePudding user response:
IIUC, for each class you can hold out all of its rows and concatenate them onto the test split after splitting the remaining data:
for klass in df['y'].unique():
    # Mask of rows that do NOT belong to the held-out class
    m = df['y'] != klass
    X = df.loc[m, df.columns[:3]]
    y = df.loc[m, df.columns[-1]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
    # Add every row of the held-out class to the test split only
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    X_test = pd.concat([X_test, df.loc[~m, df.columns[:3]]])
    y_test = pd.concat([y_test, df.loc[~m, df.columns[-1]]])
    # do stuff here
    ...
CodePudding user response:
Bit of an answer, bit of a request for further info (can't comment yet).
If you just want to combine the dataframes, pd.concat() is probably the way to go (.append() also works on older pandas versions, but it was removed in pandas 2.0).
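For instance, picking up the variable names from the question (a minimal sketch, assuming filtered_df still holds the class-7 rows and has the same columns as df):
# Append the held-out class-7 rows to the test split only
X_test = pd.concat([X_test, filtered_df[['x1', 'x2', 'x3']]])
y_test = pd.concat([y_test, filtered_df['y']])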
In your comments in the code block you say that you split the data set for binary classification but in the description of the problem you mention multi-class classification.
Two ways to approach this, the way I see it: either as a full-on multi-class classification (in which case you shouldn't need to separate out just one class at all) using a model that inherently supports multiple classes, or as a series of binary classifiers with the aim of ensembling them later (where you would create copies of the dataset for first class vs. all, second class vs. all, etc., but sklearn already handles that with sklearn.multiclass.OneVsRestClassifier). You could then get probabilities from each of the models (if they support them) and pick the class with the largest probability as your final prediction.
Just in case you haven't come across it already, it may be worth having a look at sklearn's "Multiclass and multioutput algorithms" page.
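As a rough sketch of the second (one-vs-rest) approach, assuming the X_train/X_test/y_train/y_test split from above; the LogisticRegression base estimator is just an example, any classifier with predict_proba would do:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Fit one binary classifier per class (class vs. rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
# predict_proba returns one probability column per class;
# predict already picks the class with the largest score
probs = ovr.predict_proba(X_test)
preds = ovr.predict(X_test)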