Splitting data for machine learning

I have a dataframe of texts, and I want to split it by the "writer" column for machine learning. For example, I want to train on texts by Aeschylus and Sophocles and test on texts by Euripides. How can I do that? I am using sklearn.

CodePudding user response：

Try adapting your code along these lines:

import pandas as pd

authors = ["Aesch", "Soph", "Euri", "Aesch", "Soph", "Euri"]
df = pd.DataFrame(authors, columns=["author"])
df["text"] = ["abc", "bcd", "cde", "abc", "bcd", "cde"]

# split the dataframe with a boolean condition on the author column
train = df[df["author"] != "Euri"]
test = df[df["author"] == "Euri"]
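
If you have several training writers, the same idea can be written with `isin`. The sketch below is only an illustration: the column names "writer" and "text" and the toy strings are assumptions, and TfidfVectorizer is just one possible way to turn the texts into features. The point it shows is that the vectorizer is fitted on the training rows only and then reused on the held-out writer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data; the column names "writer" and "text" are assumptions.
df = pd.DataFrame({
    "writer": ["Aeschylus", "Sophocles", "Euripides",
               "Aeschylus", "Sophocles", "Euripides"],
    "text": ["abc def", "bcd efg", "cde fgh",
             "abc efg", "bcd fgh", "cde def"],
})

# Rows by these writers go to training; everything else is held out.
train_writers = ["Aeschylus", "Sophocles"]
is_train = df["writer"].isin(train_writers)
train_df = df[is_train].reset_index(drop=True)
test_df = df[~is_train].reset_index(drop=True)

# Fit the vocabulary on the training texts only, then reuse it on the
# test texts so nothing from the held-out writer leaks into the features.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["text"])
X_test = vectorizer.transform(test_df["text"])
```

Words that only appear in the held-out writer's texts are ignored by `transform`, which is what you want for an honest test.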

CodePudding user response：

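That's what GroupKFold is for if you want cross-validation. In addition to the features and target, it takes a groups array and keeps each group entirely in either the training or the test fold: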

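from sklearn.model_selection import GroupKFold
group_kfold = GroupKFold(n_splits=2)
# split() yields (train_idx, test_idx) index arrays for each fold, not the data itself
train_idx, test_idx = next(group_kfold.split(X, y, groups=group))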

See documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold
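
To make the index-based behaviour of GroupKFold concrete, here is a minimal sketch on toy data. The arrays `X`, `y` and `groups` are made up for illustration (the labels in `y` are hypothetical), and `n_splits=3` is chosen so that each of the three writers is held out once. Note that GroupKFold decides which groups land in which fold, so if you need Euripides specifically as the test set, a boolean mask on the dataframe is the more direct route.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data; y holds hypothetical labels just so split() has a target.
X = np.array(["abc", "bcd", "cde", "abc", "bcd", "cde"])
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["Aeschylus", "Sophocles", "Euripides",
                   "Aeschylus", "Sophocles", "Euripides"])

group_kfold = GroupKFold(n_splits=3)

folds = []  # (training writers, held-out writers) per fold
for train_idx, test_idx in group_kfold.split(X, y, groups=groups):
    # split() returns integer positions; index the arrays yourself.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    train_writers = set(groups[train_idx])
    test_writers = set(groups[test_idx])
    folds.append((train_writers, test_writers))
    print("train on:", sorted(train_writers), "test on:", sorted(test_writers))
```

No writer ever appears on both sides of a fold, which is the guarantee you are after when evaluating on an unseen author.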