I am new to Machine Learning and trying to do some exercises to figure some stuff out.
I am reading a csv file into a dataframe as such:
df = pd.read_csv("tweets.csv", header=None)
df.head()
I then want to use this dataframe (which is really just all my data) to do a train_test_split. I was looking into it and found that the way to do it is this:
# create dataset
X, y = make_blobs(n_samples=500)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(n_samples, test_size=0.20)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
However, I tried to make my dataset a NumPy array by doing n_samples = df.to_numpy() before the make_blobs call, but when I try this I get the following error:
ValueError: not enough values to unpack (expected 4, got 2)
I thought it was because the variable wasn't big enough, but it is, so I'm a bit lost. I understand the error means it expected to unpack 4 values but only received 2; however, I took the code straight from the docs, so I guess I'm missing something or misunderstanding it.
Could someone point me in the right direction, please?
CodePudding user response:
To solve your problem, you have to pass the relevant parts of your dataframe to train_test_split.
For your dataframe, you need to determine which columns are data (X) and which columns are labels (y):
X_cols = ['feat1', 'feat2', 'feat3']  # replace with your actual feature columns
y_cols = ['label']                    # replace with your actual label column
X_train, X_test, y_train, y_test = \
train_test_split(df[X_cols], df[y_cols], test_size=0.2)
At the end of the operation:
- X_train has 80% of df[X_cols]
- X_test has 20% of df[X_cols]
- y_train has 80% of df[y_cols]
- y_test has 20% of df[y_cols]
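Putting the pieces above together, here is a minimal, self-contained sketch. The dataframe and the column names (feat1, feat2, feat3, label) are made up for illustration; substitute your own columns from tweets.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for your real data (column names are made up)
df = pd.DataFrame({
    'feat1': range(10),
    'feat2': range(10, 20),
    'feat3': range(20, 30),
    'label': [0, 1] * 5,
})

X_cols = ['feat1', 'feat2', 'feat3']  # feature columns
y_cols = ['label']                    # label column

# Split features and labels together so rows stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    df[X_cols], df[y_cols], test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
print(y_train.shape, y_test.shape)  # (8, 1) (2, 1)
```

With 10 rows and test_size=0.2, you get 8 training rows and 2 test rows for both X and y.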
Update
If you want to use the data created by make_blobs, use:
X, y = make_blobs(n_samples=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
CodePudding user response:
It seems like you have only provided one input array to train_test_split, which means it just splits that array into 2 depending on the test ratio. In the example given in the docs, there are two inputs to train_test_split (X and y), and the function splits each of them into two arrays (resulting in 4 values).
The sklearn function knows nothing about the arrays being provided. It randomly splits as many arrays as you provide into two (train and test) according to the given ratio, while guaranteeing that the rows are split in the same order for each array (i.e. the training data, its labels, and any other per-row metadata you keep in other arrays stay aligned).
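To see this concretely, here is a short sketch showing that the number of outputs is always twice the number of input arrays: one array in gives two arrays out, two arrays in gives four out.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)                 # 10 labels

# One input array -> a list of 2 outputs (train, test).
# Unpacking this into 4 names raises the ValueError from the question.
parts = train_test_split(X, test_size=0.2)
print(len(parts))  # 2

# Two input arrays -> 4 outputs, with rows split identically for X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(len(X_train), len(X_test))  # 8 2
```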