How to save spaCy training

I have written an intent classification program. It is first trained on training data and then evaluated on test data. Training takes a few seconds. What is the best way to save the result of training so that the model does not have to be retrained on every run? Is it enough to save train_X and train_y, or does the model itself have to be saved somehow?

import numpy as np
import pandas as pd
import os
import spacy
import csv
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC


# read data from csv
def read_data(path):
    with open(path, 'r') as csvfile:
        readcsv = csv.reader(csvfile, delimiter=',')
        labels = []
        sentences = []
        for row in readcsv:
            label = row[0]
            sentence = row[1]
            labels.append(label)
            sentences.append(sentence)
    return sentences, labels


# Loading Test Data

sentences_test, labels_test = read_data('./a_test.csv')

# print out the first two rows
print(sentences_test[:2], '\n')
print(labels_test[:2])

# Loading Training Data
sentences_train, labels_train = read_data('./a_train.csv')

# Load the spacy model: nlp
nlp = spacy.load('en_core_web_lg')
embedding_dim = nlp.vocab.vectors_length
print(embedding_dim)


def encode_sentences(sentences):
    # Calculate number of sentences
    n_sentences = len(sentences)

    print('Length :-', n_sentences)

    X = np.zeros((n_sentences, embedding_dim))
    # y = np.zeros((n_sentences, embedding_dim))

    # Iterate over the sentences
    for idx, sentence in enumerate(sentences):
        # Pass each sentence to the nlp object to create a document
        doc = nlp(sentence)
        # Save the document's .vector attribute to the corresponding row in
        # X
        X[idx, :] = doc.vector
    return X


train_X = encode_sentences(sentences_train)
test_X = encode_sentences(sentences_test)


# every label gets its own number
def label_encoding(labels):
    # Calculate the length of labels
    n_labels = len(labels)
    print('Number of labels :-', n_labels)

    le = LabelEncoder()
    y = le.fit_transform(labels)

    print(y[:100])
    print('Length of y :- ', y.shape)
    return y


train_y = label_encoding(labels_train)
test_y = label_encoding(labels_test)

df1 = pd.read_csv('./a_train.csv', delimiter=',')
df1.dataframeName = 'a_train.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')


df1.sample(10)
df1.describe()


# train_X and train_y are passed in as X and y.
def svc_training(X, y):
    # Create a support vector classifier
    clf = SVC(C=1)

    # Fit the classifier using the training data
    clf.fit(X, y)

    return clf


model = svc_training(train_X, train_y)
print(model.predict(train_X))

# Validation Step
def svc_validation(model, X, y):
    # Predict the labels of the test set
    y_pred = model.predict(X)

    # Count the number of correct predictions
    n_correct = 0
    for i in range(len(y)):
        if y_pred[i] == y[i]:
            n_correct += 1

    print("Predicted {0} correctly out of {1} examples".format(n_correct, len(y)))


#svc_validation(model, train_X, train_y)
#svc_validation(model, test_X, test_y)

CodePudding user response：

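From what I gather, you want to save the data so that it does not have to be recomputed every time you run the program. The best I can suggest is writing the arrays to dedicated text files that serve as a data dump. Note that this stores only the encoded data, so the classifier still has to be fitted again after loading.

# np is the numpy import from the question
np.savetxt("train_X.txt", train_X)            # one row per sentence vector
np.savetxt("train_y.txt", train_y, fmt="%d")  # integer class labels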
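
A data dump like this also needs a matching load step. The following is only a sketch of one way to round-trip the arrays: it swaps the plain text file for NumPy's np.savez archive format, and the helper names save_features and load_features, the file name train_dump.npz, and the small demo arrays are all made up for illustration (in the question they would be train_X and train_y).

```python
import numpy as np


def save_features(path, X, y):
    """Write the feature matrix and label vector into one .npz archive."""
    np.savez(path, X=X, y=y)


def load_features(path):
    """Read back the two arrays stored by save_features."""
    with np.load(path) as data:
        return data["X"], data["y"]


# Small stand-in arrays (hypothetical); the question would pass train_X, train_y.
demo_X = np.arange(6, dtype=float).reshape(3, 2)
demo_y = np.array([0, 1, 0])

save_features("train_dump.npz", demo_X, demo_y)
loaded_X, loaded_y = load_features("train_dump.npz")
```

Reloading the arrays skips the spaCy encoding step, but the classifier still has to be fitted again from loaded_X and loaded_y.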

CodePudding user response：

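spaCy has methods to write a pipeline to disk and read it back.

Call nlp.to_disk(path) to store the pipeline on your hard drive, then nlp.from_disk(path) to retrieve it. Note that these methods belong to the spaCy pipeline object (nlp in the question); the scikit-learn SVC stored in model does not have them. Let me know if this answers your question.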
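
For the classifier itself (the scikit-learn SVC in the question), one generic option is to pickle the fitted object and reuse it on later runs. The sketch below shows that train-once pattern using only the standard library. The helper load_or_train, the file name model_cache_demo.pkl, and the stand-in fake_training function are hypothetical; a plain dict takes the place of the fitted model so the example runs without scikit-learn.

```python
import os
import pickle


def load_or_train(path, train_fn):
    """Return the object pickled at `path`; if the file does not exist,
    build the object with train_fn(), pickle it, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    obj = train_fn()
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return obj


calls = []  # records how often the (expensive) training step runs


def fake_training():
    """Stand-in for svc_training(train_X, train_y) from the question."""
    calls.append(1)
    return {"weights": [0.1, 0.2]}


cache_path = "model_cache_demo.pkl"  # hypothetical file name
if os.path.exists(cache_path):
    os.remove(cache_path)  # start the demo from a clean state

first = load_or_train(cache_path, fake_training)   # trains, then saves
second = load_or_train(cache_path, fake_training)  # loads from disk
```

With the question's code, train_fn could be lambda: svc_training(train_X, train_y). Two caveats apply: a pickle file should only be loaded if it comes from a trusted source, and it is safest to load it with the same library versions that wrote it. To turn predicted numbers back into label names, the fitted LabelEncoder would have to be saved as well, which the question's label_encoding function does not currently return.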