I built a text analysis model using C# and Microsoft's ML.NET library. Trained on the dataset supplied by Microsoft, the model predicts some comment strings well: for "Batteries not included" it prints Negative, and for "No batteries" it also prints Negative. However, when I test it against values such as "Not bad" and "This is really bad", it prints a prediction of Positive for both, which is not correct. Is there a bigger dataset text file that I can use to improve the accuracy of my model?
I implemented the Sentiment Analysis tutorial from the Microsoft documentation.
The dataset, yelp_labelled.txt, is pretty small (60 KB) for training text analysis models. It contains sample statements, each labelled either 0 (Negative) or 1 (Positive). Where can I find a larger dataset for training my text analysis model?
The code I am using is below:
using AnalysisSentiment;
using Microsoft.ML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;
//create a field to hold the data file
string _dataPath = "yelp_labelled.txt";
//initialize the context
MLContext mlContext = new MLContext();
TrainTestData splitDataView = LoadData(mlContext);
ITransformer model = BuildAndTrainModel(mlContext, splitDataView.TrainSet);
Evaluate(mlContext, model, splitDataView.TestSet);
UseModelWithSingleItem(mlContext, model);
TrainTestData LoadData(MLContext mlContext)
{
    IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentData>(_dataPath, hasHeader: false);
    TrainTestData splitDataView = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
    return splitDataView;
}
ITransformer BuildAndTrainModel(MLContext mlContext, IDataView splitTrainSet)
{
    var estimator = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentData.SentimentText))
        .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features"));
    Console.WriteLine("=============== Create and Train the Model ===============");
    var model = estimator.Fit(splitTrainSet);
    Console.WriteLine("=============== End of training ===============");
    Console.WriteLine();
    return model;
}
void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
{
    Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
    IDataView predictions = model.Transform(splitTestSet);
    CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");
    Console.WriteLine();
    Console.WriteLine("Model quality metrics evaluation");
    Console.WriteLine("--------------------------------");
    Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
    Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
    Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
    Console.WriteLine("=============== End of model evaluation ===============");
}
void UseModelWithSingleItem(MLContext mlContext, ITransformer model)
{
    PredictionEngine<SentimentData, SentimentPrediction> predictionFunction = mlContext.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(model);
    SentimentData sampleStatement = new SentimentData
    {
        SentimentText = "not bad"
    };
    var resultPrediction = predictionFunction.Predict(sampleStatement);
    Console.WriteLine();
    Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");
    Console.WriteLine();
    Console.WriteLine($"Sentiment: {resultPrediction.SentimentText} | Prediction: {(Convert.ToBoolean(resultPrediction.Prediction) ? "Positive" : "Negative")} | Probability: {resultPrediction.Probability} ");
    Console.WriteLine("=============== End of Predictions ===============");
    Console.WriteLine();
}
CodePudding user response:
- Transfer learning: Since your dataset is small, the best approach is to pre-train on a larger sentiment dataset, such as the IMDB movie reviews, and then fine-tune on your dataset.
- However, the model you are using is a simple logistic regression, which does not support pre-training and fine-tuning, so you would have to change your underlying ML model to a deep learning model.
- Add more similar data: If you cannot change the underlying logistic regression model, you can try adding the IMDB dataset to your dataset, training from scratch, and checking whether your model's test performance improves. It might work because IMDB is also a two-class (positive and negative) dataset and looks very similar to yours.
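
For the "add more similar data" option, here is a minimal sketch of how the loading step might change. It assumes that yelp_labelled.txt comes from the UCI Sentiment Labelled Sentences archive, which also ships amazon_cells_labelled.txt and imdb_labelled.txt in the same tab-separated "text, 0/1" layout, and that you have copied those two extra files next to your executable (the file locations are an assumption). The SentimentData class below mirrors the one from the tutorial; if yours differs, use yours. The larger 50,000-review IMDB corpus would first have to be converted to this format, which is not shown here.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Assumed file names: all three files use the same layout,
// one sentence, a tab, then 0 (negative) or 1 (positive).
string[] dataPaths =
{
    "yelp_labelled.txt",
    "amazon_cells_labelled.txt",
    "imdb_labelled.txt"
};

MLContext mlContext = new MLContext();

// Create one loader for the shared schema, then load every file
// into a single IDataView instead of calling LoadFromTextFile once.
TextLoader loader = mlContext.Data.CreateTextLoader<SentimentData>(hasHeader: false);
IDataView dataView = loader.Load(dataPaths);

// Same 80/20 split as in the question; the rest of the pipeline is unchanged.
var splitDataView = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);

// Input schema, mirroring the tutorial's class (assumed, not taken from the question).
public class SentimentData
{
    [LoadColumn(0)]
    public string? SentimentText;

    [LoadColumn(1), ColumnName("Label")]
    public bool Sentiment;
}
```

With this in place, splitDataView.TrainSet and splitDataView.TestSet can be passed to BuildAndTrainModel and Evaluate exactly as before. Whether the extra data fixes inputs like "not bad" is something to check on your own test sentences, since a bag-of-words logistic regression may still struggle with negation.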