How do I access the n-grams produced by FeaturizeText in Microsoft.ML?

Time:02-10

I managed to get a first text analyser running in Microsoft.ML. I would like to retrieve the list of n-grams determined by the model, but I can only get the numerical vectors that "count" occurrences, without knowing which n-gram each value refers to.

Here is the core of my working code so far:

var mlContext = new MLContext();
var articles = SampleData.Articles.Select(a => new TextData{ Text=a }).ToArray();
var dataview = mlContext.Data.LoadFromEnumerable(articles);
var options = new TextFeaturizingEstimator.Options() {
  OutputTokensColumnName = "OutputTokens",
  CaseMode = TextNormalizingEstimator.CaseMode.Lower,
  KeepDiacritics = false,
  KeepPunctuations = false,
  KeepNumbers = false,
  Norm = TextFeaturizingEstimator.NormFunction.L2,
  StopWordsRemoverOptions = new StopWordsRemovingEstimator.Options() {
    Language = TextFeaturizingEstimator.Language.Dutch,
  },
  WordFeatureExtractor = new WordBagEstimator.Options() {
    NgramLength = 4,
    SkipLength = 1,
    UseAllLengths = true,
    MaximumNgramsCount = new int[] { 20, 10, 10, 10 },
    Weighting = NgramExtractingEstimator.WeightingCriteria.TfIdf,
  },
  CharFeatureExtractor = null,
};
var textPipeline = mlContext.Transforms.Text
  .FeaturizeText("Features", options, "Text");
var textTransformer = textPipeline.Fit(dataview);
var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(textTransformer);
foreach (var article in articles)
{
  var prediction = predictionEngine.Predict(article);
  Console.WriteLine($"Article: {article.Text.Substring(0, 30)}...");
  Console.WriteLine($"Number of Features: {prediction.Features.Length}");
  Console.WriteLine($"Features: {string.Join(",", prediction.Features.Take(50).Select(f => f.ToString("0.00")))}\n");
}

CodePudding user response:
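
One possible approach, as far as I know: ML.NET stores the name of each vector slot as a "slot names" annotation on the output column, so the n-grams can be read from the schema of the transformed data rather than from the prediction itself. The sketch below continues from the question's code and assumes that `textTransformer` and `dataview` are the variables defined there and that the output column is called "Features". It relies on the `HasSlotNames` and `GetSlotNames` extension methods from `Microsoft.ML.Data`; the exact text of each name (for example, any prefix or separator the featurizer adds) is not something I have verified for this configuration, so inspect the printed output.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Continues from the question's code: textTransformer and dataview come from there.
var transformedData = textTransformer.Transform(dataview);

// The "Features" column carries one slot name per vector position,
// provided the transform recorded them as an annotation.
var featuresColumn = transformedData.Schema["Features"];
if (featuresColumn.HasSlotNames())
{
    VBuffer<ReadOnlyMemory<char>> slotNames = default;
    featuresColumn.GetSlotNames(ref slotNames);

    // Materialise the names as plain strings, in slot order.
    string[] ngrams = slotNames.DenseValues().Select(s => s.ToString()).ToArray();
    for (int i = 0; i < ngrams.Length; i++)
    {
        Console.WriteLine($"{i}: {ngrams[i]}");
    }
}
else
{
    Console.WriteLine("The Features column has no slot names annotation.");
}
```

The index `i` of each name should line up with the same position in `prediction.Features`, so inside the question's loop you could pair `ngrams[i]` with `prediction.Features[i]` to see which n-gram each weight belongs to.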
