How to get Vocabulary with weights for tf-idf word bags in ml.net?

Question

The documentation of ML.NET shows how to use context.Transforms.Text.ProduceWordBags to get word bags. The method takes Transforms.Text.NgramExtractingEstimator.WeightingCriteria as one of the parameters, so it's possible to request TfIdf weights to be used. The simplest example would be:

// Get a small dataset as an IEnumerable and then load it as an ML.NET data view.
var ml = new MLContext();
IEnumerable<SamplesUtils.DatasetUtils.SampleTopicsData> data = SamplesUtils.DatasetUtils.GetTopicsData();
var trainData = ml.Data.LoadFromEnumerable(data);

// The input column name must be a string; "Review" is the text field of SampleTopicsData.
var pipeline = ml.Transforms.Text.ProduceWordBags("bags", "Review", ngramLength: 1, weighting: Transforms.Text.NgramExtractingEstimator.WeightingCriteria.TfIdf);

var transformer = pipeline.Fit(trainData);
var transformed_data = transformer.Transform(trainData);

That's all fine, but how do I get the actual results out of transformed_data?
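For reference, here is a minimal sketch of reading the output column in code rather than through the debugger. It assumes the final bags column carries slot-name annotations, which is what the official ProduceWordBags sample relies on. The TextRow class and the WordBagDump wrapper are hypothetical names used only to keep the example self-contained.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;

public static class WordBagDump
{
    // Hypothetical row type standing in for SampleTopicsData.
    private class TextRow
    {
        public string Review { get; set; }
    }

    public static void Main()
    {
        var ml = new MLContext();
        var trainData = ml.Data.LoadFromEnumerable(new[]
        {
            new TextRow { Review = "animals birds cats dogs fish horse" },
            new TextRow { Review = "horse birds house fish duck cats" },
            new TextRow { Review = "car truck driver bus pickup" },
            new TextRow { Review = "car truck driver bus pickup horse" },
        });

        var pipeline = ml.Transforms.Text.ProduceWordBags(
            "bags", "Review", ngramLength: 1,
            weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf);
        var transformed = pipeline.Fit(trainData).Transform(trainData);

        // Slot names label each position of the output vector with its n-gram
        // (assumption: the visible "bags" column exposes them).
        VBuffer<ReadOnlyMemory<char>> slotNames = default;
        transformed.Schema["bags"].GetSlotNames(ref slotNames);
        string[] vocabulary = slotNames.DenseValues().Select(s => s.ToString()).ToArray();

        // One VBuffer<float> per input row; print each word next to its value.
        foreach (VBuffer<float> row in transformed.GetColumn<VBuffer<float>>("bags"))
        {
            float[] weights = row.DenseValues().ToArray();
            Console.WriteLine(string.Join(", ",
                vocabulary.Select((word, i) => $"{word}={weights[i]}")));
        }
    }
}
```

Printing each word next to its value makes it easy to see whether the numbers are plain counts or actual TfIdf weights.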

I did some digging in a debugger, but I'm still quite confused about what's actually happening here.

First of all, running the pipeline adds three extra columns to transformed_data, all named bags.

After getting a preview of the data, I can see what's in these columns. To make things clearer, here's what GetTopicsData returns, which is what we're running our transform on:

animals birds cats dogs fish horse
horse birds house fish duck cats
car truck driver bus pickup
car truck driver bus pickup horse

That's exactly what I'm seeing in the very first bags column, typed as Vector<string>.

The second bags column is typed as Vector<Key<UInt32, 0-12>> (I have no idea what 0-12 means here).

This one has a KeyValues annotation on it, and it looks like for each row it maps the words to indexes in a global Vocabulary array.

The Vocabulary array is part of Annotations.

So that's promising. You'd think the last bags column, typed as Vector<Single, 13>, would have the weights for each of the words. Unfortunately, that's not what I'm seeing. First of all, the same Vocabulary array is present in its Annotations.

And the values in the rows are all 1 or 0, which is not what TfIdf should return.

So to me that looks more like "is word i from the Vocabulary present in the current row" and not its TfIdf weight, which is what I'm trying to get.
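To show why 1/0 looks wrong, here is the textbook tf-idf definition applied to the four sample rows. This is the standard formula with a natural logarithm; the exact variant ML.NET implements (log base, smoothing, normalization) is an assumption here and not confirmed.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}

% N = 4 rows. "horse" occurs in 3 rows, "car" in 2, "animals" in 1.
\mathrm{idf}(\text{horse})   = \ln\tfrac{4}{3} \approx 0.288
\qquad
\mathrm{idf}(\text{car})     = \ln\tfrac{4}{2} \approx 0.693
\qquad
\mathrm{idf}(\text{animals}) = \ln\tfrac{4}{1} \approx 1.386
```

Every word occurs at most once per row, so tf is 1 wherever a word is present. Under this definition the first row would carry about 0.288 for horse and 1.386 for animals, not 1 for both.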

OK, I think I got it. There is a bug in ML.NET which basically ignores the `weighting` parameter in this scenario and always uses `Tf`. - MarcinJuraszek, Mar 28 '19 at 22:32
