
I would like to understand the rationale behind Spark's OneHotEncoder dropping the last category by default.

For example:

>>> from pyspark.ml.feature import StringIndexer, OneHotEncoder
>>> fd = spark.createDataFrame( [(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x","c"])
>>> ss = StringIndexer(inputCol="c",outputCol="c_idx")
>>> ff = ss.fit(fd).transform(fd)
>>> ff.show()
+----+---+-----+
|   x|  c|c_idx|
+----+---+-----+
| 1.0|  a|  0.0|
| 1.5|  a|  0.0|
|10.0|  b|  1.0|
| 3.2|  c|  2.0|
+----+---+-----+

By default, the OneHotEncoder will drop the last category:

>>> oe = OneHotEncoder(inputCol="c_idx",outputCol="c_idx_vec")
>>> fe = oe.transform(ff)
>>> fe.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(2,[0],[1.0])|
| 1.5|  a|  0.0|(2,[0],[1.0])|
|10.0|  b|  1.0|(2,[1],[1.0])|
| 3.2|  c|  2.0|    (2,[],[])|
+----+---+-----+-------------+

Of course, this behavior can be changed:

>>> oe.setDropLast(False)
>>> fl = oe.transform(ff)
>>> fl.show()
+----+---+-----+-------------+
|   x|  c|c_idx|    c_idx_vec|
+----+---+-----+-------------+
| 1.0|  a|  0.0|(3,[0],[1.0])|
| 1.5|  a|  0.0|(3,[0],[1.0])|
|10.0|  b|  1.0|(3,[1],[1.0])|
| 3.2|  c|  2.0|(3,[2],[1.0])|
+----+---+-----+-------------+

Questions:

  • In what case is the default behavior desirable?
  • What issues might be overlooked by blindly calling setDropLast(False)?
  • What do the authors mean by the following statement in the documentation?

The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent.
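
To make the quoted statement concrete, here is a minimal sketch (plain NumPy, not part of the original post): with all categories kept, the one-hot entries sum to one in every row, so together with an intercept column the design matrix becomes rank deficient.

import numpy as np

# One-hot rows for the categories a, a, b, c with dropLast=False
X = np.array([[1.0, 0.0, 0.0],   # a
              [1.0, 0.0, 0.0],   # a
              [0.0, 1.0, 0.0],   # b
              [0.0, 0.0, 1.0]])  # c

print(X.sum(axis=1))                     # [1. 1. 1. 1.] -- every row sums to one
X_int = np.hstack([np.ones((4, 1)), X])  # prepend an intercept column
print(np.linalg.matrix_rank(X_int))      # 3, not 4: columns are linearly dependent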

  • I would recommend searching for literature/articles about the `dummy variable trap` (and linear regression). – Aeck Sep 14 '16 at 22:09
  • @Aeck Thanks! Looks like the dummy variable trap is definitely the answer to this question, if someone cared to write a little about it... – Corey Sep 14 '16 at 22:54
  • @Corey Had a related problem where I was confused by not even knowing that dropping the last category was even a thing. Posted and answered a question about it that includes a bit more about the *dummy variable trap (DVT)* here: https://stackoverflow.com/a/51604166/8236733. But basically, ... dropping the last cat. value is done to avoid a DVT where one input variable can be predicted from the others (e.g. you don't need a one-hot encoding of `[isBoy, isGirl]` when the encoding `[isBoy]` would give the same info). The solution to the DVT is to drop one (not necessarily the last) of the categorical variables; a short sketch follows these comments. – lampShadesDrifter Jul 31 '18 at 01:12
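
As a tiny, hypothetical illustration of the trap described in the last comment (the `is_boy`/`is_girl` names are made up for this sketch):

import numpy as np

is_boy = np.array([1, 0, 1, 0])
is_girl = 1 - is_boy   # fully determined by is_boy: two exhaustive categories

# The second column adds no information, only perfect collinearity
print(np.corrcoef(is_boy, is_girl)[0, 1])  # -1.0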

1 Answer


According to the docs, it is to keep the columns independent:

A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. For example with 5 categories, an input value of 2.0 would map to an output vector of [0.0, 0.0, 1.0, 0.0]. The last category is not included by default (configurable via dropLast) because it makes the vector entries sum up to one, and hence linearly dependent. So an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0]. Note that this is different from scikit-learn's OneHotEncoder, which keeps all categories. The output vectors are sparse.

https://spark.apache.org/docs/1.5.2/api/java/org/apache/spark/ml/feature/OneHotEncoder.html
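
For concreteness, a sketch of the doc's five-category example (it assumes an active SparkSession named `spark` and uses the same transformer-style API as the question; newer Spark versions make OneHotEncoder an estimator, requiring an extra fit() step):

from pyspark.ml.feature import OneHotEncoder

df = spark.createDataFrame([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)], ["idx"])
enc = OneHotEncoder(inputCol="idx", outputCol="vec")  # dropLast=True by default
enc.transform(df).show()
# 2.0 -> (4,[2],[1.0])  i.e. [0.0, 0.0, 1.0, 0.0]
# 4.0 -> (4,[],[])      i.e. [0.0, 0.0, 0.0, 0.0], the dropped last category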

  • Haha, points for being the least lazy and willing to write *something*. In case someone looks up this answer, here's some more info. The categorical features lead to effective intercepts. If you include a general intercept term, then the minimizer could add e.g. 0.5 to the intercept and -0.5 to all categories to get the same value of the cost function. To avoid this degeneracy, remove the intercept and include all categories. – Corey Mar 30 '17 at 17:35
  • Adding to this: for the Scala API, use `.setFitIntercept(false)` on the logistic regression classifier to remove the intercept when including all categories! (See the sketch after these comments for a Python equivalent.) – aMKa Jun 22 '17 at 13:07
  • So you answered a question by quoting the same piece of text that was already in the question. – Jasper-M Apr 02 '18 at 15:20
  • @Corey, I think the main takeaway from this should be "be aware of this". To be on the safe side, I'd suggest considering this in the model selection phase. – Oleg Jan 07 '20 at 08:59
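
Putting the comments above together, a hedged sketch (it assumes the `fl` DataFrame with dropLast=False from the question, and uses LinearRegression for simplicity) of dropping the intercept instead of a category:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assemble the full one-hot vector (all categories kept) into a features column
va = VectorAssembler(inputCols=["c_idx_vec"], outputCol="features")
train = va.transform(fl).withColumnRenamed("x", "label")

# With every category present, remove the intercept to avoid the degeneracy
lr = LinearRegression(featuresCol="features", labelCol="label", fitIntercept=False)
model = lr.fit(train)
print(model.coefficients)  # each coefficient plays the role of a per-category intercept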