
Given:

x_batch = torch.tensor([[-0.3, -0.7], [0.3, 0.7], [1.1, -0.7], [-1.1, 0.7]])

and then applying torch.sigmoid(x_batch):

tensor([[0.4256, 0.3318],
        [0.5744, 0.6682],
        [0.7503, 0.3318],
        [0.2497, 0.6682]])

gives a completely different result to torch.softmax(x_batch,dim=1):

tensor([[0.5987, 0.4013],
        [0.4013, 0.5987],
        [0.8581, 0.1419],
        [0.1419, 0.8581]])

As per my understanding, isn't the softmax exactly the same as the sigmoid in the binary case?


2 Answers

5

You are misinformed. Sigmoid and softmax are not equal, even in the two-element case.

Consider x = [x1, x2].

sigmoid(x1) = 1 / (1 + exp(-x1))

but

softmax(x1) = exp(x1) / (exp(x1) + exp(x2))
            = 1 / (1 + exp(x2 - x1))
            = 1 / (1 + exp(-(x1 - x2)))
            = sigmoid(x1 - x2)

From the algebra we can see an equivalent relationship is

softmax(x, dim=1) = sigmoid(x - fliplr(x))

or in pytorch

x_softmax = torch.sigmoid(x_batch - torch.flip(x_batch, dims=(1,)))
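
As a quick numerical check (a sketch reusing the x_batch from the question, not part of the original answer), this identity can be confirmed with torch.allclose:

import torch

x_batch = torch.tensor([[-0.3, -0.7], [0.3, 0.7], [1.1, -0.7], [-1.1, 0.7]])

# sigmoid of the difference of logits vs. softmax of the logits themselves
x_softmax = torch.sigmoid(x_batch - torch.flip(x_batch, dims=(1,)))
print(torch.allclose(x_softmax, torch.softmax(x_batch, dim=1)))  # True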
  • According to Bishop (Pattern Recognition): `p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))`, which is equal to `1/(1 + exp(-a))` (the sigmoid). In the multiclass problem it is `p(Ck|x) = p(x|Ck)p(Ck) / sum_j p(x|Cj)p(Cj)`, which, for k = 1 and j = 2, reduces to the sigmoid. – CutePoison Oct 25 '19 at 11:20
  • I don't understand what Bayes theorem has to do with this question, but I doubt Bishop claims that softmax of a vector is identical to applying the sigmoid function to each element of that vector. – jodag Oct 25 '19 at 14:42
  • I am not sure about Bishop, but even Andrew Ng mentions in his deeplearning.ai course that softmax reduces to sigmoid for binary classification. – akshayk07 Oct 27 '19 at 04:54
  • I showed in this answer that softmax is equivalent to sigmoid in a sense. It's equivalent to the sigmoid of the difference of logits, but not the sigmoid of the logits. – jodag Oct 27 '19 at 05:07
1

When it's said that the softmax is a multivariate generalisation of the "sigmoid" (i.e. logistic) function, the scalar logistic function is interpreted as a 2d softmax in which both arguments (x_0, x_1) are shifted by -x_0 (hence the first is fixed at 0), evaluated at the difference x_1 - x_0.

Since the softmax function is translation invariant,¹ this does not affect the output:

The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane. One variable is fixed at 0 (say z_2 = 0), so e^0 = 1, and the other variable can vary, denote it z_1 = x, so

    e^{z_1} / (e^{z_1} + e^{z_2}) = e^x / (e^x + 1),

the standard logistic function, and

    e^{z_2} / (e^{z_1} + e^{z_2}) = 1 / (e^x + 1),

its complement (meaning they add up to 1).

Hence, if you wish to use PyTorch's scalar sigmoid as a 2d softmax function you must manually scale the inputs relative to the first column (x -> x - x_0) and take the complement:

    softmax(x)_0 = 1 - sigmoid(x_1 - x_0)
    softmax(x)_1 = sigmoid(x_1 - x_0)


# Scale values relative to x0
x_batch_scaled = x_batch - x_batch[:,0].unsqueeze(1)

###############################
# The following are equivalent
###############################

# Softmax
torch.softmax(x_batch, dim=1)

# Softmax with all inputs scaled
torch.softmax(x_batch_scaled, dim=1)

# Sigmoid (and complement) with inputs scaled
torch.stack([1 - torch.sigmoid(x_batch_scaled[:,1]), 
             torch.sigmoid(x_batch_scaled[:,1])], dim=1)
tensor([[0.5987, 0.4013],
        [0.4013, 0.5987],
        [0.8581, 0.1419],
        [0.1419, 0.8581]])

tensor([[0.5987, 0.4013],
        [0.4013, 0.5987],
        [0.8581, 0.1419],
        [0.1419, 0.8581]])

tensor([[0.5987, 0.4013],
        [0.4013, 0.5987],
        [0.8581, 0.1419],
        [0.1419, 0.8581]])
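
The agreement of the three formulations can also be checked programmatically (a small sketch reusing x_batch and x_batch_scaled from the code above):

out_softmax = torch.softmax(x_batch, dim=1)
out_scaled  = torch.softmax(x_batch_scaled, dim=1)
out_sigmoid = torch.stack([1 - torch.sigmoid(x_batch_scaled[:, 1]),
                           torch.sigmoid(x_batch_scaled[:, 1])], dim=1)

# all three tensors agree up to floating-point tolerance
print(torch.allclose(out_softmax, out_scaled) and
      torch.allclose(out_softmax, out_sigmoid))  # True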

  1. More generally, softmax is invariant under translation by the same value in each coordinate: adding c = (c, ..., c) to the inputs z yields σ(z + c) = σ(z), because it multiplies each exponent by the same factor, e^c (since e^{z_i + c} = e^{z_i} · e^c), so the ratios do not change:

     σ(z + c)_j = e^{z_j + c} / Σ_{k=1}^{K} e^{z_k + c} = (e^{z_j} · e^c) / (Σ_{k=1}^{K} e^{z_k} · e^c) = σ(z)_j
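
A minimal numeric illustration of this invariance (the tensor z and the constant c = 5.0 below are arbitrary choices, not taken from the answer):

import torch

z = torch.tensor([[-0.3, -0.7], [0.3, 0.7]])
c = 5.0

# adding the same constant to every logit leaves the softmax output unchanged
print(torch.allclose(torch.softmax(z, dim=1), torch.softmax(z + c, dim=1)))  # True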
