
I was making a program to predict a hand-written digit from the MNIST data set using the softmax function, and something weird happened. The cost was decreasing over time and eventually became something around 0.0038 (I used softmax_cross_entropy_with_logits() for the cost function). However, the accuracy was as low as 33%. So I thought, "Well, I don't know what happened there, but if I do softmax and cross entropy separately, maybe it will produce a different result!" And boom, accuracy went up to 89%. I have no idea why doing softmax and cross entropy separately makes such a different result. I even looked up this question: difference between tensorflow tf.nn.softmax and tf.nn.softmax_cross_entropy_with_logits

So this is the code where I used softmax_cross_entropy_with_logits() for the cost function (accuracy: 33%):

import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

X = tf.placeholder(shape=[None,784],dtype=tf.float32)
Y = tf.placeholder(shape=[None,10],dtype=tf.float32)

W1= tf.Variable(tf.random_normal([784,20]))
b1= tf.Variable(tf.random_normal([20]))
layer1 = tf.nn.softmax(tf.matmul(X,W1)+b1)

W2 = tf.Variable(tf.random_normal([20,10]))
b2 = tf.Variable(tf.random_normal([10]))

logits = tf.matmul(layer1,W2)+b2
hypothesis = tf.nn.softmax(logits) # just so I can figure out the accuracy

cost_i= tf.nn.softmax_cross_entropy_with_logits(logits=logits,labels=Y)
cost = tf.reduce_mean(cost_i)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)


batch_size  = 100
train_epoch = 25
display_step = 1
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())

    for epoch in range(train_epoch):
        av_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)
        for batch in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            sess.run(optimizer,feed_dict={X:batch_xs,Y:batch_ys})
        av_cost  += sess.run(cost,feed_dict={X:batch_xs,Y:batch_ys})/total_batch
        if epoch % display_step == 0:  # Softmax
            print ("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(av_cost))
    print ("Optimization Finished!")

    correct_prediction = tf.equal(tf.argmax(hypothesis,1),tf.argmax(Y,1))
    accuray = tf.reduce_mean(tf.cast(correct_prediction,'float32'))
    print("Accuracy:",sess.run(accuray,feed_dict={X:mnist.test.images,Y:mnist.test.labels}))

And this is the one where I did softmax and cross entropy separately (accuracy: 89%):

import tensorflow as tf  #89 % accuracy one 
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data", one_hot=True)

X = tf.placeholder(shape=[None,784],dtype=tf.float32)
Y = tf.placeholder(shape=[None,10],dtype=tf.float32)

W1= tf.Variable(tf.random_normal([784,20]))
b1= tf.Variable(tf.random_normal([20]))
layer1 = tf.nn.softmax(tf.matmul(X,W1)+b1)

W2 = tf.Variable(tf.random_normal([20,10]))
b2 = tf.Variable(tf.random_normal([10]))


#logits = tf.matmul(layer1,W2)+b2
#cost_i= tf.nn.softmax_cross_entropy_with_logits(logits=logits,labels=Y)

logits = tf.matmul(layer1,W2)+b2

hypothesis = tf.nn.softmax(logits)
cost = tf.reduce_mean(tf.reduce_sum(-Y*tf.log(hypothesis)))


optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

batch_size  = 100
train_epoch = 25
display_step = 1
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())

    for epoch in range(train_epoch):
        av_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)
        for batch in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            sess.run(optimizer,feed_dict={X:batch_xs,Y:batch_ys})
        av_cost  += sess.run(cost,feed_dict={X:batch_xs,Y:batch_ys})/total_batch
        if epoch % display_step == 0:  # Softmax
            print ("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(av_cost))
    print ("Optimization Finished!")

    correct_prediction = tf.equal(tf.argmax(hypothesis,1),tf.argmax(Y,1))
    accuray = tf.reduce_mean(tf.cast(correct_prediction,'float32'))
    print("Accuracy:",sess.run(accuray,feed_dict={X:mnist.test.images,Y:mnist.test.labels}))
– Kanna Kim

2 Answers


If you use tf.reduce_sum() in the upper example, as you did in the lower one, you should be able to achieve similar results with both methods: cost = tf.reduce_mean(tf.reduce_sum( tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))).
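
For concreteness, here is a minimal standalone sketch (with made-up logits and one-hot labels rather than the network from the question) of how the reductions compare. The per-example values returned by tf.nn.softmax_cross_entropy_with_logits() match -Y*tf.log(softmax) summed over the class axis; the only remaining difference is whether you average or sum over the batch:

import tensorflow as tf
import numpy as np

# Hypothetical toy batch (not from the question): 3 examples, 10 classes.
np.random.seed(0)
logits_np = np.random.randn(3, 10).astype(np.float32)
labels_np = np.eye(10, dtype=np.float32)[[1, 4, 7]]   # one-hot labels

logits = tf.constant(logits_np)
Y = tf.constant(labels_np)

# Per-example cross entropy, shape [3] -- both formulations agree.
ce_builtin = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y)
ce_manual  = tf.reduce_sum(-Y * tf.log(tf.nn.softmax(logits)), axis=1)

cost_mean = tf.reduce_mean(ce_builtin)  # what the first script minimizes (batch mean)
cost_sum  = tf.reduce_sum(ce_builtin)   # what the second script (and the suggested cost above)
                                        # effectively minimizes: a sum over the whole batch

with tf.Session() as sess:
    print(sess.run([ce_builtin, ce_manual]))  # per-example values agree
    print(sess.run([cost_mean, cost_sum]))    # differ by a factor of the batch size

Summing instead of averaging scales the gradient by the batch size, so with the same learning rate the summed cost takes much larger steps, which is likely why it gets so much further within 25 epochs.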

I increased the number of training epochs to 50 and achieved accuracies of 93.06% (tf.nn.softmax_cross_entropy_with_logits()) and 93.24% (softmax and cross entropy separately), so the results are quite similar.

– ml4294
  • It works like a charm! I thought cost_i = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y) and tf.reduce_sum(-Y*tf.log(hypothesis)) were the same thing. – Kanna Kim May 12 '17 at 14:25

From the TensorFlow API documentation, the second way, cost = tf.reduce_mean(tf.reduce_sum(-Y*tf.log(hypothesis))), is numerically unstable, and because of that you can't get the same results.

Anyway, you can find on my GitHub an implementation of a numerically stable cross-entropy loss function which gives the same result as the tf.nn.softmax_cross_entropy_with_logits() function.

You can see that tf.nn.softmax_cross_entropy_with_logits() doesn't compute the softmax normalization of large numbers directly, but only approximates it; more details are in the README section.
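
To illustrate the instability, here is a small sketch using the standard log-sum-exp trick (just an illustration with made-up values, not the code from the linked repository). The naive two-step -Y*tf.log(tf.nn.softmax(logits)) blows up once the probability of the true class underflows to zero, while rewriting the cross entropy as logsumexp(logits) - sum(Y*logits) stays finite and matches the built-in op:

import tensorflow as tf

# Hypothetical example (not from the question) whose true-class probability
# underflows to 0 in float32.
logits = tf.constant([[-1000.0, 10.0, -5.0]])
Y      = tf.constant([[1.0, 0.0, 0.0]])   # one-hot label for class 0

# Naive two-step formulation: softmax(logits)[0] underflows to 0,
# so tf.log() returns -inf and the cost becomes inf.
naive = tf.reduce_sum(-Y * tf.log(tf.nn.softmax(logits)), axis=1)

# Stable formulation via the log-sum-exp identity:
#   -log(softmax(x)[k]) = logsumexp(x) - x[k]
stable = tf.reduce_logsumexp(logits, axis=1) - tf.reduce_sum(Y * logits, axis=1)

# Built-in op, which is implemented in a numerically stable way.
builtin = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y)

with tf.Session() as sess:
    print(sess.run(naive))    # [inf]
    print(sess.run(stable))   # [~1010.] -- finite, the correct cross entropy
    print(sess.run(builtin))  # matches the stable value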

– Ali Abbasi