I am trying to teach an agent to play the Atari Space Invaders video game, but my Q values overshoot. I have clipped positive rewards to 1 (the agent also receives -1 for losing a life), so the maximum expected return should be around 36 (maybe I am wrong about this). I have also implemented the Huber loss. I have noticed that once my Q values start overshooting, the agent stops improving (the reward stops increasing).
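
For reference, with rewards clipped to at most +1 per step and GAMMA = 0.99, there is a hard upper bound on any correct Q value:

$$\sum_{t=0}^{\infty} \gamma^t \, r_{\max} = \frac{r_{\max}}{1-\gamma} = \frac{1}{0.01} = 100$$

So even if my estimate of 36 (presumably one clipped +1 per invader in a wave) is off, Q values far above 100 are definitely overestimates.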

Code can be found here

Plots can be found here

Note: I have binarized the frames so that I can use a bigger replay buffer (my replay buffer size is 300,000, which is about 3 times smaller than in the original paper).
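
I pack 8 binary pixels into one byte with numpy.packbits (details in the EDIT below), so a simple round-trip check is enough to rule the packing out as a source of error. A minimal sketch of that check (the random frame is just a placeholder for a real preprocessed frame):

    import numpy as np

    RESOLUTION = 84
    # a binarized frame: every pixel is 0 or 1, stored as uint8
    frame = (np.random.rand(RESOLUTION, RESOLUTION) > 0.5).astype(np.uint8)

    packed = np.packbits(frame)        # 8 pixels per byte -> 84*84/8 = 882 bytes
    restored = np.unpackbits(packed)[:frame.size].reshape(frame.shape)

    assert np.array_equal(frame, restored)  # the packing/unpacking is lossless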

EDIT: I have binarized the frames so that I can use 1 bit (instead of 8 bits) to store one pixel of the image in the replay buffer, using the numpy.packbits function. That way I can use an 8 times bigger replay buffer. I have checked whether the image is distorted after packing it with packbits, and it is NOT, so sampling from the replay buffer works fine. This is the main loop of the code (maybe the problem is in there):

    frame_count = 0
    LIFE_CHECKPOINT = 3
    for episode in range(EPISODE,EPISODES):
        # reset the environment and init variables
        frames, _, _ = space_invaders.resetEnv(NUM_OF_FRAMES)
        state = stackFrames(frames)
        done = False
        episode_reward = 0
        episode_reward_clipped = 0
        frames_buffer = frames # contains preprocessed frames (not stacked)
        while not done:
            if (episode % REPORT_EPISODE_FREQ == 0):
                space_invaders.render()
            # select an action from behaviour policy
            action, Q_value, is_greedy_action = self.EGreedyPolicy(Q, state, epsilon, len(ACTIONS))
            # perform action in the environment
            observation, reward, done, info = space_invaders.step(action)
            episode_reward += reward # update episode reward
            reward, LIFE_CHECKPOINT = self.getCustomReward(reward, info, LIFE_CHECKPOINT)
            episode_reward_clipped += reward
            frame = preprocessFrame(observation, RESOLUTION)
            # pop first frame from the buffer, and add new at the end (s1=[f1,f2,f3,f4], s2=[f2,f3,f4,f5])
            frames_buffer.append(frame) 
            frames_buffer.pop(0)
            new_state = stackFrames(frames_buffer)
            # add (s,a,r,s') tuple to the replay buffer
            replay_buffer.add(packState(state), action, reward, packState(new_state), done)

            state = new_state # new state becomes current state
            frame_count += 1
            if (replay_buffer.size() > MIN_OBSERVATIONS): # if there is enough data in replay buffer
                Q_values.append(Q_value)
                if (frame_count % TRAINING_FREQUENCY == 0):
                    batch = replay_buffer.sample(BATCH_SIZE)
                    loss = Q.train_network(batch, BATCH_SIZE, GAMMA, len(ACTIONS))
                    losses.append(loss)
                    num_of_weight_updates += 1
                if (epsilon > EPSILON_END):
                    epsilon = self.decayEpsilon(epsilon, EPSILON_START, EPSILON_END, FINAL_EXPLORATION_STATE)
            if (num_of_weight_updates % TARGET_NETWORK_UPDATE_FREQ == 0) and (num_of_weight_updates != 0): # update weights of target network
                Q.update_target_network() 
                print("Target_network is updated!")
        episode_rewards.append(episode_reward)
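
For completeness, getCustomReward clips positive rewards to +1 and subtracts 1 when a life is lost (tracked via LIFE_CHECKPOINT). In spirit it is equivalent to the sketch below (the 'ale.lives' info key is an assumption used for illustration, not necessarily the exact implementation):

    def getCustomReward(reward, info, life_checkpoint):
        # clip the game score: any positive reward becomes +1 (sketch, not the exact code)
        clipped = 1.0 if reward > 0 else 0.0
        lives = info.get("ale.lives", life_checkpoint)  # assumed ALE/gym info key
        if lives < life_checkpoint:
            clipped -= 1.0            # -1 penalty for losing a life
            life_checkpoint = lives   # remember the new life count
        return clipped, life_checkpoint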

I have also checked the Q.train_network and Q.update_target_network functions and they work fine.
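
For reference, the targets in Q.train_network follow the standard DQN rule r + (1 - done) * gamma * max_a Q_target(s', a). A simplified sketch of that step (the variable names and the use of Keras predict here are mine, not the exact code):

    import numpy as np

    def compute_targets(q_net, target_net, states, actions, rewards, next_states, dones, gamma):
        # current estimates from the online network, shape (batch_size, num_actions)
        targets = q_net.predict(states)
        # bootstrap values come from the frozen target network
        next_q = target_net.predict(next_states)
        # write r + (1 - done) * gamma * max_a Q_target(s', a) into the taken action only
        targets[np.arange(len(actions)), actions] = (
            rewards + (1.0 - np.asarray(dones, np.float32)) * gamma * next_q.max(axis=1)
        )
        return targets

    # the online network is then fit on (states, targets) with the Huber loss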

I was also wondering if the problem could be in the hyperparameters:

    ACTIONS = {"NOOP": 0, "FIRE": 1, "RIGHT": 2, "LEFT": 3, "RIGHTFIRE": 4, "LEFTFIRE": 5}
    NUM_OF_FRAMES = 4        # number of frames that make up one state
    EPISODES = 10000         # number of episodes
    BUFFER_SIZE = 300000     # size of the replay buffer (cannot use a bigger size, RAM limit)
    MIN_OBSERVATIONS = 30000
    RESOLUTION = 84          # resolution of frames
    BATCH_SIZE = 32
    EPSILON_START = 1        # starting value of the exploration probability
    EPSILON_END = 0.1
    FINAL_EXPLORATION_STATE = 300000  # final frame for which epsilon is decayed
    GAMMA = 0.99             # discount factor
    TARGET_NETWORK_UPDATE_FREQ = 10000
    REPORT_EPISODE_FREQ = 100
    TRAINING_FREQUENCY = 4
    OPTIMIZER = RMSprop(lr=0.00025, rho=0.95, epsilon=0.01)
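
For scale: a linear anneal from EPSILON_START to EPSILON_END over FINAL_EXPLORATION_STATE frames (the schedule from the DQN paper, which is what decayEpsilon is meant to reproduce) decreases epsilon by 3e-06 per step, i.e. roughly:

    def decayEpsilon(epsilon, epsilon_start, epsilon_end, final_exploration_frame):
        # one linear step per call; epsilon reaches epsilon_end after
        # final_exploration_frame calls (sketch of the intended schedule,
        # not necessarily the exact implementation)
        step = (epsilon_start - epsilon_end) / final_exploration_frame
        return max(epsilon_end, epsilon - step)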

  • Which version of double Q-learning do you use? (I didn't look at your code, sorry.) There are many. For instance, you can use Q2 in the TD target of Q1 (I think this was the original version), or use min(Q1, Q2) (as in TD3). And which Q do you use for the greedy policy? You could use a fixed Q (usually Q1), a random one, or the min again. If Q overshoots, it's most likely because of overestimation bias, and having two Qs should already help, so that's weird. – Simon Jun 08 '20 at 20:47
  • I am using Q1 for predicting the greedy action (the Q1 network is constantly updated) and I am calculating targets by the formula r + (1-dones)*gamma*max(Q2(s2, a)). The target network Q2 is updated every 40,000 weight updates (in the original paper the frequency is 10,000 weight updates, but I get worse results with that). – Heisenberg666 Jun 08 '20 at 21:15
  • I was wondering if the bias could be a consequence of the negative rewards the agent receives when it loses a life (so in every episode the expected return is lowered by 3). Maybe Q1 has learnt the expected return lowered by 3, but Q2 has not (which leads to bigger targets, which Q1 then chases). – Heisenberg666 Jun 08 '20 at 21:26
  • Why update Q2 fewer times than Q1 (every 40k steps)? Which paper is this? Using Q2 as the target for Q1 is what the original double Q-learning paper says: https://papers.nips.cc/paper/3964-double-q-learning.pdf There, both networks are updated every step (randomly, either Q1 or Q2). I don't think the negative reward is an issue. To avoid overestimation you could also limit the Q output with a tanh output layer. Since the reward is clipped to [-1, 1], Q max is 1/(1-gamma) and Q min is -1/(1-gamma), so your new Q would be tanh(Q)/(1-gamma). – Simon Jun 08 '20 at 21:33
  • This is the [paper](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf). They use a linear output activation and their Q values do not overshoot. – Heisenberg666 Jun 08 '20 at 21:44
  • Oh yes, the DQN paper. Be sure to also implement all the tricks (I'm not sure they mention everything in the paper). There is the official release https://deepmind.com/research/open-source/dqn but it's version 3.0 and they may do things differently from the paper. You should check it and see if you are doing the same things. – Simon Jun 08 '20 at 22:26
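
If I understand the suggestion in the last comments correctly, bounding the Q output would mean replacing the linear output layer with a scaled tanh, roughly like this (an untested Keras sketch; the convolutional stack follows the Nature DQN architecture):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    GAMMA = 0.99
    NUM_ACTIONS = 6
    Q_BOUND = 1.0 / (1.0 - GAMMA)  # = 100 for rewards clipped to [-1, 1]

    inputs = layers.Input(shape=(84, 84, 4))
    x = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)
    x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    raw_q = layers.Dense(NUM_ACTIONS)(x)                       # unbounded Q estimates
    q = layers.Lambda(lambda t: Q_BOUND * tf.tanh(t))(raw_q)   # squashed into [-100, 100]
    model = models.Model(inputs, q)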
