How to use the watchdog timer in a RTOS?

Question

Assume I have a cooperative scheduler in an embedded environment. I have many processes running. I want to utilize the watchdog timer so that I can detect when a process has stopped behaving for any reason and reset the processor.

In simpler applications with no RTOS I would always touch the watchdog from the main loop and this was always adequate. However, here, there are many processes that could potentially hang. What is a clean method to touch the watchdog timer periodically while ensuring that each process is in good health?

I was thinking that I could provide a callback function to each process so that it could let another function, which oversees all, know it is still alive. The callback would pass a parameter which would be the task's unique ID so the overseer could determine who was calling back.

Are we talking about a watchdog that's part of the RTOS or an actual hardware watchdog timer that the RTOS services? — John U, Nov 06 '12 at 13:09

score 28 · Accepted Answer · answered Nov 04 '12 at 15:19

One common approach is to delegate the watchdog kicking to a specific task (often either the highest-priority or the lowest-priority task; each choice has its own tradeoffs and motivations), and then have all other tasks "check in" with this task.

This way:

if an interrupt is hung (100% CPU), the kicker task won't run, you reset
if the kicker task is hung, you reset
if another task is hung, kicker task sees no check in, kicker task doesn't kick WDG, you reset

Now there are of course implementation details to consider. Some people have each task set its own dedicated bit (atomically) in a global variable; the kicker task checks this group of bit flags at a specific rate, and clears/resets when everyone has checked in (along with kicking the WDG, of course.) I eschew globals like the plague and avoid this approach. RTOS event flags provide a somewhat similar mechanism that is more elegant.
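For reference, the bit-flag variant described above might look roughly like the sketch below. It assumes C11 atomics are available on the target; NUM_TASKS, task_check_in(), kicker_poll() and hw_watchdog_kick() are hypothetical names, and the kick function is only a stand-in for the real hardware access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_TASKS 3u                        /* hypothetical task count */
#define ALL_TASKS ((1u << NUM_TASKS) - 1u)  /* one bit per task */

static atomic_uint checkin_flags;           /* zero-initialized: nobody checked in */
static unsigned hw_kicks;                   /* counts kicks in this sketch */

/* Stand-in for the real hardware watchdog access (assumption). */
static void hw_watchdog_kick(void) { hw_kicks++; }

/* Called by task `id` from its main loop to report that it is alive. */
void task_check_in(unsigned id)
{
    assert(id < NUM_TASKS);
    atomic_fetch_or(&checkin_flags, 1u << id);
}

/* Called periodically by the kicker task. Returns true if it kicked. */
bool kicker_poll(void)
{
    unsigned expected = ALL_TASKS;
    /* Clear the flags and kick only if every task has checked in. */
    if (atomic_compare_exchange_strong(&checkin_flags, &expected, 0u)) {
        hw_watchdog_kick();
        return true;
    }
    return false;
}
```

The kicker's polling period has to be shorter than the hardware timeout and longer than the slowest task's check-in interval, otherwise a healthy system will still reset.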

I typically design my embedded systems as event-driven systems. In this case, each task blocks at one specific place - on a message queue. All tasks (and ISRs) communicate with each other by sending events / messages. This way, you don't have to worry about a task not checking in because it's blocked on a semaphore "way down there" (if that doesn't make sense, sorry, without writing a lot more I can't explain it better).

There is also the question of whether tasks check in "autonomously" or reply to a request from the kicker task. Autonomous: for example, once a second, each task receives an event in its queue saying "tell the kicker task you're still alive". Request-reply: once a second (or whatever), the kicker task tells everybody (via queues) "time to check in", and eventually every task runs its queue, gets the request and replies. Considerations of task priorities, queueing theory, etc. apply.

There are 100 ways to skin this cat, but the basic principle of a single task that is responsible for kicking the WDG and having other tasks funnel up to the kicker task is pretty standard.

There is at least one other aspect to consider, outside the scope of this question, and that's dealing with interrupts. The method I described above will trigger a WDG reset if an ISR is hogging the CPU (good), but what about the opposite scenario: an ISR has (sadly) become inadvertently disabled? In many scenarios this will not be caught, and your system will still kick the WDG even though part of it is crippled. Fun stuff, that's why I love embedded development.

I think your aspect to consider is the downside of synchronizing tasks around only inter-task messaging. If an ISR is critical to operation of a task, that ISR should be sending that task a message (or blocking a semaphore or some other mechanism) which inhibits that task until the ISR runs. If the ISR stalls, then the task can't signal the watchdog task and you catch that event. I'm guessing this is what you were referring to as "way down there" though I'm not sure why you advocate avoiding it. — iheanyi, Dec 09 '20 at 23:02
The other thing here is just because a task does work with an ISR doesn't mean it should be blocked solely by the ISR. For example, a communication protocol task may rely on a UART ISR for byte reception. However, the task should not block waiting on the UART alone. In this case, it should block on both a UART and a timer (or some other event) since correct operation could be that there's nobody around to talk. [1/2] — iheanyi, Dec 09 '20 at 23:10
On the other hand, a task that needs ambient temperature readings which are supposed to come in 5-10X a second via ISR may solely block on that ISR since there isn't a valid condition where the ISR doesn't run. Yes, you can't detect a failed ISR in the communication protocol case, but that's driven by not being able to disambiguate between a correctly running ISR w/ no communications and an ISR that fails to run. [2/2] — iheanyi, Dec 09 '20 at 23:12
@iheanyi -- your first comment -- yes, what I meant is that I tend to write code that blocks in 1 place -- at the top of the task loop -- and not block in other places (e.g. 3 levels deep of function calls in a semaphore that maybe never comes). By blocking in a single place, the task can stay responsive for "new things" while waiting for something to finish. — Dan, Dec 11 '20 at 03:07
@iheanyi - I agree about tasks not blocking only on an ISR, I hope you didn't interpret what I wrote to mean that. What I meant in my discussion of ISRs is that my watchdog approach only catches hung tasks, not hung/dead ISRs. But a timer timeout would still be received and processed. I'm not sure that a timer is always the solution for UART RX, as the communication could be sporadic and unpredictable. If you work in low power, you don't want to wake many times only to find that there is nothing to do each time. I think we are on the same page. — Dan, Dec 11 '20 at 03:12
Thanks for the clarification - we're indeed on the same page. — iheanyi, Dec 11 '20 at 16:23

score 6 · Answer 2 · answered Nov 04 '12 at 16:58

One solution pattern:

Every thread that wishes to be checked explicitly registers its callback with the watchdog thread, which maintains a list of such callbacks.
When the watchdog is scheduled, it iterates over the list of registered callbacks.
Each callback itself is called iteratively until it returns a healthy state.
At the end of the list the hardware watchdog is kicked.

That way any thread that never returns a healthy state will stall the watchdog task until the hardware watchdog timeout occurs.

In a preemptive OS, the watchdog thread would be the lowest priority or idle thread. In a cooperative scheduler, it should yield between call-back calls.
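A rough sketch of this registration pattern for the cooperative case follows. All names here (watchdog_register(), watchdog_pass(), the demo_* helpers) are made up for illustration, the yield and kick hooks stand in for whatever the scheduler and hardware provide, and the fixed-size table is just one possible choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHECKS 8                    /* hypothetical capacity */

typedef bool (*health_cb)(void *ctx);   /* returns true when the task is healthy */

typedef struct {
    health_cb cb;
    void *ctx;
} health_check;

static health_check checks[MAX_CHECKS];
static size_t num_checks;

/* A thread that wants to be monitored registers its callback here. */
bool watchdog_register(health_cb cb, void *ctx)
{
    if (cb == NULL || num_checks == MAX_CHECKS)
        return false;
    checks[num_checks].cb = cb;
    checks[num_checks].ctx = ctx;
    num_checks++;
    return true;
}

/* One pass of the watchdog task: poll each callback until it reports
 * healthy, yielding between calls, then kick the hardware watchdog.
 * A callback that never reports healthy stalls the pass, so the
 * hardware watchdog eventually expires. */
void watchdog_pass(void (*yield)(void), void (*kick)(void))
{
    for (size_t i = 0; i < num_checks; i++) {
        while (!checks[i].cb(checks[i].ctx))
            yield();
        yield();
    }
    kick();
}

/* Demo stand-ins so the sketch can be exercised off-target. */
static unsigned demo_kicks, demo_yields;
static void demo_kick(void)  { demo_kicks++; }
static void demo_yield(void) { demo_yields++; }

/* Reports unhealthy for the first *ctx polls, then healthy. */
static bool demo_cb(void *ctx)
{
    unsigned *remaining = ctx;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

In a real cooperative scheduler the yield hook would hand control back so that the monitored tasks get a chance to run and become healthy again.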

The design of the callback functions themselves depends on the specific task and its behaviour and periodicity. Each function can be tailored to the needs and characteristic of the task. Tasks of high periodicity might simply increment a counter, which is set to zero when the callback is called. If the counter is zero on entry, the task did not schedule since the last watchdog check. Tasks with low or aperiodic behaviour might time-stamp their scheduling, the callback might then return a failure if the task has not been scheduled for some specified time period. Both tasks and interrupt handlers might be monitored in this way. Moreover because it is the responsibility of a thread to register with the watchdog, you might have some threads that do not register at all.
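The two callback styles mentioned above (run counter and time stamp) could be sketched as below. The type and function names are hypothetical, the millisecond tick is assumed to come from the RTOS, and on a real target the read-and-clear of the counter would need to be atomic with respect to the monitored task or ISR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* High-rate task: bumps a counter every time it runs. */
typedef struct {
    volatile uint32_t runs;
} counter_monitor;

void counter_task_ran(counter_monitor *m) { m->runs++; }

/* Healthy if the task ran at least once since the previous check. */
bool counter_check(counter_monitor *m)
{
    bool healthy = (m->runs != 0);
    m->runs = 0;                      /* start a new observation window */
    return healthy;
}

/* Slow or aperiodic task: records when it was last scheduled. */
typedef struct {
    volatile uint32_t last_run_ms;
    uint32_t max_gap_ms;              /* longest silence still considered healthy */
} timestamp_monitor;

void timestamp_task_ran(timestamp_monitor *m, uint32_t now_ms)
{
    m->last_run_ms = now_ms;
}

/* Healthy if the task ran within max_gap_ms. The unsigned subtraction
 * keeps the comparison valid across tick counter wraparound. */
bool timestamp_check(const timestamp_monitor *m, uint32_t now_ms)
{
    return (uint32_t)(now_ms - m->last_run_ms) <= m->max_gap_ms;
}
```

The same two monitors work for interrupt handlers: the ISR calls the "ran" function and the watchdog thread calls the check.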

score 2 · Answer 3 · answered Mar 12 '13 at 16:00

Each task should have its own simulated watchdog. The real watchdog is fed by a high-priority real-time task only if none of the simulated watchdogs has timed out.

For example:

void taskN_handler()
{
    watchdog *wd = watchdog_create(100); /* Create a simulated watchdog with a timeout of 100 ms */
    /* Do init */
    while (taskN_should_run)
    {
        watchdog_feed(wd); /* feed it */
        /* do stuff */
    }
    watchdog_destroy(wd); /* destroy when no longer necessary */
}

void watchdog_task_handler()
{
    int i;
    bool feed_flag = true; /* latched: once cleared it stays false, so the hardware watchdog eventually resets the system */
    while (1)
    {
        /* Check whether any simulated watchdog has timed out */
        for (i = 0; i < getNOfEnabledWatchdogs(); i++)
        {
            if (watchdogHasTimeout(i)) {
                feed_flag = false;
                break;
            }
        }

        if (feed_flag)
            WatchdogFeedTheHardware();

        task_sleep(10);
    }
}

Now one can say that the system is really protected: no freeze, not even a partial one, goes undetected, and for the most part there are no unwanted watchdog triggers.
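The simulated watchdog API used above is not spelled out, so here is one way it might be implemented. This is only a sketch under assumptions: a static pool, a now_ms variable standing in for the RTOS tick, and a hypothetical all_watchdogs_fed() helper for the watchdog task to call. Locking between tasks is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WATCHDOGS 8              /* hypothetical pool size */

typedef struct {
    bool in_use;
    uint32_t timeout_ms;
    uint32_t last_feed_ms;
} watchdog;

static watchdog pool[MAX_WATCHDOGS];
static uint32_t now_ms;              /* stand-in for the RTOS tick (assumption) */

/* Claim a free slot; the new watchdog counts as fed at creation time. */
watchdog *watchdog_create(uint32_t timeout_ms)
{
    for (size_t i = 0; i < MAX_WATCHDOGS; i++) {
        if (!pool[i].in_use) {
            pool[i] = (watchdog){ true, timeout_ms, now_ms };
            return &pool[i];
        }
    }
    return NULL;                     /* pool exhausted */
}

void watchdog_feed(watchdog *wd)    { wd->last_feed_ms = now_ms; }
void watchdog_destroy(watchdog *wd) { wd->in_use = false; }

/* True only if no active simulated watchdog has expired. */
bool all_watchdogs_fed(void)
{
    for (size_t i = 0; i < MAX_WATCHDOGS; i++) {
        if (pool[i].in_use &&
            (uint32_t)(now_ms - pool[i].last_feed_ms) > pool[i].timeout_ms)
            return false;
    }
    return true;
}
```

The watchdog task would feed the hardware only while all_watchdogs_fed() returns true.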

Aki Suihkonen · Answer 4 · 2012-11-05T07:27:57.433

The traditional method is to have a watchdog process with the lowest possible priority:

PROCESS(watchdog, PRIORITY_LOWEST) { while(1){reset_timer(); sleep(1);} }

The actual hardware timer would then reset the CPU after perhaps 3 or 5 seconds without a reset_timer() call.

Tracking individual processes could be achieved by inverse logic: each process would set up a timer whose callback sends the watchdog a 'stop' message. Each process would then need to cancel the previous timer event and set up a new one somewhere in its 'receive event / message from queue' loop.

PROCESS(watchdog, PRIORITY_LOWEST) {
    while(1) {
       // Kick the hardware watchdog only if no 'stop' message has arrived
       if (!messages_in_queue()) reset_timer();
       sleep(1);
    }
}
// Timer callback: tells the watchdog process to stop kicking
void wdg_callback(int event) {
    msg = new Message();
    send(&msg, watchdog);
}
PROCESS(foo, PRIORITY_HIGH) {
     timer event=new Timer(1000, wdg_callback);
     while (1) {
        if (receive(msg, TIMEOUT)) {
           // handle msg
        } else { // TIMEOUT expired
           // Still alive: cancel the pending timer and arm a new one
           cancel_event(event);
           event = new Timer(1000,wdg_callback);
        }
     }
}

But the problem with that approach is that it would only catch a problem if the watchdog process was starved or there was a larger issue with the RTOS. It wouldn't catch a problem with any particular process. Or am I missing something? — user946230, Nov 04 '12 at 14:55
@user946230: no you are right. The issue is addressed in the update. This reduces the number of messages sent. Also one can code the watchdog features inside the 'idle' process, which typically has the PID=0 and simultaneously catch the most typical case of lost messages. — Aki Suihkonen, Nov 05 '12 at 14:26

score 1 · Answer 5 · answered Mar 14 '13 at 15:06

Other answers have covered your question; I would just like to suggest an addition to your old procedure (without an RTOS). Do not kick the watchdog unconditionally from main() alone: an ISR could get stuck while the system carries on working without anyone noticing (the problem Dan mentioned, which also applies to an RTOS).

What I have always done is tie the main loop to the timer interrupt: the timer interrupt counts a variable down until it reaches zero, and the main loop feeds the watchdog only when it finds the variable at zero. After feeding, of course, the variable is set back to its initial value. It's simple: if the variable stops decrementing, you get a reset; if main stops feeding the watchdog, you get a reset.
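In code, the idea might look something like the sketch below. The names and the period of 10 ticks are made up for illustration, and hw_watchdog_feed() is only a stand-in for the real hardware access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEED_PERIOD_TICKS 10u        /* hypothetical: timer ticks between feeds */

static volatile uint32_t countdown = FEED_PERIOD_TICKS;
static unsigned hw_feeds;            /* counts feeds in this sketch */

/* Stand-in for the real hardware watchdog access (assumption). */
static void hw_watchdog_feed(void) { hw_feeds++; }

/* Periodic timer ISR: counts down to zero and stays there. */
void timer_isr(void)
{
    if (countdown > 0)
        countdown--;
}

/* Called on every pass of the main loop. Feeds only when the ISR has
 * counted down to zero, so both the ISR and main must be alive. */
bool main_loop_service_watchdog(void)
{
    if (countdown != 0)
        return false;                /* timer not elapsed, or ISR stopped */
    hw_watchdog_feed();
    countdown = FEED_PERIOD_TICKS;   /* re-arm for the next period */
    return true;
}
```

The hardware watchdog timeout has to be comfortably longer than FEED_PERIOD_TICKS timer periods, otherwise a healthy system resets.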

This concept is easy to apply only to known periodic events, but it is still better than doing everything from main alone. Another benefit is that garbled code is less likely to kick the watchdog just because the feed procedure in main has ended up inside some wild loop.
