Background

I have an Akka.NET cluster containing a Lighthouse seed node and two other nodes running actor systems. When I attempt a graceful shutdown on one of my cluster nodes, I want to see that at least one of the other nodes receives a message about the node leaving, and that all cluster nodes eventually exclude the leaving node from the list of nodes.

Once that's been taken care of, I expect I should be able to shut down the node without the two other nodes going nuts about not being able to connect to the node that shut down.

What I've tried

What I have right now is a Console Application wrapped in a TopShelf Application:

class ActorService : ServiceControl
{
    private ActorSystem _actorSystem;

    public bool Start(HostControl hostControl)
    {
        _actorSystem = ActorSystem.Create("myActorSystem");

        var cluster = Cluster.Get(_actorSystem);
        cluster.RegisterOnMemberRemoved(_Terminate);

        return true;
    }

    public bool Stop(HostControl hostControl)
    {
        var cluster = Cluster.Get(_actorSystem);
        cluster.Leave(cluster.SelfAddress);
        return true;
    }

    private void _Terminate()
    {
        _actorSystem.Terminate();
    }
}

Here is my main:

class Program
{
    static int Main(string[] args)
    {
        return (int) HostFactory.Run(x =>
        {
            x.UseAssemblyInfoForServiceInfo();
            x.RunAsLocalSystem();
            x.StartAutomatically();
            x.Service<ActorService>();
            x.EnableServiceRecovery(r => r.RestartService(1));
        });
    }
}

When stepping through the Stop function, I can't see any message about the node leaving being received on the other nodes. When the function returns, however, the other nodes start spouting exceptions.

A user in the Akka.NET Gitter channel said:

I have observed the same thing even without TopShelf I must say, with a pure ASP.NET Core project after the webhost terminated.

Question

What can I add to have the other nodes receive a message about the node leaving?

maxpaj
  • Could you please post the exceptions you're seeing on the other nodes? – Aaronontheweb Jul 11 '16 at 18:58
  • As I am the one having this with the ASP.NET Core webhost, I can see these exceptions: it starts with `Error caught channel` with `System.Net.Sockets.SocketException (0x80004005): The I/O operation has been aborted because of either a thread exit or an application request`. Then two dead letters saying that `Disassociated` messages have not been delivered. Then `Akka.Remote.ShutDownAssociation: Shut down address` with `Akka.Remote.Transport.InvalidAssociationException: The remote system terminated the association because it is shutting down.`, followed by more dead letters. – ZoolWay Jul 12 '16 at 06:49
  • It will continue to produce `InvalidAssociation` because it then endlessly tries to reconnect to the shut-down node, which should have left the cluster. – ZoolWay Jul 12 '16 at 06:50

2 Answers

I think the problem is that the Stop() method completes before the node has finished leaving the cluster. You should wait for the MemberRemoved event.

This Stop() method blocks until the MemberRemoved callback has been called and has signaled that it has also terminated the actor system.

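using System.Threading;
using Akka.Actor;

class Worker
{
    // Signaled once the actor system has fully terminated.
    private static readonly ManualResetEvent asTerminatedEvent = new ManualResetEvent(false);
    private ActorSystem actorSystem;

    public void Start()
    {
        this.actorSystem = ActorSystem.Create("sample");
    }

    public void Stop()
    {
        var cluster = Akka.Cluster.Cluster.Get(actorSystem);

        // Register the callback first, then ask the cluster to remove this node.
        cluster.RegisterOnMemberRemoved(() => MemberRemoved(actorSystem));
        cluster.Leave(cluster.SelfAddress);

        // Block until MemberRemoved has terminated the actor system.
        asTerminatedEvent.WaitOne();
        //log.Info("Actor system terminated, exiting");
    }

    // Runs after this node has been removed from the cluster.
    private async void MemberRemoved(ActorSystem actorSystem)
    {
        await actorSystem.Terminate();
        asTerminatedEvent.Set();
    }
}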

Note: I tested how to leave the cluster without problems for three types of apps, and I have hosted the results on GitHub. There are still some exceptions and a few dead letters when leaving, but the other nodes no longer continuously try to reconnect to the exited node.
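
To check on the other nodes that the departure was actually announced, one option is to subscribe an actor to cluster membership events. The following is only a minimal sketch, not part of the tested samples: the class name ClusterListener is hypothetical, and it assumes the Akka.Cluster package is referenced and the node has already joined the cluster.

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Hypothetical listener actor; the name is illustrative only.
public class ClusterListener : ReceiveActor
{
    private readonly Akka.Cluster.Cluster _cluster = Akka.Cluster.Cluster.Get(Context.System);
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public ClusterListener()
    {
        // Published when the leader has moved a member to the Leaving state.
        Receive<ClusterEvent.MemberLeft>(e =>
            _log.Info("Member is leaving: {0}", e.Member.Address));

        // Published once the member has been removed from the cluster.
        Receive<ClusterEvent.MemberRemoved>(e =>
            _log.Info("Member removed: {0} (previous status: {1})",
                e.Member.Address, e.PreviousStatus));

        // Ignore the remaining membership events (MemberUp, MemberExited, ...).
        Receive<ClusterEvent.IMemberEvent>(_ => { });
    }

    protected override void PreStart()
    {
        // Replay the current cluster state as events, then receive live updates.
        _cluster.Subscribe(
            Self,
            ClusterEvent.SubscriptionInitialStateMode.InitialStateAsEvents,
            new[] { typeof(ClusterEvent.IMemberEvent) });
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}
```

Creating it with something like actorSystem.ActorOf(Props.Create(() => new ClusterListener()), "clusterListener") on each remaining node should log the leaving node's address when Stop() runs on the node being shut down.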

ZoolWay

I wanted to post an update on this thread because we've added a new feature to Akka.NET since the answer above was originally accepted: CoordinatedShutdown.

It does what @ZoolWay's answer does under the hood and more, but to use it, all you have to do is the following:

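using System.Threading.Tasks;
using Akka;
using Akka.Actor;

class Worker
{
    private ActorSystem actorSystem;

    public void Start()
    {
        this.actorSystem = ActorSystem.Create("sample");
    }

    public void Stop()
    {
        // Runs the coordinated shutdown phases for this node and returns a task
        // that completes when they have finished.
        Task<Done> shutdownTask = CoordinatedShutdown.Get(actorSystem).Run(CoordinatedShutdown.ClrExitReason.Instance);

        // Block so the host does not exit before the shutdown has completed.
        shutdownTask.Wait();
    }
}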

This is simpler and can handle more complex cleanup scenarios, such as shutting down Akka.Cluster.Sharding prior to terminating the cluster itself. This is the recommended way of doing things since Akka.NET 1.3.2, I believe.
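
For reference, CoordinatedShutdown is tuned through HOCON. The fragment below is a sketch of settings that tend to matter when the process is hosted by something like TopShelf or an ASP.NET Core webhost. The values shown are assumptions for illustration rather than recommendations, so check them against the reference configuration of the Akka.NET version you have installed.

```hocon
akka {
  coordinated-shutdown {
    # Terminate the ActorSystem as the final shutdown phase.
    terminate-actor-system = on

    # Leave process exit to the host (TopShelf, webhost) rather than Akka.NET.
    exit-clr = off

    # Also run the shutdown when the CLR itself is exiting.
    run-by-clr-shutdown-hook = on

    # Illustrative value: allow more time for the node to leave the cluster.
    phases.cluster-exiting.timeout = 10 s
  }

  # Run the coordinated shutdown if this node gets downed by the cluster.
  cluster.run-coordinated-shutdown-when-down = on
}
```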

Aaronontheweb
  • Am I correct in thinking this solution will broadcast a "stop and drain" to every node in the cluster, essentially shutting every cluster node down? I believe the accepted answer will instruct only the one node to leave the cluster, wait for acknowledgement, then drain the (now disconnected, local) actor system. – Dan Nov 22 '18 at 08:54
  • Regarding "Am I correct in thinking this solution will broadcast a 'stop and drain' to every node in the cluster": no, it only shuts down the current local node. – Aaronontheweb Nov 22 '18 at 19:33