Questions tagged [fault-tolerance]

Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, also include tags for the system or technology you are working with, as supporting metadata.

276 questions
1515
votes
23 answers

Compiling an application for use in highly radioactive environments

We are compiling an embedded C++ application that is deployed in a shielded device in an environment bombarded with ionizing radiation. We are using GCC and cross-compiling for ARM. When deployed, our application generates some erroneous data and…
rook
  • 62,960
  • 36
  • 149
  • 231
73
votes
3 answers

Akka Actor not terminating if an exception is thrown

I am currently trying to get started with Akka and I am facing a weird problem. I've got the following code for my Actor: class AkkaWorkerFT extends Actor { def receive = { case Work(n, c) if n < 0 => throw new Exception("Negative number") …
fresskoma
  • 24,302
  • 9
  • 79
  • 122
72
votes
3 answers

Why is C++ template use not recommended in a space/radiated environment?

By reading this question, I understood, for instance, why dynamic allocation or exceptions are not recommended in environments where radiation is high, like in space or in a nuclear power plant. Concerning templates, I don't see why. Could you…
Guillaume D
  • 1,958
  • 2
  • 6
  • 30
39
votes
3 answers

Are Erlang/OTP messages reliable? Can messages be duplicated?

Long version: I'm new to Erlang and am considering using it for a scalable architecture. I've found many proponents of the platform touting its reliability and fault tolerance. However, I'm struggling to understand exactly how fault-tolerance is…
joshng
  • 1,431
  • 14
  • 15
26
votes
4 answers

Scala + Akka: How to develop a Multi-Machine Highly Available Cluster

We're developing a server system in Scala + Akka for a game that will serve clients in Android, iPhone, and Second Life. There are parts of this server that need to be highly available, running on multiple machines. If one of those servers dies…
Unoti
  • 1,255
  • 10
  • 12
20
votes
1 answer

Hystrix: Custom circuit breaker and recovery logic

I just read the Hystrix guide and am trying to wrap my head around how the default circuit breaker and recovery period operate, and then how to customize their behavior. Obviously, if the circuit is tripped, Hystrix will automatically call the…
smeeb
  • 22,487
  • 41
  • 197
  • 389
19
votes
1 answer

Testing with probabilistic failure of components in Akka (Scala)

I've started using Akka with Scala to develop a set of interacting components in a bus-oriented architecture. I need to test the fault-tolerance of the system, and for that I was wondering if there is any way to use a probabilistic model of failure…
Hugo Sereno Ferreira
  • 8,665
  • 6
  • 41
  • 88
16
votes
2 answers

How can I simulate ext3 filesystem corruption?

I would like to simulate filesystem corruption for the purpose of testing how our embedded systems react to it and ultimately have them fail as gracefully as possible. We use different kinds of block device emulated flash storage for data which is…
David Holm
  • 15,666
  • 6
  • 44
  • 47
15
votes
4 answers

Resources about crash-safe and fault-tolerance programming

I like the LWN article "Crash-only software" and I would like to learn more about crash-safe and fault-tolerant programming. It is surprisingly hard to assure that the persistent state is consistent in fault situations. Here I do not even talk about…
dmeister
  • 32,008
  • 19
  • 67
  • 92
15
votes
1 answer

Do I absolutely need a minimum of 3 nodes/servers for a Cassandra cluster or will 2 suffice?

Surely one can run a single node cluster but I'd like some level of fault-tolerance. At present I can afford to lease two servers (8GB RAM, private VLAN @1GigE) but not 3. My understanding is that 3 nodes is the minimum needed for a Cassandra…
z8000
  • 3,655
  • 3
  • 27
  • 36
14
votes
4 answers

How do I automatically re-establish a duplex channel if it gets faulted?

I'm developing a client/server application in .NET 3.5 using WCF. Basically, a long-running client service (on several machines) establishes a duplex connection to the server over a netTcpBinding. The server then uses the callback contract of the…
Jacob
  • 21,087
  • 7
  • 37
  • 55
13
votes
5 answers

Fault tolerant software architecture

I'm looking for some good articles on fault-tolerant software architectures. Could I please have some recommendations?
macleojw
  • 3,937
  • 10
  • 38
  • 60
13
votes
2 answers

Best practices for fault tolerance and reliability of scheduled tasks or services

I have worked on many applications that run as Windows services or scheduled tasks. Now I want to make sure that these applications are fault tolerant and reliable. For example, I have a service that runs every hour. If the service…
DarthVader
  • 46,241
  • 67
  • 190
  • 289
12
votes
4 answers

How is Erlang fault tolerant, or help in that regard?

How is Erlang fault tolerant, or help in that regard?
Blankman
  • 236,778
  • 296
  • 715
  • 1,125
12
votes
1 answer

quartz jobDetail requestRecovery

The documentation for the JobDetail.requestsRecovery property states the following: "Instructs the Scheduler whether or not the Job should be re-executed if a 'recovery' or 'fail-over' situation is encountered." Now, what is a 'recovery' situation or…
user1746050
  • 1,945
  • 4
  • 15
  • 25