Questions tagged [fault-tolerance]

Fault tolerance refers to a system's capability to isolate, compensate for, and recover from failure with minimal impact on the end user. When using this tag, also include tags indicating the system and/or technology you are working with (as additional supporting metadata).

276 questions
6
votes
2 answers

Apache Storm: Track tuples by unique ID from Source Spout to Final Bolt

I want a method of uniquely identifying tuples throughout a whole Storm topology, so that each tuple can be tracked from Spout to the final Bolt. The way I understand it is when passing a unique message id with an emit from a spout for example:…
perkss
  • 872
  • 10
  • 30
6
votes
1 answer

How is the detection of terminated nodes in Erlang working? How is net_ticktime influencing the control of node liveness in Erlang?

I set net_ticktime value to 600 seconds. net_kernel:set_net_ticktime(600) In Erlang documentation for net_ticktime = TickTime: Specifies the net_kernel tick time. TickTime is given in seconds. Once every TickTime/4 second, all connected nodes are…
Zuzana
  • 132
  • 6
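
The timing in the excerpt above can be worked through with a little arithmetic. This is a minimal sketch, assuming the classic behaviour described in the Erlang kernel documentation (four ticks per period, and a node considered down somewhere between TickTime - TickTime/4 and TickTime + TickTime/4 seconds of silence). The function name `tick_window` is hypothetical and only illustrates the calculation.

```python
def tick_window(net_ticktime_s):
    """Return (tick interval, earliest detection, latest detection) in seconds.

    Assumes four ticks per net_ticktime period, as the excerpt's
    "once every TickTime/4 second" wording describes.
    """
    tick_interval = net_ticktime_s / 4
    earliest = net_ticktime_s - tick_interval
    latest = net_ticktime_s + tick_interval
    return tick_interval, earliest, latest


interval, earliest, latest = tick_window(600)
print(f"tick every {interval} s, node seen as down after {earliest} to {latest} s")
```

Under these assumptions, a net_ticktime of 600 means an unresponsive node may go undetected for several minutes.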
6
votes
1 answer

Service Stack Redis reconnect after Redis server reboot

We are using Service Stack's RedisClient's BlockingDequeue to persist some data until it can be processed. The calling code looks like using (var client = ClientPool.GetClient()) return…
swestner
  • 1,766
  • 14
  • 17
6
votes
1 answer

Store and forward HTTP requests with retries?

Twilio and other HTTP-driven web services have the concept of a fallback URL, where the web service sends a GET or POST to a URL of your choice if the main URL times out or otherwise fails. In the case of Twilio, they will not retry the request if…
Matt J
  • 39,051
  • 7
  • 44
  • 56
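
One possible shape for the store-and-forward idea in the question above, as a minimal in-memory sketch rather than a production design. The class name `StoreAndForward`, the `send` callable, and the `max_attempts` limit are all hypothetical; a real service would persist the queue and add delays between attempts.

```python
from collections import deque


class StoreAndForward:
    """Hold requests and retry delivery until they succeed or run out of attempts."""

    def __init__(self, send, max_attempts=3):
        self.send = send                  # hypothetical callable that raises on failure
        self.max_attempts = max_attempts
        self.pending = deque()            # (request, attempts so far)
        self.dead = []                    # requests that exhausted their attempts

    def accept(self, request):
        """Store a request for later delivery."""
        self.pending.append((request, 0))

    def flush(self):
        """Try each pending request once; return the ones delivered in this pass."""
        delivered = []
        for _ in range(len(self.pending)):
            request, attempts = self.pending.popleft()
            try:
                self.send(request)
                delivered.append(request)
            except Exception:
                attempts += 1
                if attempts >= self.max_attempts:
                    self.dead.append(request)
                else:
                    self.pending.append((request, attempts))
        return delivered
```

Calling `flush()` on a timer gives a simple retry loop; requests that keep failing end up in `dead` for manual inspection instead of being retried forever.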
6
votes
2 answers

More graceful error handling in C++ library - jsoncpp

I'm not sure if this will be a specific thing with jsoncpp or a general paradigm with how to make a C++ library behave better. Basically I'm getting this trace: imagegeneratormanager.tsk: src/lib_json/json_value.cpp:1176: const Json::Value& …
djechlin
  • 54,898
  • 29
  • 144
  • 264
5
votes
3 answers

Disable tolerance (or enable strictness) in Firefox when rendering HTML

Firefox has a certain tolerance when rendering bad HTML. This means even if a closing tag is left out, the HTML will be displayed as if everything was fine. This tolerance aspect is particularly relevant when one is using JavaScript to manipulate or…
unode
  • 8,181
  • 4
  • 30
  • 44
5
votes
1 answer

Handle Akka actor bounded mailbox MessageQueueAppendFailedException

To avoid OOM, I'm bounding the mailbox size of some of my Akka 1.1.3 actors with a shared custom dispatcher. For example: object Static { val dispatcher = Dispatchers.newExecutorBasedEventDrivenWorkStealingDispatcher( …
Bluu
  • 4,327
  • 4
  • 27
  • 33
5
votes
5 answers

Fail fast finally clause in Java

Is there a way to detect, from within the finally clause, that an exception is in the process of being thrown? See the example below: try { // code that may or may not throw an exception } finally { SomeCleanupFunctionThatThrows(); //…
Greg Rogers
  • 33,366
  • 15
  • 63
  • 93
5
votes
2 answers

Misunderstanding of Spark RDD fault tolerance

Many say: Spark does not replicate data in HDFS. Spark arranges the operations in a DAG. Spark builds RDD lineage. If an RDD is lost, it can be rebuilt with the help of the lineage graph. So there is no need for data replication as the RDDs can be…
5
votes
3 answers

How do I isolate untrusted native code in Java?

I have a C library that I don't trust (in the sense that it might crash frequently). I am calling it from a Java process. To prevent a crash in the C library from bringing the whole Java app down, I figured it would be best if I spawn a…
Enno Shioji
  • 25,422
  • 13
  • 67
  • 104
5
votes
1 answer

How to robustly, but minimally, distribute items across a peer-to-peer system

If one has a peer-to-peer system that can be queried, one would like to: reduce the total number of queries across the network (by distributing "popular" items widely and "similar" items together); avoid excess storage at each node; assure good…
5
votes
2 answers

Error monitoring/handling on webservers

We have a web server that we're about to launch a number of applications onto. They will all share database and memcached servers, but each application has its own MySQL database, and all memcached keys are prefixed per application. Possible…
Industrial
  • 36,181
  • 63
  • 182
  • 286
5
votes
4 answers

Testing fault tolerant code

I’m currently working on a server application where we have agreed to try to maintain a certain level of service. The level of service we want to guarantee is: if a request is accepted by the server and the server sends an acknowledgement to the…
Robert
  • 6,198
  • 2
  • 31
  • 40
5
votes
2 answers

Articles about replication schemes/algorithms?

I'm designing a distributed system with a certain flow of data in it. I'd like to guarantee that at least N nodes have almost-current data at any given time. I do not need complete consistency, only eventual consistency (i.e. for any time instant,…
jkff
  • 16,670
  • 3
  • 46
  • 79
5
votes
3 answers

How to configure fault tolerance programmatically for a spring tasklet (not a chunk)

Programmatically configuring fault tolerance for a chunk works kind of as follows: stepBuilders.get("step") .chunk(1) .reader(reader()) .processor(processor()) .writer(writer()) .listener(logProcessListener()) …