
I have an application which polls a bunch of servers every few minutes. On each round it spawns one thread per server (15 servers), and each thread writes its data back to a shared object:

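import requests

# CallThreads is the asker's own helper class (described below); it is
# assumed to be imported or defined elsewhere.

class ServerResults(object):
    def __init__(self):
        self.results = []

    def add_server(self, some_argument):
        # Called from every polling thread
        self.results.append(some_argument)

def poll_server(server, results):
    # Defined before the loop so the name exists when the threads start
    response = requests.get(server, timeout=10)
    results.add_server(response.status_code)

# requests needs a full URL, including the scheme
servers = ['http://1.1.1.1', 'http://1.1.1.2']
results = ServerResults()

for s in servers:
    t = CallThreads(poll_server, s, results)
    t.daemon = True
    t.start()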

The CallThreads class is a helper that calls a function (in this case poll_server()) with arguments (in this case s and results); you can see the source at my GitHub repo of Python utility functions. Most of the time this works fine, but occasionally a thread hangs. I'm not sure why, since I am using a timeout on the GET request. In any case, the hung threads build up over the course of hours or days, and then Python crashes:

  File "/usr/lib/python2.7/threading.py", line 495, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread

Exception in thread Thread-575 (most likely raised during interpreter shutdown)
Exception in thread Thread-1671 (most likely raised during interpreter shutdown)
Exception in thread Thread-831 (most likely raised during interpreter shutdown)

How might I deal with this? There seems to be no way to kill a blocking thread in Python. This application needs to run on a Raspberry Pi, so large libraries such as Twisted won't fit. In fact, I need to get rid of the requests library as well!

dotancohen
  • Firstly, is this on a pi when it hangs, or are you testing elsewhere? There may be platform-specific tools that would help see what the thread is doing, but you haven't specified your platform. – Useless Jul 30 '13 at 12:23
  • Secondly, what is `requests`? Without seeing that, it's impossible to say if there's a race condition in there – Useless Jul 30 '13 at 12:25
  • Thirdly, even if you don't use twisted, synchronous non-blocking I/O in a single thread is much more scalable than this – Useless Jul 30 '13 at 12:27
  • Thank you Useless. On the Raspberry Pi the application runs for a day or two, then crashes. On a Kubuntu desktop it runs for at least a few days, but starts consuming large amounts of memory, on the order of a few GiB (RSS). I added a link to the [Python-Requests](http://www.python-requests.org/) library in the question. – dotancohen Jul 30 '13 at 13:35
  • Killing the threads is a terrible idea (really, it can leave stuff in a bad state). If the process bloats on Kubuntu, you can attach gdb and get a stack trace for each thread - you should be able to confirm what was happening when it hung. If it's blocked in `read` or `recv`, you can close the fd and detach to see if it recovers. – Useless Jul 30 '13 at 14:23
  • Thanks, I'll take a look at gdb. The issue is that only a small portion of threads start hanging (on the order of 1%) so I do need a lot of 'luck' to catch a hung one. Have you any tips in that regard? Thank you! – dotancohen Jul 30 '13 at 15:12
  • Once it's hung, it should stay hung and stick around ... – Useless Jul 30 '13 at 15:18
  • Thanks, I've been googling and found some good info, but I do see that I'll need quite a bit of practice before I become even close to being able to debug the hangs. What does it mean to "close the fd and detach to see if it recovers"? Thank you! – dotancohen Jul 30 '13 at 16:14
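
As an in-process complement to attaching gdb, the interpreter can report where each of its own threads currently is. The following is a minimal sketch, assuming it is acceptable to call a small helper from the main thread (for example from a signal handler or a periodic log line); `format_thread_stacks` is a hypothetical name, not something from the question's code. It shows Python-level frames only, so it will say which Python line a thread is stuck on, not which C call is blocking.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current Python stack of every live thread."""
    # Map thread identifiers to readable names for the report.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    lines = []
    # sys._current_frames() maps each thread id to its topmost frame.
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s)" % (names.get(ident, "unknown"), ident))
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
    return "\n".join(lines)
```

Logging this dump once the thread count grows past the number of servers should show where the leftover threads are sitting, without having to catch a hang at the moment it happens.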

1 Answer


As far as I can tell, a possible scenario is that when a thread "hangs" on a given server, it stays there "forever". Each time you poll your servers, new threads are spawned (_start_new_thread) while the hung ones remain, until Python can no longer start a thread and crashes.
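
One way to keep hung threads from piling up, sketched under assumptions: remember the thread last started for each server, and skip that server while its thread is still alive. `PollScheduler` and `start_round` are hypothetical names, and `poll` stands in for a function like poll_server. This does not unblock a hung thread; it only caps the count at one thread per server.

```python
import threading


class PollScheduler(object):
    """Start at most one polling thread per server at any time."""

    def __init__(self, poll):
        self._poll = poll      # callable taking a single server argument
        self._threads = {}     # server -> most recently started thread

    def start_round(self, servers):
        """Start a poll for each idle server; return the servers skipped."""
        skipped = []
        for server in servers:
            previous = self._threads.get(server)
            if previous is not None and previous.is_alive():
                # The last poll of this server never finished: do not stack
                # another thread on top of it.
                skipped.append(server)
                continue
            thread = threading.Thread(target=self._poll, args=(server,))
            thread.daemon = True
            thread.start()
            self._threads[server] = thread
        return skipped
```

The returned list doubles as a report of which servers currently have a stuck poll, which may help narrow down where the hang comes from.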

Probably not your (main) problem, but you should:

  • use a thread pool - this won't stress the limited resources of your system as much as spawning new threads again and again.
  • check that you use a thread-safe mechanism to handle concurrent access to results. A semaphore or mutex around the critical sections of your code would do; a dedicated data structure such as a queue would probably be better.
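
The two suggestions above can be combined in a small sketch: a fixed set of worker threads takes servers from one queue and puts outcomes on another, so no lock is needed around the results. `run_pool` and its parameters are hypothetical names, and `poll` stands in for whatever function does the actual request. Note the assumption that `poll` eventually returns: a poll that hangs forever would still tie up its worker and block the final join.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7, as used in the question


def run_pool(servers, poll, num_workers=4):
    """Poll every server with a fixed number of threads; return a dict."""
    jobs = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            server = jobs.get()
            if server is None:          # sentinel: no more work
                break
            try:
                results.put((server, poll(server)))
            except Exception as exc:    # record the failure, keep the worker
                results.put((server, exc))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.daemon = True
        thread.start()
    for server in servers:
        jobs.put(server)
    for _ in threads:
        jobs.put(None)                  # one sentinel per worker
    for thread in threads:
        thread.join()

    # All workers have exited, so draining the queue here is race-free.
    collected = {}
    while not results.empty():
        server, outcome = results.get()
        collected[server] = outcome
    return collected
```

With 15 servers, a pool of 4 or 5 workers keeps the thread count constant across polling rounds instead of creating 15 new threads every few minutes.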

As for the "hang" itself: beware that the timeout argument used when opening a URL (urlopen) applies to establishing the connection, not to downloading the actual data. From the urllib2 documentation:

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
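
If the download phase is the part that hangs, one option is to read the body in chunks and enforce an overall deadline yourself. This is a sketch under assumptions: `read_with_deadline` and `DownloadTimeout` are hypothetical names, and `stream` is assumed to be any object with a `read(n)` method, such as the response returned by urlopen. The deadline is only checked between reads, so a single read can still block for as long as the socket-level timeout allows.

```python
import time


class DownloadTimeout(Exception):
    """Raised when the whole download takes longer than allowed."""


def read_with_deadline(stream, total_timeout, chunk_size=8192, clock=time.time):
    """Read stream to the end, giving up after total_timeout seconds."""
    deadline = clock() + total_timeout
    chunks = []
    while True:
        # Checked before every read; a read in progress is not interrupted.
        if clock() > deadline:
            raise DownloadTimeout("download exceeded %s seconds" % total_timeout)
        chunk = stream.read(chunk_size)
        if not chunk:                   # empty read means end of body
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

A call might look like `body = read_with_deadline(urlopen(url, timeout=10), 30)`, where the 10 and 30 second values are placeholders rather than recommendations.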

Hannele
Sylvain Leroux
  • Thank you. Actually, I am aware that the timeout is only for the connection, `requests` does not seem to expose a way to timeout the download. I like the idea of one thread per server, but when the thread hangs, then the server in question will no longer be polled. – dotancohen Jul 30 '13 at 10:23
  • @dotancohen Neither urllib2 nor requests exposes the socket object, I think. So you probably have to call [socket.setdefaulttimeout](http://docs.python.org/2/library/socket.html#socket.setdefaulttimeout) to set the I/O timeout *just before* opening your connection (and probably reset it afterward). Concerning the use of a thread pool, my *guess* was that when a server blocks an incoming request from a thread, it will block all subsequent attempts to connect. But maybe threads hang "at random"? – Sylvain Leroux Jul 30 '13 at 10:33
  • Thank you Sylvain, I'll take a look at the socket timeout. Nice find! The servers do allow subsequent connections, otherwise the following threads would fail to poll them. I really don't know what causes the hangup or even where to begin to debug it. – dotancohen Jul 30 '13 at 10:35
  • @dotancohen "where to begin to debug it": have you tried capturing the network traffic with a tool like WireShark, to see whether the connection was established properly, for example? – Sylvain Leroux Jul 30 '13 at 10:47
  • Actually, I haven't. I was hoping to stay in Python, as the last time I tried to use WireShark (actually, it was Ethereal at the time) it was a complete pain. I do realize that it has matured much since then, though. I was planning on using pdb to see which code is run, and then placing debug print statements and seeing where the trail leaves off. Primitive, I know! I just haven't gotten to it yet, and I won't get to it if I can just kill these threads. A missing poll occasionally would not be a problem. – dotancohen Jul 30 '13 at 11:15
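
The socket.setdefaulttimeout suggestion from the comments above, including the "reset it afterward" part, could be sketched as a context manager. `default_socket_timeout` is a hypothetical helper name. One assumption to state loudly: the default timeout is a process-wide setting, so in a multi-threaded poller any socket created by another thread while the block is active picks up the same value, and overlapping blocks in different threads can restore the wrong one.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the timeout applied to newly created sockets."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the earlier value even if the request raised.
        socket.setdefaulttimeout(previous)
```

Given that caveat, setting the default once at program start-up, before any polling thread is created, may be the simpler choice for this application.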