
I developed a Python program that does heavy numerical calculations. I run it on a Linux machine with 32 Xeon CPUs, 64GB RAM, and Ubuntu 14.04 (64-bit). I launch multiple Python instances with different model parameters in parallel, so I use multiple processes without having to worry about the global interpreter lock (GIL). When I monitor the CPU utilization using htop, I see that all cores are used, but most of the time by the kernel. Generally, the kernel time is more than twice the user time. I'm afraid that there is a lot of overhead on the system level, but I'm not able to find the cause for it.

How would one reduce the high kernel CPU usage?

Here are some observations I made:

  • The effect is independent of whether I run 10 jobs or 50. If there are fewer jobs than cores, not all cores are used, but the ones that are used still show high kernel CPU usage
  • I implemented the inner loop using numba, but the problem is not related to this, since removing the numba part does not resolve it
  • I also thought it might be related to using python2, similar to the problem mentioned in this SO question, but switching from python2 to python3 did not change much
  • I measured the total number of context switches performed by the OS: about 10,000 per second. I'm not sure whether this is a large number
  • I tried increasing the Python time slices by setting sys.setcheckinterval(10000) (for python2) and sys.setswitchinterval(10) (for python3), but neither helped
  • I tried influencing the task scheduler by running schedtool -B PID, but this didn't help either
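The context-switch rate mentioned above can be measured without extra tools by sampling the kernel's cumulative counter in /proc/stat. A minimal sketch (Linux-only; the 0.5-second sampling interval is an arbitrary choice):

```python
import time

def context_switches():
    # Read the cumulative context-switch counter from /proc/stat (Linux-only).
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("no 'ctxt' line found in /proc/stat")

def switches_per_second(interval=0.5):
    # Sample the counter twice and divide by the elapsed interval.
    before = context_switches()
    time.sleep(interval)
    after = context_switches()
    return (after - before) / interval

print(switches_per_second())
```

Tools like vmstat or pidstat report the same number; this is just a way to log it from inside a running experiment.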

Edit: Here is a screenshot of htop (image omitted).

I also ran perf record -a -g, and this is the report from perf report -g graph:

Samples: 1M of event 'cycles', Event count (approx.): 1114297095227
-  95.25%          python3  [kernel.kallsyms]                           [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 95.01% extract_buf
           extract_entropy_user
           urandom_read
           vfs_read
           sys_read
           system_call_fastpath
           __GI___libc_read
-   2.06%          python3  [kernel.kallsyms]                           [k] sha_transform
   - sha_transform
      - 2.06% extract_buf
           extract_entropy_user
           urandom_read
           vfs_read
           sys_read
           system_call_fastpath
           __GI___libc_read
-   0.74%          python3  [kernel.kallsyms]                           [k] _mix_pool_bytes
   - _mix_pool_bytes
      - 0.74% __mix_pool_bytes
           extract_buf
           extract_entropy_user
           urandom_read
           vfs_read
           sys_read
           system_call_fastpath
           __GI___libc_read
    0.44%          python3  [kernel.kallsyms]                           [k] extract_buf
    0.15%          python3  python3.4                                   [.] 0x000000000004b055
    0.10%          python3  [kernel.kallsyms]                           [k] memset
    0.09%          python3  [kernel.kallsyms]                           [k] copy_user_generic_string
    0.07%          python3  multiarray.cpython-34m-x86_64-linux-gnu.so  [.] 0x00000000000b4134
    0.06%          python3  [kernel.kallsyms]                           [k] _raw_spin_unlock_irqresto
    0.06%          python3  python3.4                                   [.] PyEval_EvalFrameEx

It seems as if most of the time is spent calling _raw_spin_lock_irqsave. I have no idea what this means, though.

David Zwicker
  • Have you checked what else is running on the system? The output from `top` would be interesting. – wallyk May 14 '15 at 19:36
  • I don't have anything out of the ordinary. This is not even the computer I use for my daily work. I added a screenshot to the question. – David Zwicker May 14 '15 at 19:48
  • Please run `perf record -a` from root (chdir to `/tmp`) and then `perf report`. It'll give you a clue what kernel is doing. – myaut May 14 '15 at 19:49
  • Thanks for the hint! I included the first lines of the report in the original question. It seems as if the call to `_raw_spin_lock_irqsave` dominates. I've no idea what this means, but I'll ask google. – David Zwicker May 14 '15 at 19:57
  • 1
    @DavidZwicker: nah, it only means competition for a spinlock in kernel, but we don't know which spinlock. Try to call `perf record -a -g` with `-g` option, it will collect callers of `_raw_spin_lock_irqsave`. See here: http://stackoverflow.com/questions/7031210/linux-perf-how-to-interpret-and-find-hotspots – myaut May 14 '15 at 20:02
  • I ran the commands and as far as I can see the functions seem to be related to some kind of random numbers. I should say that I use random numbers heavily in my numerics. Could it be that all these kernel CPU are just calls to a random number generator? – David Zwicker May 14 '15 at 20:10
  • 1
    Yes, you read `/dev/urandom` way too often. But Python should read it only during initialization of `random` module, or when you explicitly ask for it using `SystemRandom` or `os.urandom`... – myaut May 14 '15 at 20:21
  • Now that we have narrowed down the problem to the random numbers, I'll try to construct a minimal example demonstrating the problem and will then contact the numba/python people. Thanks for your help! – David Zwicker May 14 '15 at 20:23

1 Answer


If the problem exists in the kernel, you should narrow it down using a profiler such as OProfile or perf.

That is, run perf record -a -g and then read the profiling data saved into perf.data using perf report. See also: linux perf: how to interpret and find hotspots.


In your case the high CPU usage is caused by competition for /dev/urandom: it allows only one thread to read from it at a time, but multiple Python processes are doing so.

The Python module random uses it only for initialization:

$ strace python -c 'import random;
while True:
    random.random()'
open("/dev/urandom", O_RDONLY)     = 4
read(4, "\16\36\366\36}"..., 2500) = 2500
close(4)                                   <--- /dev/urandom is closed

You may also explicitly read from /dev/urandom by using os.urandom or the SystemRandom class. So check the parts of your code that deal with random numbers.
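As a hypothetical illustration of the kind of code worth hunting for (the class names are invented for this sketch): re-seeding the global generator in a constructor reads /dev/urandom on every instantiation, whereas a single generator seeded once at import time touches it only once.

```python
import random

# Anti-pattern: random.seed() with no argument pulls fresh bytes from
# /dev/urandom, so every object creation goes through the kernel.
class NoisyModel:
    def __init__(self):
        random.seed()              # kernel read on each instantiation
        self.x = random.random()

# Fix: create (and thereby seed) one generator at module load, then let
# all objects share it; subsequent draws are pure userspace.
_rng = random.Random()             # seeded once, at import time

class QuietModel:
    def __init__(self):
        self.x = _rng.random()     # no syscall involved

print(QuietModel().x)
```

With many parallel processes each constructing thousands of objects, the anti-pattern turns every constructor call into a serialized read of the shared kernel entropy pool, which matches the spinlock contention seen in the perf output above.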

myaut
  • 2
    Thank you very much for your help! Now that I knew what I was looking for, I realized that I indeed had a bug in *my* code base where I would for some reason initialize the random number generator every time a certain base object was created. I was only able to find that part in some old code because I knew exactly what I was looking for! I'll accept your answer as soon as possible. – David Zwicker May 14 '15 at 20:51