
I have this problem that takes about 2.8 seconds in my MBA with Python3. Since at its core we have a caching dictionary, I figure that it doesn't matter which call hits the cache first, and so maybe I can get some gains from threading. I can't quite figure it out, though. This is a bit higher level than the questions I normally ask, but can someone walk me through the parallelization process for this problem?

import time
import threading

even = lambda n: n%2==0
next_collatz = lambda n: n//2 if even(n) else 3*n+1

cache = {1: 1}
def collatz_chain_length(n):
    if n not in cache: cache[n] = 1 + collatz_chain_length(next_collatz(n))
    return cache[n]

if __name__ == '__main__':
    valid = range(1, 1000000)
    for n in valid:
        # t = threading.Thread(target=collatz_chain_length, args=[n] )
        # t.start()
        collatz_chain_length(n)

    print( max(valid, key=cache.get) )

Or, if it is a bad candidate, why?

RodericDay
  • Does this answer your question? [How do I parallelize a simple Python loop?](https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop) – Tomerikoo Feb 04 '21 at 13:30
  • @Tomerikoo The task is more difficult here: there is a shared dictionary-based cache that needs to be updated by all processes, and those cache reads/writes consume most of the computing time. If you use a dict shared between processes, the whole program becomes much slower than the single-core version, so simple parallelization will not help. – Arty Feb 04 '21 at 13:34
  • @Tomerikoo The cache here is needed and used very well. This is the well-known Collatz problem being solved; many numbers are reused along the way, so there are many cache hits. – Arty Feb 04 '21 at 13:36

3 Answers


You won't get a good boost out of threading in Python if your workload is CPU intensive. That's because only one thread will actually be using the processor at a time due to the GIL (Global Interpreter Lock).

However, if your workload were I/O bound (waiting for responses from a network request, for example), threads would give you a bit of a boost: while one thread is blocked waiting for a network response, another can do useful work.
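As a toy sketch of that I/O-bound case (with `time.sleep` standing in for a blocking network call; the GIL is released while a thread is blocked in it):

```python
# Sketch: threads overlap I/O waits. time.sleep stands in for a blocking
# network call; the GIL is released while a thread sleeps, so other
# threads keep running.
import threading
import time

def fake_io_task():
    time.sleep(0.2)  # pretend this is a network request

start = time.time()
threads = [threading.Thread(target=fake_io_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# Ten 0.2-second "requests" finish in roughly 0.2 s total, not 2 s,
# because the waits overlap.
print(f"elapsed: {elapsed:.2f} s")
```

Run the same ten calls sequentially and they take about 2 seconds; with threads the waits overlap. A CPU-bound loop like the Collatz computation gets no such overlap.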

As HDN mentioned, using multiprocessing will help - this uses multiple Python interpreters to get the work done.

The way I would approach this is to divide the number of iterations by the number of processes you plan to create. For example, if you create 4 processes, give each process a 1000000/4 slice of the work.

At the end you will need to aggregate the results of each process and apply your max() to get the result.
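A minimal sketch of that split-and-aggregate idea (names are illustrative, not from the question; `chain_length` is an iterative, per-process-cached rewrite of the question's function, since each worker keeps its own cache and some values get recomputed across processes):

```python
# Sketch: split the range across worker processes, each with a local cache,
# then aggregate the per-chunk winners with max() in the parent.
from multiprocessing import Pool

cache = {1: 1}  # each worker process gets its own copy

def chain_length(n):
    # Iterative version of the cached Collatz chain length: walk down
    # until we hit a cached value, then fill the cache on the way back.
    path = []
    while n not in cache:
        path.append(n)
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    l = cache[n]
    for m in reversed(path):
        l += 1
        cache[m] = l
    return l

def best_in_range(bounds):
    lo, hi = bounds
    return max(range(lo, hi), key=chain_length)

if __name__ == '__main__':
    limit, workers = 1_000_000, 4
    step = limit // workers
    chunks = [(max(1, i * step), (i + 1) * step) for i in range(workers)]
    with Pool(workers) as pool:
        candidates = pool.map(best_in_range, chunks)
    print(max(candidates, key=chain_length))  # 837799
```

Note the trade-off Arty describes below: the caches are per-process, so chains that cross chunk boundaries are partially recomputed in each worker.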

Martin Konecny

Threading won't give you much in terms of performance gains because it can't get around the Global Interpreter Lock, which allows only one thread to execute Python bytecode at any given moment. It might actually even slow you down because of the context switching.

If you want to leverage parallelization for performance purposes in Python, you're going to have to use multiprocessing to actually leverage more than one core at a time.

HDN

I managed to speed up your code 16.5x on a single core; read on for details.

As said before, multi-threading gives no improvement in pure Python because of the Global Interpreter Lock.

Regarding multi-processing, there are two options: 1) implement a shared dictionary and read/write to it directly from different processes, or 2) cut the range of values into parts, solve the task for separate subranges in different processes, then take the maximum over all the processes' answers.

The first option will be very slow, because in your code reading/writing the dictionary is the main time-consuming operation; a dictionary shared between processes is about 5x slower still, so it wipes out any gain from multiple cores.
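To see why, here is a hypothetical micro-benchmark using `multiprocessing.Manager().dict()`, one standard way to share a dict between processes: every access goes through a proxy object and an inter-process round trip, so writes are orders of magnitude slower than on a plain dict.

```python
# Sketch: compare write throughput of a plain dict vs a Manager proxy dict.
# Each proxy operation is an IPC round trip to the manager process.
import time
from multiprocessing import Manager

def time_writes(d, n=10_000):
    start = time.time()
    for i in range(n):
        d[i] = i
    return time.time() - start

t_plain = time_writes({})
with Manager() as mgr:
    t_shared = time_writes(mgr.dict())

print(f"plain dict:   {t_plain:.4f} s")
print(f"manager dict: {t_shared:.4f} s")  # typically orders of magnitude slower
```

So a shared-dict cache turns the cheapest operation in the algorithm into the most expensive one.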

The second option will give some improvement, but not a great one, because different processes will recompute the same values many times. It gives a considerable improvement only if you have very many cores or use many separate machines in a cluster.

I decided to implement another way to improve your task (option 3): use Numba plus other optimizations. My solution is then also suitable for option 2 (parallelization over subranges).

Numba is a just-in-time compiler and optimizer: it translates pure Python code into optimized machine code (via LLVM). Numba can often give a 10x-100x speedup.

To run the code with Numba you just need to pip install numba (at the time of writing, Numba supports Python <= 3.8; support for 3.9 is coming soon too!).

All the improvements together gave a 16.5x speedup on a single core (e.g. where your algorithm took 64 seconds for some range, my code takes 4 seconds).

I had to rewrite your code. The algorithm and idea are the same as yours, but I made the algorithm non-recursive (because Numba doesn't deal well with recursion) and used a list instead of a dictionary for values that are not too large.

My single-core Numba-based version may sometimes use too much memory. That is only because of the cs parameter, which controls the threshold for using a list instead of a dictionary; currently cs is set to stop * 10 (search for this in the code). If you don't have much memory, just set it to e.g. stop * 2 (but not less than stop * 1). I have 16GB of memory and the program runs correctly even for the 64000000 upper limit.

Besides the Numba code I also implemented a C++ solution; it turned out to be the same speed as Numba, which means Numba did a good job! The C++ code is located after the Python code.

I measured timings of your algorithm (solve_py()) and mine (solve_nm()) and compared them. Timings are listed after the code.

Just for reference, I also made a multi-core version using my Numba solution, but it gave no improvement over the single-core version; there was even a slow-down, because the multi-core version computed the same values many times. Maybe a multi-machine version would give a noticeable improvement, but probably not multi-core.

The try-it-online links below only allow running small ranges, because of the limited memory on those free online servers!

Try it online!

import time, numba

def solve_py(start, stop):
    even = lambda n: n%2==0
    next_collatz = lambda n: n//2 if even(n) else 3*n+1

    cache = {1: 1}
    def collatz_chain_length(n):
        if n not in cache: cache[n] = 1 + collatz_chain_length(next_collatz(n))
        return cache[n]

    for n in range(start, stop):
        collatz_chain_length(n)

    r = max(range(start, stop), key = cache.get)
    return r, cache[r]

@numba.njit(cache = True, locals = {'n': numba.int64, 'l': numba.int64, 'zero': numba.int64})
def solve_nm(start, stop):
    zero, l, cs = 0, 0, stop * 10
    ns = [zero] * 10000
    cache_lo = [zero] * cs
    cache_lo[1] = 1
    cache_hi = {zero: zero}
    for n in range(start, stop):
        if cache_lo[n] != 0:
            continue
        nsc = 0
        while True:
            if n < cs:
                cg = cache_lo[n]
            else:
                cg = cache_hi.get(n, zero)
            if cg != 0:
                l = 1 + cg
                break
            ns[nsc] = n
            nsc += 1
            n = (n >> 1) if (n & 1) == 0 else 3 * n + 1
        for i in range(nsc - 1, -1, -1):
            if ns[i] < cs:
                cache_lo[ns[i]] = l
            else:
                cache_hi[ns[i]] = l
            l += 1
    maxn, maxl = 0, 0
    for k in range(start, stop):
        v = cache_lo[k]
        if v > maxl:
            maxn, maxl = k, v
    return maxn, maxl

if __name__ == '__main__':
    solve_nm(1, 100000) # heat-up, precompile numba
    for stop in [1000000, 2000000, 4000000, 8000000, 16000000, 32000000, 64000000]:
        tr, resr = None, None
        for is_nm in [False, True]:
            if stop > 16000000 and not is_nm:
                continue
            tb = time.time()
            res = (solve_nm if is_nm else solve_py)(1, stop)
            te = time.time()
            print(('py', 'nm')[is_nm], 'limit', stop, 'time', round(te - tb, 2), 'secs', end = '')
            if not is_nm:
                resr, tr = res, te - tb
                print(', n', res[0], 'len', res[1])
            else:
                if tr is not None:
                    print(', boost', round(tr / (te - tb), 2))
                    assert resr == res, (resr, res)
                else:
                    print(', n', res[0], 'len', res[1])

Output:

py limit 1000000 time 3.34 secs, n 837799 len 525
nm limit 1000000 time 0.19 secs, boost 17.27
py limit 2000000 time 6.72 secs, n 1723519 len 557
nm limit 2000000 time 0.4 secs, boost 16.76
py limit 4000000 time 13.47 secs, n 3732423 len 597
nm limit 4000000 time 0.83 secs, boost 16.29
py limit 8000000 time 27.32 secs, n 6649279 len 665
nm limit 8000000 time 1.68 secs, boost 16.27
py limit 16000000 time 55.42 secs, n 15733191 len 705
nm limit 16000000 time 3.48 secs, boost 15.93
nm limit 32000000 time 7.38 secs, n 31466382 len 706
nm limit 64000000 time 16.83 secs, n 63728127 len 950

C++ version of same algorithm as Numba is located below:

Try it online!

#include <cstdint>
#include <vector>
#include <unordered_map>
#include <tuple>
#include <iostream>
#include <stdexcept>
#include <chrono>

typedef int64_t i64;

static std::tuple<i64, i64> Solve(i64 start, i64 stop) {
    i64 cs = stop * 10, n = 0, l = 0, nsc = 0;
    std::vector<i64> cache_lo(cs), ns(10000);
    cache_lo[1] = 1;
    std::unordered_map<i64, i64> cache_hi;
    for (i64 i = start; i < stop; ++i) {
        if (cache_lo[i] != 0)
            continue;
        n = i;
        nsc = 0;
        while (true) {
            i64 cg = 0;
            if (n < cs)
                cg = cache_lo[n];
            else {
                auto it = cache_hi.find(n);
                if (it != cache_hi.end())
                    cg = it->second;
            }
            if (cg != 0) {
                l = 1 + cg;
                break;
            }
            ns.at(nsc) = n;
            ++nsc;
            n = (n & 1) ? 3 * n + 1 : (n >> 1);
        }
        for (i64 i = nsc - 1; i >= 0; --i) {
            i64 n = ns[i];
            if (n < cs)
                cache_lo[n] = l;
            else
                cache_hi[n] = l;
            ++l;
        }
    }
    i64 maxn = 0, maxl = 0;
    for (i64 i = start; i < stop; ++i)
        if (cache_lo[i] > maxl) {
            maxn = i;
            maxl = cache_lo[i];
        }
    return std::make_tuple(maxn, maxl);
}

int main() {
    try {
        for (auto stop: std::vector<i64>({1000000, 2000000, 4000000, 8000000, 16000000, 32000000, 64000000})) {
            auto tb = std::chrono::system_clock::now();
            auto r = Solve(1, stop);
            auto te = std::chrono::system_clock::now();
            std::cout << "cpp limit " << stop
                << " time " << double(std::chrono::duration_cast<std::chrono::milliseconds>(te - tb).count()) / 1000.0 << " secs"
                << ", n " << std::get<0>(r) << " len " << std::get<1>(r) << std::endl;
        }
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

cpp limit 1000000 time 0.17 secs, n 837799 len 525
cpp limit 2000000 time 0.357 secs, n 1723519 len 557
cpp limit 4000000 time 0.757 secs, n 3732423 len 597
cpp limit 8000000 time 1.571 secs, n 6649279 len 665
cpp limit 16000000 time 3.275 secs, n 15733191 len 705
cpp limit 32000000 time 7.112 secs, n 31466382 len 706
cpp limit 64000000 time 17.165 secs, n 63728127 len 950
Arty