I managed to speed up your code 16.5x on a single core; read on for details.
As said before, multi-threading doesn't give any improvement in pure Python because of the Global Interpreter Lock (GIL).
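As a quick illustration (a toy benchmark with a made-up `count_down` function, not your code), a CPU-bound pure-Python workload does not get faster with threads:

```python
import threading, time

def count_down(n):
    # CPU-bound pure-Python loop; the GIL lets only one thread
    # execute Python bytecode at any moment
    while n > 0:
        n -= 1

N = 5_000_000

t0 = time.time()
count_down(N)
count_down(N)
sequential = time.time() - t0

t0 = time.time()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
threaded = time.time() - t0

# Typically the threaded time is no better (often worse, due to
# lock contention) than the sequential time.
print(f'sequential {sequential:.2f}s, threaded {threaded:.2f}s')
```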
Regarding multi-processing, there are two options: 1) implement a shared dictionary and read/write it directly from different processes; 2) cut the range of values into parts, solve the task for separate sub-ranges on different processes, then just take the maximum of all the processes' answers.
The first option will be very slow, because in your code reading/writing the dictionary is the main time-consuming operation; a dictionary shared between processes slows that down roughly 5x more, cancelling any gain from multiple cores.
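To see why, here is a small sketch (the `bench` helper is my own) comparing a plain dict with a `multiprocessing.Manager` dict, where every single access is an inter-process round-trip through a proxy object:

```python
import time
from multiprocessing import Manager

def bench(d, ops=20000):
    # time `ops` write+read pairs on the given mapping
    t0 = time.time()
    for i in range(ops):
        d[i % 100] = i        # write
        _ = d.get(i % 100)    # read
    return time.time() - t0

if __name__ == '__main__':
    plain = bench({})
    with Manager() as m:
        # m.dict() returns a proxy: each access crosses process boundaries
        shared = bench(m.dict())
    print(f'plain dict {plain:.4f}s, shared dict {shared:.4f}s')
```

On a typical machine the shared dict is orders of magnitude slower per operation than a plain dict.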
The second option will give some improvement, but also not a great one, because different processes will recompute the same values many times. This option gives a considerable improvement only if you have very many cores or use many separate machines in a cluster.
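A minimal sketch of option 2 (the names `solve_mp`/`solve_range`/`chain_len` are my own; the per-process chain function deliberately has no cache, which is exactly the recomputation problem described above):

```python
from multiprocessing import Pool

def chain_len(n):
    # chain length counting n itself and the final 1; no cache, so
    # every worker recomputes values that other workers also hit
    l = 1
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        l += 1
    return l

def solve_range(args):
    # (n, chain length) with the longest chain inside one sub-range
    start, stop = args
    return max(((n, chain_len(n)) for n in range(start, stop)),
               key=lambda t: t[1])

def solve_mp(start, stop, workers=4):
    # split [start, stop) into equal sub-ranges, one per worker process,
    # then take the maximum over the per-sub-range maxima
    step = (stop - start + workers - 1) // workers
    parts = [(s, min(s + step, stop)) for s in range(start, stop, step)]
    with Pool(workers) as pool:
        return max(pool.map(solve_range, parts), key=lambda t: t[1])

if __name__ == '__main__':
    print(solve_mp(1, 100000))
```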
I decided to improve your task in another way (option 3): use Numba and do other optimizations. My solution is then also suitable for option 2 (parallelization over sub-ranges).
Numba is a just-in-time compiler and optimizer: it translates pure Python code to optimized machine code through LLVM. Numba can usually give a 10x-100x speedup.
To run the code with Numba you just need to `pip install numba` (at the time of writing Numba supports Python versions <= 3.8; support for 3.9 is coming soon!).
All the improvements together give a 16.5x speedup on a single core (e.g. if your algorithm takes 64 seconds for some range, mine takes 4 seconds).
I had to rewrite your code; the algorithm and idea are the same as yours, but I made the algorithm non-recursive (Numba doesn't deal well with recursion) and used a list instead of a dictionary for values that are not too large.
My single-core Numba-based version may sometimes use a lot of memory; that is only because of the `cs` parameter, which controls the threshold for using a list instead of a dictionary. Currently `cs` is set to `stop * 10` (search for this in the code); if you don't have much memory, just set it to e.g. `stop * 2` (but not less than `stop * 1`). I have 16 GB of memory and the program runs correctly even for a 64000000 upper limit.
Besides the Numba code I also implemented a C++ solution; it turned out to be the same speed as Numba, which means Numba did a good job! The C++ code is located after the Python code.
I measured the timings of your algorithm (`solve_py()`) and mine (`solve_nm()`) and compared them. The timings are listed after the code.
Just for reference, I also tried a multi-core version using my Numba solution, but it gave no improvement over the single-core version; there was even a slowdown. That happened because the multi-core version computed the same values many times. Maybe a multi-machine version would give a noticeable improvement, but probably not a multi-core one.
The try-it-online links below only allow running small ranges, because of the limited memory on those free online servers!
Try it online!
import time, numba

def solve_py(start, stop):
    even = lambda n: n % 2 == 0
    next_collatz = lambda n: n // 2 if even(n) else 3 * n + 1
    cache = {1: 1}
    def collatz_chain_length(n):
        if n not in cache: cache[n] = 1 + collatz_chain_length(next_collatz(n))
        return cache[n]
    for n in range(start, stop):
        collatz_chain_length(n)
    r = max(range(start, stop), key = cache.get)
    return r, cache[r]

@numba.njit(cache = True, locals = {'n': numba.int64, 'l': numba.int64, 'zero': numba.int64})
def solve_nm(start, stop):
    # cs is the threshold below which a flat list is used instead of a dict
    zero, l, cs = 0, 0, stop * 10
    ns = [zero] * 10000
    cache_lo = [zero] * cs
    cache_lo[1] = 1
    cache_hi = {zero: zero}
    for n in range(start, stop):
        if cache_lo[n] != 0:
            continue
        nsc = 0
        while True:
            if n < cs:
                cg = cache_lo[n]
            else:
                cg = cache_hi.get(n, zero)
            if cg != 0:
                l = 1 + cg
                break
            ns[nsc] = n
            nsc += 1
            n = (n >> 1) if (n & 1) == 0 else 3 * n + 1
        # walk the collected chain backwards, filling in the caches
        for i in range(nsc - 1, -1, -1):
            if ns[i] < cs:
                cache_lo[ns[i]] = l
            else:
                cache_hi[ns[i]] = l
            l += 1
    maxn, maxl = 0, 0
    for k in range(start, stop):
        v = cache_lo[k]
        if v > maxl:
            maxn, maxl = k, v
    return maxn, maxl

if __name__ == '__main__':
    solve_nm(1, 100000)  # warm-up: pre-compile the Numba function
    for stop in [1000000, 2000000, 4000000, 8000000, 16000000, 32000000, 64000000]:
        tr, resr = None, None
        for is_nm in [False, True]:
            if stop > 16000000 and not is_nm:
                continue
            tb = time.time()
            res = (solve_nm if is_nm else solve_py)(1, stop)
            te = time.time()
            print(('py', 'nm')[is_nm], 'limit', stop, 'time', round(te - tb, 2), 'secs', end = '')
            if not is_nm:
                resr, tr = res, te - tb
                print(', n', res[0], 'len', res[1])
            else:
                if tr is not None:
                    print(', boost', round(tr / (te - tb), 2))
                    assert resr == res, (resr, res)
                else:
                    print(', n', res[0], 'len', res[1])
Output:
py limit 1000000 time 3.34 secs, n 837799 len 525
nm limit 1000000 time 0.19 secs, boost 17.27
py limit 2000000 time 6.72 secs, n 1723519 len 557
nm limit 2000000 time 0.4 secs, boost 16.76
py limit 4000000 time 13.47 secs, n 3732423 len 597
nm limit 4000000 time 0.83 secs, boost 16.29
py limit 8000000 time 27.32 secs, n 6649279 len 665
nm limit 8000000 time 1.68 secs, boost 16.27
py limit 16000000 time 55.42 secs, n 15733191 len 705
nm limit 16000000 time 3.48 secs, boost 15.93
nm limit 32000000 time 7.38 secs, n 31466382 len 706
nm limit 64000000 time 16.83 secs, n 63728127 len 950
The C++ version of the same algorithm as the Numba one is below:
Try it online!
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <tuple>
#include <iostream>
#include <stdexcept>
#include <chrono>

typedef int64_t i64;

static std::tuple<i64, i64> Solve(i64 start, i64 stop) {
    // cs is the threshold below which a flat vector is used instead of a map
    i64 cs = stop * 10, n = 0, l = 0, nsc = 0;
    std::vector<i64> cache_lo(cs), ns(10000);
    cache_lo[1] = 1;
    std::unordered_map<i64, i64> cache_hi;
    for (i64 i = start; i < stop; ++i) {
        if (cache_lo[i] != 0)
            continue;
        n = i;
        nsc = 0;
        while (true) {
            i64 cg = 0;
            if (n < cs)
                cg = cache_lo[n];
            else {
                auto it = cache_hi.find(n);
                if (it != cache_hi.end())
                    cg = it->second;
            }
            if (cg != 0) {
                l = 1 + cg;
                break;
            }
            ns.at(nsc) = n;
            ++nsc;
            n = (n & 1) ? 3 * n + 1 : (n >> 1);
        }
        // walk the collected chain backwards, filling in the caches
        for (i64 i = nsc - 1; i >= 0; --i) {
            i64 n = ns[i];
            if (n < cs)
                cache_lo[n] = l;
            else
                cache_hi[n] = l;
            ++l;
        }
    }
    i64 maxn = 0, maxl = 0;
    for (i64 i = start; i < stop; ++i)
        if (cache_lo[i] > maxl) {
            maxn = i;
            maxl = cache_lo[i];
        }
    return std::make_tuple(maxn, maxl);
}

int main() {
    try {
        for (auto stop: std::vector<i64>({1000000, 2000000, 4000000, 8000000, 16000000, 32000000, 64000000})) {
            auto tb = std::chrono::system_clock::now();
            auto r = Solve(1, stop);
            auto te = std::chrono::system_clock::now();
            std::cout << "cpp limit " << stop
                << " time " << double(std::chrono::duration_cast<std::chrono::milliseconds>(te - tb).count()) / 1000.0 << " secs"
                << ", n " << std::get<0>(r) << " len " << std::get<1>(r) << std::endl;
        }
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}
Output:
cpp limit 1000000 time 0.17 secs, n 837799 len 525
cpp limit 2000000 time 0.357 secs, n 1723519 len 557
cpp limit 4000000 time 0.757 secs, n 3732423 len 597
cpp limit 8000000 time 1.571 secs, n 6649279 len 665
cpp limit 16000000 time 3.275 secs, n 15733191 len 705
cpp limit 32000000 time 7.112 secs, n 31466382 len 706
cpp limit 64000000 time 17.165 secs, n 63728127 len 950