
I have a numpy array that I'm iterating through with:

import numpy
array = numpy.array([[1, 1, 2, 8, 2, 2],
                     [5, 5, 4, 1, 3, 2],
                     [5, 5, 4, 1, 3, 2],
                     [5, 5, 4, 1, 3, 2],
                     [9, 5, 8, 8, 2, 2],
                     [7, 3, 6, 6, 2, 2]])


Pixels = ['U','D','R','L','UL','DL','UR','DR']

for i in range(1, array.shape[0]-1):
    for j in range(1, array.shape[1]-1):

        list = []
        iToMakeList = i
        jToMakeList = j
        while len(list) < 2:

            # Stop when the walk reaches the border of the array.
            if (iToMakeList >= array.shape[0]-1 or iToMakeList < 1 or
                    jToMakeList >= array.shape[1]-1 or jToMakeList < 1):
                break

            PixelCoord = {
                'U':  (iToMakeList-1, jToMakeList),
                'D':  (iToMakeList+1, jToMakeList),
                'R':  (iToMakeList, jToMakeList+1),
                'L':  (iToMakeList, jToMakeList-1),
                'UL': (iToMakeList-1, jToMakeList-1),
                'DL': (iToMakeList+1, jToMakeList-1),
                'UR': (iToMakeList-1, jToMakeList+1),
                'DR': (iToMakeList+1, jToMakeList+1)
            }
            Value = {
                'U':  array[iToMakeList-1][jToMakeList],
                'D':  array[iToMakeList+1][jToMakeList],
                'R':  array[iToMakeList][jToMakeList+1],
                'L':  array[iToMakeList][jToMakeList-1],
                'UL': array[iToMakeList-1][jToMakeList-1],
                'DL': array[iToMakeList+1][jToMakeList-1],
                'UR': array[iToMakeList-1][jToMakeList+1],
                'DR': array[iToMakeList+1][jToMakeList+1]
            }

            candidates = []
            for pixel in Pixels:
                candidates.append((Value[pixel], pixel))

            Lightest = max(candidates)

            list.append(PixelCoord[Lightest[1]])

            iToMakeList = PixelCoord[Lightest[1]][0]
            jToMakeList = PixelCoord[Lightest[1]][1]
I want to accelerate this process. It's very slow.

Assume that the output of this snippet is my final goal and the ONLY thing I want to do is accelerate this code.

user0
    Your code doesn't make sense, since `array,shape[0]` will be a number, not something you can iterate over. Also, how to vectorize it (or whether it is possible) will depend on the "stuff" you are doing inside the loop. – BrenBarn Dec 28 '14 at 00:00
  • 1). to use `for` loop properly, you need to form an `iterable` first, i.e. a `range` from the scalar `array.shape[0]`, in order to loop over it e.g. `np.arange(array.shape[0])`. 2). Please explain whether you are applying an operator on all the `array` elements or modifying them and storing them in some other `numpy` array? – Pacific Stickler Dec 28 '14 at 01:20
  • Presumably `for i in array.shape[0]:` should be `for i in range(array.shape[0]):` (a mistake I've made more than once). – Warren Weckesser Dec 28 '14 at 02:03
  • @PacificStickler, I am modifying them and performing operations on them individually. – user0 Dec 28 '14 at 04:19
  • so please verify if I am on the right track with your code: 1) you are trying to *slide* through your `array` in the `3x3 sub-matrix` fashion, 2) find the location of the `max` in this sub-matrix, and 3) append this value to a `list` until `list` is 100 elements long? So essentially you are doing a 2D convolution operation with a kernel that finds `maximum` of all values? – Pacific Stickler Dec 30 '14 at 17:10
  • Correct. Could you explain more about the 2D convolution and kernel or refer me to a source? If this is a standard technique for which someone has already written an optimized function, then my problem is solved! – user0 Dec 31 '14 at 23:50
  • In your updated code, what is the purpose of setting `i = i-1` and `j = j`? This will not affect the loop iteration at all. – BrenBarn Jan 01 '15 at 20:56
  • It will construct `list` where each element is the lightest pixel surrounding the pixel before it in the list – user0 Jan 01 '15 at 22:26
  • you mean the "brightest" pixel? – Pacific Stickler Jan 02 '15 at 06:08
  • This would be much easier to determine what you need if you gave a sample input and desired output. I don't understand what your goal is. – Joel Jan 03 '15 at 00:09
  • Am I correct that you actually want the `list = []` statement to be inside the `j` `for` loop? Otherwise, I don't see how it's going to get to the next value in the for loops. It'll just start with `i=0`, `j=0` and then walk forward from that point and fill up `list`. It'll never see `j=2` – Joel Jan 03 '15 at 06:39
  • For me this sounds to be a typical GPU problem and you definitely should have a look into CUDA (or OpenCL) to solve this. 2D Convolution is the correct hint, but I can't think of any optimized implementation so far. A basic hint to get you on track of the 3x3 submatrix is the slicing ability of numpy arrays. Together with the numpy.max() function and multiprocessing you should get a pretty nice result. Let me know, if you want to go down that way and I can provide you with the code. – Dschoni Jan 03 '15 at 16:38
  • @Joel, it will fill up the list that has `i=j=0` as its starting point and then move on to `i = 0, j = 1` and create another list and so on. For every `i,j` there will be a new 100 element list – user0 Jan 03 '15 at 23:36
  • Is that meant to be a line tracer? Because if yes, it might be worth looking at the gradient image. – Dschoni Jan 03 '15 at 23:38
  • Yes, it is a kind of line tracer – user0 Jan 03 '15 at 23:41
  • And you want to run that script on a huge amount of starting pixels? Because, at some point you will always fall back to the same lines. So @joel 's answer including memoization definitely is something to keep in mind. – Dschoni Jan 03 '15 at 23:44
  • It's fine if they are the same lines, I just want the process to be faster. – user0 Jan 03 '15 at 23:52
  • If a number is bigger than all of its neighbors, what happens? For example in 1D: if you had 1 3 2 4 3 5 ... You would start at 1, move to 3 and then? stay at 3? move to 2? If you move to 2 it then goes to 4. So this won't find the local maximum. – Joel Jan 03 '15 at 23:53
  • Regarding the same lines... If you reach a line you've already seen, do you still want to process it? If not, then the code I wrote can be sped up even more by stopping as soon as the line reaches an already observed point. – Joel Jan 03 '15 at 23:54
  • It would just oscillate back and forth – user0 Jan 03 '15 at 23:54
  • I'm starting to get the idea. Can you drop me a line at my e-mail address (on my profile)? I guess I am working on something pretty similar. Nevertheless, it would be important to know whether you plan on accelerating that code on the CPU or the GPU. – Dschoni Jan 03 '15 at 23:55
  • I would want to process the line just as the code in the question says – user0 Jan 03 '15 at 23:55
  • Can you edit your post to include the example of first line: 1 1 1 1, second line 1 3 2 4, third line 1 1 1 1, fourth line 4 5 6 7 and what your output list would be at least the first time through? – Joel Jan 03 '15 at 23:56
  • I can't see how Joels 1D example would lead to any kind of oscillation. Can you specify? – Dschoni Jan 03 '15 at 23:57
  • I'm heading off. You should edit your question to put `list = []` in the appropriate place. I'll take a look again later. – Joel Jan 04 '15 at 00:02
  • Actually going back to my example, I'd be more interested in the outcome of first line: 1 1 1 1, second line 1 3 2 4, third line 1 1 1 3, fourth line 4 5 6 7 – Joel Jan 04 '15 at 00:03
  • I will take a look at the code and give a working example with input and output. Appreciate the help – user0 Jan 04 '15 at 00:05
  • Thanks a lot for your edit. This makes things a lot clearer. One question remains though: How do you deal with lines along the edges of the image? So far, the code will just stop. Other ways to deal with this problem is reflection of the array at the edges, wrapping (running out left means running in right and so on) or simply zero-padding. I'll check back later with my detailed answer. – Dschoni Jan 04 '15 at 08:02
  • I just ignore edges. I Will certainly do some research on the methods you describe, but at the moment I am just interested in optimizing the code as is – user0 Jan 04 '15 at 14:39
  • I'm working on it. Approx one hour till result ;) – Dschoni Jan 04 '15 at 16:02
  • `Value` is unnecessary here, since `Value['U']` is the same as `array[PixelCoord['U']]` – Eric Jan 04 '15 at 22:27
  • What the code does: (a) you iterate over every pixel in a 2D image (b) you perform the following operation: (1) you look at the 8-neighbourhood of the current pixel (2) you find the position of the maximum and save it in a list (3) you change the current pixel to the just found maximum (4) you go back to step (1) until the list is of length X (5) you throw away the list and start with the next pixel in the image // Interpretation in words: From each point of the image, you trace a line along the maximum gradient towards the next local maxima. // Would this interpretation be right? – loli Jan 05 '15 at 10:52
  • So, I added my precise and commented answer to what I believe is your problem. I still don't know how you want to handle multiple occurrences of the maximum in a 3x3 subset, so I just assumed taking the first occurrence would be enough. Depending on what you want to do, this could be a false assumption. – Dschoni Jan 05 '15 at 15:07

6 Answers


For your question to make sense to me, I think you need to move where `list = []` appears. Otherwise you'll never even get to i=0, j=1 until `list` is full. I can't imagine that it is slow as currently implemented --- `list` will be full very quickly, and then the for loops should be very fast. Here is what I believe you intended. Please clarify if this is not correct.

for i in range(0, array.shape[0]):
    for j in range(0, array.shape[1]):
        list = []
        while len(list) < 100:
            print "identity", i, j

            # find neighboring entry with greatest value
            # (e.g., assume it is [i-1, j] with value 10)
            list.append((i-1, j))
            i = i-1
            j = j
        # perform operations on list

Let's do some modifications. I'll assume there is a function get_max_nbr(i,j) which returns the coordinates of the maximum neighbor. One of the places your code is slow is that it will call get_max_nbr for the same coordinate many times (at each step in the loop it does it 100 times). The code below uses memoization to get around this (down to 1 time on average). So if this is your bottleneck, this should get you close to 100 times speedup.

maxnbr = {}
for i in range(0,array.shape[0]):
    for j in range (0,array.shape[1]):
        list = []
        current_loc = (i,j)
        while len(list) < 100:
            if current_loc not in maxnbr:  #if this is our first time seeing current_loc
                maxnbr[current_loc] = get_max_nbr(*current_loc) #note func(*(i,j)) becomes func(i,j)
            current_loc = maxnbr[current_loc]
            list.append(current_loc)
        #process list here                 

This doesn't successfully vectorize, but it does create the list (I think) you want, and it should be a significant improvement. It may be that if we knew more about the list processing we might be able to find a better approach, but it's not clear.
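The sketch above assumes a `get_max_nbr(i, j)` helper. For concreteness, here is one possible implementation against the question's array (the offset order and the tie-breaking rule are my assumptions; the original code breaks ties on the direction labels instead):

```python
import numpy as np

array = np.array([[1, 1, 2, 8, 2, 2],
                  [5, 5, 4, 1, 3, 2],
                  [5, 5, 4, 1, 3, 2],
                  [5, 5, 4, 1, 3, 2],
                  [9, 5, 8, 8, 2, 2],
                  [7, 3, 6, 6, 2, 2]])

def get_max_nbr(i, j):
    # Return the coordinates of the brightest of the 8 neighbours of (i, j).
    # Assumes (i, j) is an interior pixel, matching the loop bounds in the
    # question; ties are broken by coordinate order here, which may differ
    # from the original tie-breaking on direction labels.
    offsets = [(-1, 0), (1, 0), (0, 1), (0, -1),
               (-1, -1), (1, -1), (-1, 1), (1, 1)]
    return max((array[i + di, j + dj], (i + di, j + dj))
               for di, dj in offsets)[1]
```

With the memoized loop above, each coordinate's brightest neighbour is then computed only once no matter how many walks pass through it.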

Joel

Very simply, numpy allows element-wise operations on its arrays without having to loop over each of their dimensions.

Say you want to apply a simple operation to each element, e.g. multiplication by the scalar 2. You can do either of the following:

array * 2

or

np.multiply(array, 2)

Depending on the nature of the work inside your loop, you may be able to adapt other vectorization techniques to perform the operation element-wise.
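To make the difference concrete, here is a small, hypothetical comparison of the looped and vectorized forms of the same scalar multiplication (the array shape is chosen arbitrarily):

```python
import numpy as np

arr = np.arange(250000).reshape(500, 500)

# Loop version: one Python-level multiplication per element
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * 2

# Vectorized version: a single operation dispatched to compiled code
vectorized = arr * 2

# Both produce the same result; the vectorized form avoids 250,000
# Python-level loop iterations
assert np.array_equal(looped, vectorized)
```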

Pacific Stickler
  • Your first concern should be to see if you can carry out your calculations using numpy's element-wise operators.
  • If that doesn't work, look at the universal functions (ufuncs) built into numpy.

Both of these are coded in compiled C (or Fortran) and are much faster than looping in Python. Additionally, your code will be shorter and easier to understand.
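For instance, `np.maximum` is a ufunc (my choice of illustration, not tied to the question's code): it combines two arrays element-wise in compiled code, with no Python-level loop:

```python
import numpy as np

a = np.array([1, 5, 2, 8])
b = np.array([4, 3, 6, 7])

# Element-wise maximum, computed in C rather than in a Python loop
c = np.maximum(a, b)
print(c)  # [4 5 6 8]
```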

Additional factors that might affect performance are which compiler was used to compile numpy and which linear algebra library is used (assuming your code uses linear algebra). E.g. the ATLAS libraries are automatically tuned for the machine they are built on. Intel sells a Fortran compiler and math libraries that are supposed to be very fast on Intel processors. IIRC, they also parallelize over all available cores.

If your math libraries don't use multiple cores automatically, using the multiprocessing module might be an option. Assuming the problem can be parallelized, this can reduce the runtime to (almost) 1/N, where N is the number of cores, minus of course the overhead needed to distribute the problem and gather the results.

Alternatively, for problems that can be parallelized, you could use pyCUDA with numpy if you have a NVidia video card.
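As a hedged sketch of the `multiprocessing` route (the row-wise split and the per-row function are illustrative choices, not part of the original answer):

```python
import numpy as np
from multiprocessing import Pool

def row_sum_of_squares(row):
    # Stand-in for a per-row computation; in practice each task must be
    # expensive enough to amortize the inter-process communication overhead.
    return float(np.sum(row ** 2))

if __name__ == "__main__":
    data = np.random.rand(200, 200)
    with Pool() as pool:  # one worker process per core by default
        partial_sums = pool.map(row_sum_of_squares, data)  # one task per row
    total = sum(partial_sums)
    # Matches the serial result up to floating-point rounding
    assert abs(total - float(np.sum(data ** 2))) < 1e-6 * total
```

Chunking by rows keeps the pickling overhead proportional to the data size; for very cheap per-element operations, that overhead usually outweighs the parallel speedup.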

Roland Smith
  • Considering that it's still not clear what @Sam actually wants, this is maybe the best answer there is. I would just add, since the problem is highly image and graph related, `scipy.ndimage` and the basic shortest path algorithms (e.g. A*) might help. – loli Jan 05 '15 at 11:09

If your goal is to find the local maxima in the array, you can use scipy.ndimage.filters.maximum_filter with a 3×3 window and then check for equality:

import numpy
import scipy.ndimage

arr = numpy.array([[1, 1, 2, 8],
                   [5, 5, 4, 1],
                   [9, 5, 8, 8],
                   [7, 3, 6, 6]])
maxima = zip(*(scipy.ndimage.filters.maximum_filter(arr, 3) == arr).nonzero())

The speed of this will depend greatly on whether you really need to use only the first 100 and how many maxima there are. If so, breaking out early will probably be faster. Refining your question with the real meat of what you are doing will help us to get a better solution, though.

chthonicdaemon
  • "Refining your question with the real meat of what you are doing will help us to get a better solution, though." Let me echo this. You'll see that this solution is very different from the solution I gave. Depending on what your end goal is, this may or may not be a vastly better solution than mine. But we can't judge it. – Joel Jan 03 '15 at 10:23

Adding to the already good answers, here is a commented, fast version that collects everything into a list:

import numpy as np
import scipy.ndimage as ndi

# Data generation
data = np.random.randint(100, size=(2000, 2000))
# Maximum extraction using a 3x3 kernel
b = ndi.filters.maximum_filter(data, 3)
# Getting the first 100 entries of b as a 1-D array
max_list = b.flatten()[:100]

In my test, this code took about 0.2 seconds including the data generation on my Intel i7 CPU, and about 3 seconds when the array size is 20000×2000. Timing seems to be no issue here, as I run into memory issues before execution time goes up noticeably.

Nevertheless, for larger amounts of data you can subdivide the array and apply the exact same approach to the smaller subarrays. Keep in mind that at some point the data handling will take more time than the computation itself.

Dschoni
  • It's not at all clear to me that the list b you're creating is at all like the list in the original post. I'm not sure that it isn't what the OP was after, but I don't think it's going to be the same thing. – Joel Jan 03 '15 at 22:57
  • The OP mentioned searching for the neighboring element with the greatest value (aka the brightest pixel in a 3x3 submatrix if I got it right). Of course, especially in image processing, there are other definitions of "neighbour" as well as "maximum" but I admit, that without the OP's comment, this is searching in the dark. The first answer already shows how to access the indices of the maxima by comparing with the list. But as I guess it is an imaging problem, I nevertheless wanted to hint at timing issues compared to memory issues. – Dschoni Jan 03 '15 at 23:36
  • Your max_list will contain only the values of the maxima, repeated many times. See my answer for finding the indices. – chthonicdaemon Jan 04 '15 at 15:59

So this is my parallel approach. First I create a lookup table, where every pixel holds the coordinates of its brightest neighbour. The code runs in about 2 seconds for a 100×100 matrix on my Intel i7 dual-core CPU. So far the code is not optimized; the data handling inside the multiprocessing part is a little weird and can surely be simplified. Just let me know if this is what you want. So far the code only adds the coordinates of the data points to the list; if you need the values instead, change it at the appropriate points, or just parse the resulting `lines[]` list.

import numpy
import multiprocessing as mp
import time
start=time.time()
#Getting the number of CPUs present
num_cpu=mp.cpu_count()
#Creation of random data for testing
data=numpy.random.randint(1,30,size=(200,200))
x,y=data.shape
#Padding is introduced to cope with the border of the dataset.
#Change if you want other behaviour like wrapping, reflection etc.
def pad(data):
    '''Can be made faster by using numpy's pad function,
    if present'''
    a=numpy.zeros((x+2,y+2))
    a[1:-1,1:-1]=data
    return a
data=pad(data)
#Kernel to get only the neighbours, change that if you only want diagonals or other shapes.
kernel=numpy.array([[1,1,1],[1,0,1],[1,1,1]])
result_list=[]  
#Setting up functions for Parallel Processing  
def log_result(result): 
    result_list.append(result) 
def max_getter(pixel):
    '''As this function is going to be used in a parallel processing environment,
    the data has to exist globally in order not to have to pickle it in the subprocess'''
    temp=data[pixel[0]-1:pixel[0]+2,pixel[1]-1:pixel[1]+2].copy()*kernel
    #Getting the submatrix without the central pixel
    compare=numpy.max(temp)==temp
    coords=numpy.nonzero(compare)
    if len(coords[0])!=1:
        coords=(coords[0][0],coords[1][0])
    #discards every maximum which is not the first. Change if you want.
    #Converting back to global coordinates
    return (pixel,(pixel[0]+(numpy.asscalar(coords[0])-1),pixel[1]+(numpy.asscalar(coords[1])-1)))
    #This assumes, that the maximum is unique in the subset, if this is not the case adjust here
def parallell_max():
    pool = mp.Pool() 
#You can experiment using more cores if you have hyperthreading and it's not correctly found by cpu_count
    for i in range(1,x+1):

        for j in range(1,y+1):

            pool.apply_async(max_getter, args = ((i,j),),callback=log_result) 
    pool.close()
    pool.join() 


#___________START Parallel Processing________
if __name__ == '__main__':
   # directions={}
    parallell_max()
    directions={}
    for i in result_list:
        directions[i[0]]=i[1]
    #Directions is a dictionary-like lookup-table, where every pixel gives the next pixel in the line
    lines=[]
    #The following code can also be easily parallelized as seen above.
    for i in range(1,x+1):
        for j in range(1,y+1):
            line=[]
            first,second=i,j
            for k in range(100):
                line.append((first,second))
                first,second=directions[(first,second)]
            lines.append(line)
    stop=time.time()
    print stop-start
Dschoni
  • Just a "bug" I didn't think of when writing this: I assumed that only positive integers are used. If you use arbitrary numbers or floats, some lines have to be adjusted. But the principle should work nevertheless. – Dschoni Jan 06 '15 at 11:43