
First things first: I've already read several related questions (and some more links from the first one), but none of them worked...

My problem is with opening a large number (~3000) of huge (>80 MB each) FITS files in a Jupyter Notebook. The relevant code snippet is the following:

# Dictionary to store NxN data matrices of cropped image tiles
CroppedObjects = {}

# Define some other variables used here...
# ...

# Iterate over all images ('j') that contain the current object, indexed by 'i'
for i in range(0, len(finalObjects)):
    for j in range(0, len(containingImages[containedObj[i]])):

        countImages += 1

        # Path to the current image: 'mnt/...'
        current_image_path = ImagePaths[int(containingImages[containedObj[i]][j])]

        # Open .fits images
        with fits.open(current_image_path, memmap=False) as hdul:
            # Collect image data
            image_data = fits.getdata(current_image_path)

            # Collect WCS data from the current .fits's header
            ImageWCS = wcs.WCS(hdul[1].header)

            # Cropping parameters:
            # 1. Sky-coordinates of the croppable object
            # 2. Size of the crop, already defined above
            Coordinates = coordinates.SkyCoord(finalObjects[i][1]*u.deg,finalObjects[i][2]*u.deg, frame='fk5')
            size = (cropSize*u.pixel, cropSize*u.pixel)

            try:
                # Cut out the image tile
                cutout = Cutout2D(image_data, position=Coordinates, size=size, wcs=ImageWCS, mode='strict')

                # Write the cutout to a new FITS file
                cutout_filename = "Cropped_Images_Sorted/Cropped_" + str(containedObj[i]) + current_image_path[-23:]

                # Save data to dictionary
                CroppedObjects[cutout_filename] = cutout.data

                foundImages += 1

            except:
                pass

            else:
                del image_data
                continue

        # Memory maintenance
        gc.collect()

        # Progress bar
        sys.stdout.write("\rProgress: [{0}{1}] {2:.3f}%\tElapsed: {3}\tRemaining: {4}  {5}".format(u'\u2588' * int(countImages/allCrops * progressbar_width),
                                                                                                   u'\u2591' * (progressbar_width - int(countImages/allCrops * progressbar_width)),
                                                                                                   countImages/allCrops * 100,
                                                                                                   datetime.now()-starttime,
                                                                                                   (datetime.now()-starttime)/countImages * (allCrops - countImages),
                                                                                                   foundImages))

        sys.stdout.flush()

So OK, it actually does three things:

  1. Opens a particular FITS file
  2. Cuts a square out of it (in 'strict' mode, so if the arrays only partly overlap, an exception is raised and the try statement jumps to the next step in the loop)
  3. Updates the progress bar

Then it goes to the next file, does the same things, and iterates over all of my FITS files.

BUT: If I try running this code, after approximately 1000 found pictures it stops and gives an OSError: [Errno 24] Too many open files on the line:

image_data = fits.getdata(current_image_path)

I tried everything that was supposed to solve the problem, but nothing helped... Not even setting memory mapping to False, or using fits.getdata and gc.collect()... I also tried many minor changes, like running without the try statement and cutting out all of the image tiles without any limitations. The del inside the else statement is another miserable attempt of mine. What else can I try to make this finally work?
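
By the way, here is a little diagnostic sketch I can use to watch the number of open file descriptors while the loop runs (this assumes a Linux system, where /proc/self/fd exists; it's not part of my original code):

import os

def count_open_fds():
    # On Linux, /proc/self/fd contains one entry per open file descriptor
    return len(os.listdir('/proc/self/fd'))

# For example, printed next to the progress bar inside the loop:
# sys.stdout.write("  open fds: {0}".format(count_open_fds()))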
Also, feel free to ask me if something's not clear! I'll try to help you understand the problem!

Balázs Pál

2 Answers


I've had a similar issue in the past (see here). In the end I made it work roughly like this:

from astropy.io import fits

total = 0
for filename in filenames:
    # Keep the with block short: load the data, then let the file close
    with fits.open(filename, memmap=False) as hdulist:
        data = hdulist['spam'].data
    # Work with the data only after the file is closed
    total += data.sum()

Some notes:

  • use fits.open to open the file, with memmap=False
  • use it in a with block, to make the file closing reliable
  • keep the with block short: just load the data you need into memory, then close the file by exiting the block (see the sketch after this list)
  • do what you need to do with the data after the file is closed; that might not really be needed, but if Python references to data in the file are what prevents it from being closed, this simplifies the situation. I don't think the cutout code is the problem in your example, but it could be - try commenting it out?
  • don't do an extra fits.getdata, which I think opens the file again
  • the del and gc.collect shouldn't be needed; if the code is as simple as suggested here, there will be no circular references and Python will reliably delete objects at the end of their scope
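
Applied to your loop, a minimal sketch might look something like this (I'm reusing the names from your question - finalObjects, containingImages, containedObj, ImagePaths, cropSize - and assuming the data you want really is in hdul[1]):

from astropy import coordinates, units as u, wcs
from astropy.io import fits
from astropy.nddata import Cutout2D

for i in range(len(finalObjects)):
    for j in range(len(containingImages[containedObj[i]])):
        current_image_path = ImagePaths[int(containingImages[containedObj[i]][j])]

        # Short with block: read only what is needed, then let the file close
        with fits.open(current_image_path, memmap=False) as hdul:
            image_data = hdul[1].data
            ImageWCS = wcs.WCS(hdul[1].header)

        # Work with the data after the file is closed
        Coordinates = coordinates.SkyCoord(finalObjects[i][1] * u.deg,
                                           finalObjects[i][2] * u.deg, frame='fk5')
        size = (cropSize * u.pixel, cropSize * u.pixel)
        cutout = Cutout2D(image_data, position=Coordinates, size=size,
                          wcs=ImageWCS, mode='strict')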

Now it's possible that this will not help and you'll still have the issue. In that case, the way to proceed is to make a minimal reproducible example that fails for you and that the Astropy devs can run (like I did here), and then to file an issue with Astropy, giving your Python version, Astropy version and operating system - or post it here. The point is: this is complex and likely runtime / version dependent, so to pin it down, an example anyone can run, and that fails for you, is needed.
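
For instance, a skeleton for such a reproducible example could look like this (the file count and array shape are placeholders; the idea is just to mimic the failing loop on synthetic files):

import os
import tempfile
import numpy as np
from astropy.io import fits

# Generate a few thousand small synthetic FITS files
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3000):
    path = os.path.join(tmpdir, "test_{0:04d}.fits".format(i))
    hdul = fits.HDUList([fits.PrimaryHDU(),
                         fits.ImageHDU(np.zeros((16, 16), dtype=np.float32))])
    hdul.writeto(path)
    paths.append(path)

# Mimic the failing loop: open each file, keep a small slice of the data
kept = []
for path in paths:
    with fits.open(path, memmap=False) as hdulist:
        data = hdulist[1].data
    kept.append(data[:4, :4])  # roughly like keeping a cutout view

print("Opened {0} files without an error".format(len(kept)))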

Christoph
  • Yes, so I deleted `del` and `gc.collect()` and changed the `fits.getdata` line to `image_data = hdul[1].data`. Also used `memmap=False` when opening files. I tried some changes, summarized under the comment of @Iguananaut, but they didn't work; I encountered two other errors. I just tried commenting out the cutout code, as you mentioned... It finished without any problems. Now I don't understand... – Balázs Pál Apr 03 '19 at 21:42

This line is what's hurting you:

image_data = fits.getdata(current_image_path)

You've just opened that file on the previous line with memmap=False, but with that line you're re-opening it with memmap=True and holding the file open when you keep a reference to image_data by way of wrapping it in a Cutout2D and then keeping a reference to the data with:

CroppedObjects[cutout_filename] = cutout.data

As far as I know, Cutout2D doesn't necessarily make a copy of the data if it doesn't have to, so you're still effectively holding open a reference to image_data which is mmap'd.
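
A quick way to check this yourself, using the fact that a NumPy view's .base attribute points at the array that owns the memory (a small sketch, not taken from your code):

import numpy as np
from astropy.nddata import Cutout2D

data = np.zeros((10, 10))  # an array that owns its memory

view_cut = Cutout2D(data, position=(5, 5), size=(4, 4))            # copy=False is the default
copy_cut = Cutout2D(data, position=(5, 5), size=(4, 4), copy=True)

print(view_cut.data.base is data)   # True: the cutout keeps `data` alive
print(copy_cut.data.base is None)   # True: an independent copy, `data` can be freed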

Solution: Don't use fits.getdata here. See the warning about this in the docs:

These functions are useful for interactive Python sessions and simple analysis scripts, but should not be used for application code, as they are highly inefficient. For example, each call to getval() requires re-parsing the entire FITS file. Code that makes repeated use of these functions should instead open the file with open() and access the data structures directly.

So in your case you want to replace the line:

image_data = fits.getdata(current_image_path)

with

image_data = hdul[1].data

As @Christoph wrote in his answer, get rid of all the del image_data and gc.collect() stuff, since it's not helping you anyway.

Addendum: From the API docs for Cutout2D:

If False (default), then the cutout data will be a view into the original data array. If True, then the cutout data will hold a copy of the original data array.

So this is stating explicitly (and I confirmed it with a peek at the code) that Cutout2D is just taking a view of the original data array, meaning it's holding on to a reference to it. You could probably avoid this, if you want, by calling Cutout2D(..., copy=True). If you did that, you could probably do away with memmap=False as well. Using mmap may or may not be useful: it depends in part on the sizes of the images and how much physical RAM you have available. In your case, since you are not using the entire images and are just taking cutouts of them, memmap=True might be more efficient, as it may avoid paging the entire image array into memory.

But this also depends on a lot of things, so you might want to do some performance testing of fits.open(..., memmap=False) + Cutout2D(..., copy=False) versus fits.open(..., memmap=True) + Cutout2D(..., copy=True), perhaps with a smaller number of files, as sketched below.
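
A rough timing harness for that comparison might look like the following (time_variant is a hypothetical helper; the HDU index, crop position and size are placeholders, and ImagePaths is your list of files):

import time
from astropy.io import fits
from astropy.nddata import Cutout2D

def time_variant(paths, memmap, copy, position, size):
    # Time one memmap/copy combination over a subset of files
    start = time.perf_counter()
    cutouts = []
    for path in paths:
        with fits.open(path, memmap=memmap) as hdul:
            cutouts.append(Cutout2D(hdul[1].data, position=position,
                                    size=size, copy=copy))
    return time.perf_counter() - start

# sample = ImagePaths[:50]                # a small subset of the real file list
# pos, size = (1024, 2088), (100, 100)    # placeholder crop parameters
# print("memmap=False, copy=False:", time_variant(sample, False, False, pos, size))
# print("memmap=True,  copy=True: ", time_variant(sample, True, True, pos, size))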

Iguananaut
  • Ok, now comes the funny part: First I changed the line to `image_data = hdul[1].data` and set `memmap=False` in `fits.open`, along with `copy=False` in `Cutout2D`. Now it throws an error: `cannot reshape array of size 0 into shape (4176,2048)` for the `image_data = hdul[1].data` line, no matter what I do. Ok, after that I set `memmap=True` and `copy=True`, and after a while the kernel just died... two times in a row... probably would a third time too... – Balázs Pál Apr 03 '19 at 21:07
  • Are you sure `hdul[1]` is the correct HDU for your data? I just guessed that was the case because you were using the header from `hdul[1]` for the WCS. But it could be that you need the data from a different HDU. The one thing that's useful about `fits.getdata()` is that it returns the data array of the first HDU containing a non-empty array (perhaps this would be a useful method to have on `HDUList`). – Iguananaut Apr 04 '19 at 11:12
  • Also does that happen with any file, or just one in particular? It's possible one of your files is corrupt. – Iguananaut Apr 04 '19 at 11:15
  • Yes, I'm totally sure. The truth is that I actually ran this code successfully many times in the past; it's part of a bigger project. I only changed some code in another part of this project, to get some more accurate data. (Actually I just upgraded the code that collects coordinates of deep-sky objects on these pictures; it shouldn't affect anything.) Since then, I've noticed that my Jupyter kernel always dies at 99.X%. Even if I delete the return statement and my progress bar... Maybe it can't jump out of the loop for some reason? I'm really confused at this point... – Balázs Pál Apr 04 '19 at 20:23
  • Also, there was a big server upgrade since I last ran this code. They limited RAM to 1 GB for every user. I really don't know at this point, but maybe this could be related somehow... Literally nothing else has changed since I last ran this code successfully, and my recent code upgrade in another part could not interfere with this particular code. – Balázs Pál Apr 04 '19 at 20:27
  • I don't know what to tell you. What I can tell you is that the answer I gave you is correct in general (disclaimer: I wrote 90% of `astropy.io.fits`). Beyond that, your problem seems very localized. You might have a corrupt file or be running out of memory. Hard to tell. Please see my [general tips on debugging small Python programs](https://stackoverflow.com/questions/52353953/my-python-code-that-converts-numbers-between-bases-has-several-errors-what-coul/52354973#52354973). You can insert a breakpoint anywhere in your code with `import pdb; pdb.set_trace()`. Works in notebooks too. – Iguananaut Apr 05 '19 at 11:13
  • Ok, I figured it out...... It was the `continue` statement in the `try-else`... I changed it to `pass`, and now, along with your changes, it just works fine. Thank you for your detailed comment, it really helped a lot; I wasn't aware of the `copy` argument of `Cutout2D`. That helped me the most. – Balázs Pál Apr 05 '19 at 13:28