
I need to use the mnist Python library for downloading and reading MNIST data (PyPI, GitHub):

import mnist
import numpy as np

mnist_dataset = mnist.test_images().astype(np.float32)

On my university cluster I load the data without problem. Locally on my PC, however, I get:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-24-78f59b728818> in <module>
      8 DATASET_SIZE = 512
      9 DIGIT_SIZE = 28
---> 10 mnist_dataset = mnist.test_images().astype(np.float32)
     11 np.random.shuffle(mnist_dataset)
     12 mnist_dataset = np.reshape(mnist_dataset[:DATASET_SIZE] / 255.0, newshape=(DATASET_SIZE, DIGIT_SIZE*DIGIT_SIZE))

~\anaconda3\lib\site-packages\mnist\__init__.py in test_images()
    174         columns of the image
    175     """
--> 176     return download_and_parse_mnist_file('t10k-images-idx3-ubyte.gz')
    177 
    178 

~\anaconda3\lib\site-packages\mnist\__init__.py in download_and_parse_mnist_file(fname, target_dir, force)
    141         Numpy array with the dimensions and the data in the IDX file
    142     """
--> 143     fname = download_file(fname, target_dir=target_dir, force=force)
    144     fopen = gzip.open if os.path.splitext(fname)[1] == '.gz' else open
    145     with fopen(fname, 'rb') as fd:

~\anaconda3\lib\site-packages\mnist\__init__.py in download_file(fname, target_dir, force)
     57     if force or not os.path.isfile(target_fname):
     58         url = urljoin(datasets_url, fname)
---> 59         urlretrieve(url, target_fname)
     60 
     61     return target_fname

~\anaconda3\lib\urllib\request.py in urlretrieve(url, filename, reporthook, data)
    245     url_type, path = splittype(url)
    246 
--> 247     with contextlib.closing(urlopen(url, data)) as fp:
    248         headers = fp.info()
    249 

~\anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

~\anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response

~\anaconda3\lib\urllib\request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642 
    643         return response

~\anaconda3\lib\urllib\request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes

~\anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

~\anaconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

I have inspected the functions through the inspect module. The HTTP address called by both the local and cluster versions is the same, i.e. http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz. I can download it without problems from my web browser.

What can I do about this in Python?

desertnaut
qalis
  • The error can mean there was a problem downloading the file - maybe it waited too long for data, the connection is too slow, or the file is large and the download can't reconnect and resume after the connection drops. But I think there should be a function to use an already downloaded file, OR you could uncompress the .gz into the correct folder and skip downloading it. – furas Mar 05 '21 at 15:40
  • This module is 4 years old and maybe it uses wrong URLs. Or the server changed some settings and now blocks connections from bots/scripts/spammers/hackers for security reasons. – furas Mar 05 '21 at 15:49
  • Searching with `mnist` and the error message, I found a related SO Q&A: [HTTPError: HTTP Error 403: Forbidden on Google Colab](https://stackoverflow.com/questions/60538059/httperror-http-error-403-forbidden-on-google-colab) - following that and others, it looks like you will probably not be able to use the mnist package without tweaking the calls that download and unzip the data. – wwii Mar 05 '21 at 15:56

3 Answers


This module is very old and archived.

I suspected that the server may now use a new security system, and that the code may need some settings - like a User-Agent header - to access the data correctly.

Using the suggestion from @wwii's comment, I downloaded the source code and added a User-Agent, and now I can download the images.

mnist/__init__.py

try:
    #from urllib.request import urlretrieve            # before
    from urllib.request import urlretrieve, URLopener  # after
except ImportError:
    #from urllib import urlretrieve # py2            # before
    from urllib import urlretrieve, URLopener # py2  # after


# ... code ...


def download_file(fname, target_dir=None, force=False):

    # ... code ...

    if force or not os.path.isfile(target_fname):
        url = urljoin(datasets_url, fname)

        # before
        #urlretrieve(url, target_fname)   

        # after
        opener = URLopener()
        opener.addheader('User-Agent', "Mozilla/5.0")
        opener.retrieve(url, target_fname)

Test code:

import mnist
import numpy as np

print(mnist.__file__)         # to see if it uses the local version with the changes
print(mnist.datasets_url)
print(mnist.temporary_dir())  # to see where files are downloaded

mnist_dataset = mnist.test_images().astype(np.float32)
print(mnist_dataset)

Tested only with Python 3.8; Python 2.x would need separate testing.
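The same patch can also be tried without editing the installed package, by building the opener at runtime and calling retrieve yourself. A minimal sketch (the helper name make_opener is mine, and note that URLopener has been deprecated since Python 3.3, so this only works on interpreters that still ship it):

```python
from urllib.request import URLopener

def make_opener(user_agent='Mozilla/5.0'):
    # Build a URLopener that also sends a browser-like User-Agent header;
    # the default Python user agent is what some servers reject with 403.
    opener = URLopener()
    opener.addheader('User-Agent', user_agent)
    return opener

# make_opener().retrieve(url, filename) would then download with the extra
# header, e.g. for the file from the traceback:
# make_opener().retrieve(
#     'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
#     't10k-images-idx3-ubyte.gz')
```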

furas

Another solution would be to roll your own code to download the files, using the function I found here and the file URLs from the MNIST site ...

import requests, gzip

urls = [(r'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz','training_images.gz'),
        (r'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz','training_labels.gz'),
        (r'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz','test_images.gz'),
        (r'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz','test_labels.gz')]

def download_url(url, save_path, chunk_size=128):
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)

for url, path in urls:
    download_url(url, path)

That works fine; it just downloads the gzipped files to the current working directory. You would still need to unzip them.

wwii

Using the other answers, I was able to build a solution that allows direct usage of the package.

The following code has to be executed only once and works globally:

from six.moves import urllib

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

With this, all mnist features can be used and files will be downloaded as needed. I execute the code above in a Jupyter Notebook cell directly before calling `mnist_dataset = mnist.test_images().astype(np.float32)`.
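On Python 3 the same effect can be achieved with the standard library alone; six is only needed if the code must also run on Python 2. A sketch of the Python-3-only variant:

```python
import urllib.request

# Install a global opener whose requests carry a browser-like User-Agent,
# so any code that goes through urllib.request (including mnist's internal
# urlretrieve call) inherits the header.
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
```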

qalis