9

Using mrjob to run Python code on Amazon's Elastic MapReduce, I have found a way to upgrade the EMR image's numpy and scipy.

Running the following commands from the console works:

    tar -cvf py_bundle.tar mymain.py Utils.py numpy-1.6.1.tar.gz scipy-0.9.0.tar.gz

    gzip py_bundle.tar 

    python my_mapper.py -r emr --python-archive py_bundle.tar.gz --bootstrap-python-package numpy-1.6.1.tar.gz --bootstrap-python-package scipy-0.9.0.tar.gz > output.txt 

This successfully bootstraps the latest numpy and scipy into the image and works perfectly. My question is a matter of speed: the bootstrap takes 21 minutes to install on a small instance.

Does anyone have any idea how to speed up the process of upgrading numpy and scipy?

John Vandenberg
jtman
  • Your problem is that it's the small instance that is slow. I think you won't see any real speedup unless you move to larger Amazon instances. Is this 21 minutes over and above the ~5-6 minutes that it usually requires for EC2 to spin up the instances at all? – ely Jan 11 '12 at 01:30
  • I agree that communication with the original spin-up takes a long time itself. Someone in the mrjob community recommended doing this install on a worker instance, then using ssh to log into the worker instance and download the completed install directory. Then I just pass that completed install directory as a zip with my files, and Python uses the local NumPy and SciPy instead of the versions installed on the Hadoop nodes. – jtman Jan 17 '12 at 14:53
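
A rough sketch of the workflow described in the comment above, with placeholder hostnames, key files, and install paths (the real site-packages location depends on the EMR image's Python):

    # One-time: after a cluster has bootstrapped numpy/scipy the slow way,
    # log into a node and archive the already-built packages.
    # Hostname, key file and paths below are placeholders.
    ssh -i mykey.pem hadoop@ec2-xx-xx-xx.compute-1.amazonaws.com \
        "tar -czf pylibs.tar.gz -C /usr/lib/python2.6/site-packages numpy scipy"

    # Download the archive and ship it with the job on later runs,
    # so nothing has to be compiled at cluster startup.
    scp -i mykey.pem hadoop@ec2-xx-xx-xx.compute-1.amazonaws.com:pylibs.tar.gz .
    python my_mapper.py -r emr --python-archive pylibs.tar.gz > output.txt

The compile cost is then paid once; every later job only uploads and unpacks the archive.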

2 Answers

5

The only way to do anything to an EMR image is by using bootstrap actions. Doing this from the console only changes the master node, not the task nodes, which do the processing. Bootstrap actions run once at startup on all nodes and can be a simple script that gets shell exec'd.

    elastic-mapreduce --create --bootstrap-action "s3://bucket/path/to/script" ...

To speed up changes to the EMR image, tar up the post-install files and upload the archive to S3, then use a bootstrap action to download and deploy it. You will have to keep separate archives for 32-bit (micro, small, medium) and 64-bit machines.

The command to download from S3 in the script is:

    hadoop fs -get s3://bucket/path/to/archive /tmp/archive
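
As a rough illustration (not a tested recipe), the whole bootstrap script could look like the sketch below; the bucket, archive name, and install path are placeholders, and the untar destination depends on how the archive was built:

    #!/bin/bash
    # Bootstrap action: deploy a pre-built numpy/scipy archive instead of compiling it.
    # Bucket, archive name and install path are placeholders.
    set -e
    hadoop fs -get s3://bucket/path/to/archive /tmp/archive
    # Assumes a gzipped tar; adjust the flags to match how the archive was created.
    sudo tar -xzf /tmp/archive -C /usr/lib/python2.6/site-packages

Since each node only downloads and untars the archive, the per-node cost drops from a long compile to a short copy.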
nkadwa
3

The current answer to this question is that NumPy now comes preinstalled on EMR.

If you want to update NumPy to a more recent version than the one available, you can run a script (as a bootstrap action) that does sudo yum -y install numpy. NumPy is then installed in no time.
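
For concreteness, a bootstrap script following this suggestion could be as short as the sketch below (an illustration, not a tested recipe; see the comment further down about pip possibly being needed where yum is not enough):

    #!/bin/bash
    # Bootstrap action: install/upgrade NumPy from the yum repositories.
    sudo yum -y install numpy
    # Fallback mentioned in the comments if yum does not provide it (untested):
    # sudo pip install --upgrade numpy

Upload the script to S3 and point a bootstrap action at it when creating the cluster, as in the first answer.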

Eric O Lebigot
  • Can you provide any references for this? – Abhinav Vishak May 31 '18 at 20:50
  • At the time of writing, you could just log into an EMR instance and see in a Python shell that import numpy worked. Similarly, yum was installed and upgraded Python as well. The only thing I could imagine may have changed is that pip might be necessary if yum doesn't work. – Eric O Lebigot Jun 01 '18 at 07:54