
Is it possible to install python packages in a Google Dataproc cluster after the cluster is created and running?

I tried to use "pip install xxxxxxx" in the master command line but it does not seem to work.

Google's Dataproc documentation does not mention this situation.

Pablo Brenner

1 Answer


This is generally not possible after the cluster is created; I recommend using an initialization action instead.
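A minimal sketch of wiring up an initialization action at cluster-creation time. The bucket name, cluster name, script name, and region are placeholders, not values from the question:

```shell
# Upload the init script to GCS, then reference it when creating the cluster.
gsutil cp install-deps.sh gs://my-bucket/init/install-deps.sh

gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/init/install-deps.sh
```

The script runs on every node (master and workers) during cluster creation, so packages are available cluster-wide before any jobs start.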

As you've noticed, pip is also not available by default, so you'll want to run easy_install pip followed by your pip install command.
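The init script itself could look like the sketch below. The package names are illustrative, not from the question:

```shell
#!/bin/bash
# install-deps.sh: runs on each node at cluster creation.
set -euxo pipefail

# Bootstrap pip first (it is not installed by default on these images),
# then install whatever packages your jobs need.
easy_install pip
pip install numpy requests
```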

Finally, if you intend to use this cluster in any automation, or you want hermeticity, I recommend building a wheel, storing it in GCS, and downloading and installing it in the init action. Wheels have the added benefit of installing faster than pulling many packages directly from pip.
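The wheel workflow could be sketched as follows; the package name, version, and bucket are hypothetical:

```shell
# Locally: build a wheel for your package (assumes a setup.py)
pip install wheel
python setup.py bdist_wheel

# Upload the built wheel to GCS
gsutil cp dist/mypackage-0.1.0-py2.py3-none-any.whl gs://my-bucket/wheels/

# In the init action: download the wheel and install it offline
gsutil cp gs://my-bucket/wheels/mypackage-0.1.0-py2.py3-none-any.whl /tmp/
pip install /tmp/mypackage-0.1.0-py2.py3-none-any.whl
```

Because the wheel is prebuilt and versioned in GCS, every cluster created from it gets exactly the same code, independent of what PyPI serves that day.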

2019 Update

See this tutorial on how to configure Python environment on Dataproc: https://cloud.google.com/dataproc/docs/tutorials/python-configuration
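Following that tutorial, newer image versions let you declare pip packages as a cluster property at creation time; the version pins below are purely illustrative:

```shell
# Install pinned pip packages cluster-wide via a cluster property
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties='dataproc:pip.packages=pandas==0.25.1,scipy==1.3.1'
```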

tix
  • Thanks a lot. When using the Jupyter Notebook, I was able to install the packages with !pip install package in the notebook. – Pablo Brenner May 15 '18 at 19:02
  • Great article on setting up production pyspark jobs if that's what your after, including bundling modules using a Makefile and deploying when running jobs: https://developerzen.com/best-practices-writing-production-grade-pyspark-jobs-cb688ac4d20f – Daniel Messias Jun 08 '18 at 09:08