7

I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9, and in my initialization script I install Python 3 and some packages from a requirements.txt file, then set the Python 3 environment variable to force PySpark to use Python 3. But when I submit a job on a cluster created with this image (with the single-node flag for simplicity), the job can't find the installed packages. If I log on to the cluster machine and run the pyspark command, the Anaconda PySpark starts, but if I log on as the root user and run pyspark, I get PySpark with Python 3.5.3. This is very strange. What I don't understand is: which user is used to create the image? Why do I have a different environment for my user and the root user? I expected the image to be provisioned as the root user, so I expected all my installed packages to be found by the root user. Thanks in advance.
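For reference, a minimal sketch of the kind of customization script described above (the file paths, package staging location, and the way the environment variable is set are illustrative assumptions, not the actual script):

#!/bin/bash
# Hypothetical custom image customization script (sketch only).
set -euxo pipefail

# Install Python 3 and pip.
apt-get update
apt-get install -y python3 python3-pip

# Install packages from a requirements.txt assumed to be staged on the image.
python3 -m pip install -r /tmp/requirements.txt

# One common way to force PySpark to use python3.
echo 'export PYSPARK_PYTHON=python3' > /etc/profile.d/pyspark-python.sh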

Dagang
Claudio

1 Answer

4

Updated answer (Q2 2021)

The customize_conda.sh script is the recommended way of customizing Conda env for custom images.

If you need more than what the script does, you can read its code and create your own script, but usually you want to use the absolute paths, e.g., /opt/conda/anaconda/bin/conda, /opt/conda/anaconda/bin/pip, /opt/conda/miniconda3/bin/conda, /opt/conda/miniconda3/bin/pip, to install/uninstall packages for the Anaconda/Miniconda env.
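For illustration, a hedged sketch of what that could look like (the <package> names are placeholders):

# Install into the Anaconda env using absolute paths (Anaconda-based images).
/opt/conda/anaconda/bin/conda install -y <package>
/opt/conda/anaconda/bin/pip install <package>

# Or, for Miniconda-based images:
/opt/conda/miniconda3/bin/conda install -y <package>
/opt/conda/miniconda3/bin/pip install <package>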

Original answer

I'd recommend you first read Configure the cluster's Python environment which gives an overview of Dataproc's Python environment on different image versions, as well as instructions on how to install packages and select Python for PySpark jobs.

In your case, 1.4 already comes with Miniconda3. Init actions and jobs are executed as root. /etc/profile.d/effective-python.sh is executed to initialize the Python environment when the cluster is created. But because the custom image script runs first and optional component activation runs afterwards, Miniconda3 was not yet initialized at custom image build time, so your script actually customized the OS's system Python; then, during cluster creation, Miniconda3 initialized its own Python, which overrides the OS's system Python. That is why your packages end up in the system Python while jobs run with the Miniconda Python.

I found a solution: in your custom image script, add this code at the beginning; it will put you in the same Python environment as that of your jobs:

# This is /usr/bin/python
which python 

# Activate miniconda3 optional component.
cat >>/etc/google-dataproc/dataproc.properties <<EOF
dataproc.components.activate=miniconda3
EOF
bash /usr/local/share/google/dataproc/bdutil/components/activate/miniconda3.sh
source /etc/profile.d/effective-python.sh

# Now this is /opt/conda/default/bin/python
which python 

Then you can install packages, e.g.:

conda install <package> -y
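If you install from a requirements.txt as in the question, you could likewise use the pip of the now-activated environment (a sketch; the file path is an assumption):

# After sourcing effective-python.sh above, pip resolves to the conda env.
which pip    # e.g. /opt/conda/default/bin/pip
pip install -r /tmp/requirements.txt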
Dagang
  • Thank you for your suggestions! I made the image, but when I create the cluster with this image I get an error and the cluster can't be created. The error is: Failed to initialize node cluster-py-m: Optional component miniconda3 failed to initialize. It happens when the Google cluster startup script runs: cmd='activate_component miniconda3' – Claudio Jul 17 '19 at 09:32
  • Yeah, I reproduced the problem. I think you might need to modify the Miniconda activation script in your custom image script. Miniconda is supposed to be activated during cluster creation. I will do a test and reply later. – Dagang Jul 17 '19 at 16:37
  • Seems you have to install `conda` in addition to other packages: `conda install conda -y`. – Dagang Jul 17 '19 at 21:55