After having worked with it for a while, I would like to understand how Colab really works and whether it is safe to work with confidential data in it.

A bit of context. I understand the differences between Python, IPython and Jupyter Notebook described here, and I would summarize them as follows. Python is a programming language and can be installed like any other application (with sudo apt-get). IPython is an interactive command-line terminal for Python and can be installed with pip, the standard package manager for Python. pip lets you install and manage additional packages written in Python that are not part of the Python standard library. Jupyter Notebook adds a web interface and can use several kernels or backends, IPython being one of them.

What about Colab? It is my understanding that when using Colab, I get a VM from Google with Python pre-installed, as well as many other libraries (aka packages) like pandas or matplotlib. These packages are all installed in the base Python installation.
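One way to check what is actually preinstalled is to list the distributions visible to the runtime's interpreter. A minimal sketch using only the standard library; the helper name `installed_packages` is my own, not something Colab provides:

```python
from importlib import metadata
import sys


def installed_packages():
    """Return {distribution name: version} for the current interpreter."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta else None
        if name:  # skip entries with missing or broken metadata
            packages[name] = meta.get("Version")
    return packages


pkgs = installed_packages()
print(sys.executable)  # which Python the notebook is running
print(len(pkgs), "distributions installed")
for name in ("pandas", "matplotlib"):
    print(name, pkgs.get(name, "not installed"))
```

Run in a notebook cell, this shows which interpreter the kernel uses and whether pandas and matplotlib are installed for it.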

Colab VMs come with some ephemeral storage. This is equivalent to instance storage in AWS, so it will be lost when the VM runtime is interrupted, i.e. when our VM is stopped (or would you rather say... terminated?) by Google. I believe that if I were to upload my confidential data there, it would not be in my private subnet...

Mounting our Drive is hence equivalent to using an EBS volume in AWS. An EBS volume is a network-attached drive, so the data in it will persist even if the VM runtime is interrupted. EBS volumes can, however, be attached to only one EC2 instance... but I can mount my Drive in several Colab sessions. It is not exactly clear to me what these sessions are... Some users would like to create virtual environments in Colab, and it looks like mounting the Drive is a way to get around it.
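For reference, the mount is done from inside a runtime with `google.colab.drive.mount`. A minimal sketch; the wrapper `mount_drive` and its fallback outside Colab are my own additions, and `/content/drive` is just the conventional mount point:

```python
import os


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive in a Colab runtime; return False outside Colab."""
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        return False
    drive.mount(mount_point)  # triggers the Google account authorization prompt
    return True


mount_point = "/content/drive"
if mount_drive(mount_point):
    # Files under MyDrive live in Drive, not on the VM's ephemeral disk.
    print(os.listdir(os.path.join(mount_point, "MyDrive"))[:5])
else:
    print("Not running in Colab; nothing mounted.")
```

Anything written outside the mount point stays on the VM's ephemeral disk and disappears with the runtime.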

When mounting our Drive in Colab, we need to authenticate because we are giving the IP of the Colab VM access to our private subnet. Hence, if we had some confidential data, by using Colab the data would not be leaving our private company subnet...?

G. Macia

1 Answer

IIUC, the last paragraph asks the question: "Can I use IP-based authentication to restrict access to data in Colab?"

The answer is no: network address filtering cannot provide meaningful access restrictions in Colab.

Colab is a service rather than a machine. Colab backends do not have fixed IP addresses or a fixed IP address range. By analogy, there's no list of IP addresses for restricting access to a particular set of Google Drive users since, of course, Google Drive users don't have a fixed IP address. Colab users and backends are similar.

Instead of attempting to restrict access to IPs, you'll want to restrict access to particular Google accounts, perhaps using typical Drive file ACLs.
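As an illustration of account-based restriction, here is a sketch of granting a single Google account read access to a file through the Drive v3 API's `permissions.create` method. The helper names and the email address are placeholders I made up, and `service` is assumed to be an already authorized client built with `googleapiclient.discovery.build("drive", "v3", ...)`:

```python
def reader_permission(email):
    """Build a Drive v3 permission body granting read access to one account."""
    return {"type": "user", "role": "reader", "emailAddress": email}


def share_with_account(service, file_id, email):
    """Grant `email` read access to `file_id` (needs an authorized Drive client)."""
    return (
        service.permissions()
        .create(fileId=file_id, body=reader_permission(email), fields="id")
        .execute()
    )


# Hypothetical address, shown only to illustrate the shape of the request body.
print(reader_permission("colleague@example.com"))
```

Access is then tied to the account, regardless of which IP address the Colab backend or the user happens to have.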

Bob Smith