Rationale for minimal Jupyter image
We are often asked why we don’t pre-install popular Python data science packages (numpy, scipy, pandas, etc.) in our Jupyter Docker image. Many of these packages have native extensions that must be compiled, so installation can take a while. That is annoying if you’re a heavy user of those packages, because you have to wait for the installation every time you start a new Docker container from our image. We chose not to pre-install these packages because one of our major use cases involved an ephemeral compute environment in which new containers are created frequently. In this environment, minimizing the size of the Docker image helps reduce the startup time before Jupyter is responsive.
We keep our image small with two main strategies: using Alpine Linux and installing language kernels on the fly. Our image is currently [1] around 250MB uncompressed, and only the Python 2 kernel is pre-installed. By comparison, the Jupyter project’s official minimal image containing Python 3 and miniconda is about 2.1GB as of this writing. Their scipy image, with Python 2 & 3 and several pre-installed data science packages, is 4.0GB. Their data science image, which adds R and Julia to the scipy image, is 5.6GB.
As an experiment, we tried adding some of the popular data science packages to our image: numpy, scipy, matplotlib, pandas, seaborn and sklearn. That roughly doubled the size to around 500MB. Adding Python 3 with the same packages increased the size to around 900MB. These sizes were not ideal for our execution environment. In addition, some of our customers use languages other than Python, and it’s just not feasible to pre-install all of these languages and their most popular packages.
So, back to those Python data science packages. We can’t bake them into the image, but our users don’t like waiting for them to compile and install. Fortunately, we think we’ve come up with a reasonable compromise. Since most of the installation time for these packages is spent doing the native compilation, we build pre-compiled Alpine packages (apks) for some of the most popular packages. In fact, the official Alpine repository already does this for some packages, and we’ve filled in some of the gaps with our own repository.
That just leaves the question of how to install these packages. We can’t expect the user to know they need to do something beyond a pip install, and if the user has to open a terminal and install the package manually, we certainly haven’t made the process any faster. This is where ipydeps comes in. Our ipydeps package provides a simple wrapper around pip that enables you to install packages programmatically in your Jupyter notebook. Further, ipydeps can be configured to transparently intercept the pip command and install an apk instead. So, for example, if you want to use numpy, all you have to do is make a call in your notebook to have ipydeps install it, and behind the scenes we’ll install the pre-compiled apk instead.
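To make the interception idea concrete, here is a minimal sketch of how a pip wrapper could choose between a pre-compiled apk and a regular pip install. This is not the actual ipydeps implementation: the `APK_OVERRIDES` table, the `plan_install` function, and the apk package names are all hypothetical and chosen only for illustration.

```python
import sys

# Hypothetical mapping from pip package names to pre-compiled Alpine packages.
# The real names depend on which apk repositories are configured.
APK_OVERRIDES = {
    "numpy": "py-numpy",
    "scipy": "py-scipy",
}


def plan_install(package, overrides=APK_OVERRIDES):
    """Return the command (as an argument list) that would install `package`.

    If a pre-compiled apk is known for the package, prefer it and skip the
    slow native compilation; otherwise fall back to a normal pip install.
    """
    apk_name = overrides.get(package)
    if apk_name is not None:
        return ["apk", "add", "--no-cache", apk_name]
    return [sys.executable, "-m", "pip", "install", package]
```

In a real wrapper, the returned list would be passed to something like `subprocess.check_call`. The sketch only builds the command so that the decision logic can be read and tested without installing anything.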
[1] This post was written in early 2017. As of March 2018, version 7 of the image is about 340MB uncompressed, with both Python 2 and 3 pre-installed.