Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

What does a Kubeflow deployment look like?

A Kubeflow deployment gives you the ability to organize loosely coupled microservices as a single unit and deploy them to a variety of locations, whether on a laptop, on-premises, or in the cloud.

This codelab walks you through creating your own Kubeflow deployment using MiniKF, then running a Kubeflow Pipelines workflow with hyperparameter tuning to train and serve a model. You do all that from inside a Jupyter Notebook.

What you'll build

In this codelab, you will build a complex data science pipeline with hyperparameter tuning on Kubeflow Pipelines, without using any CLI commands or SDKs. Then, you will easily serve the model and make predictions against new data. You don't need to have any Kubernetes or Docker knowledge. Upon completion, your infrastructure will contain:

What you'll learn

What you'll need

This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.

Set up your GCP project

Follow the steps below to create a GCP project or configure your existing GCP project. If you plan to use an existing GCP project, make sure that the project meets the minimum requirements described below. The first step is to open the resource manager in the GCP Console.

Open the GCP resource manager

Create a new project or select an existing project:

99b103929d928576.png

Check the following minimum requirements:

For more help with setting up a GCP project, see the GCP documentation.

After setting up your GCP project, go directly to the instructions for installing MiniKF.

Open your pre-allocated GCP project

To open your pre-allocated GCP project, click the button below to visit the GCP Console and open the Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.

Open the GCP Console

3fdc4329995406a0.png

If the project is not already selected, click Select a project:

e8952c0b96067dea.png

Select your project. You should only have one:

fe25c1925487142.png

AWS

Make sure you have an AWS account and you are able to launch EC2 instances.

MiniKF on AWS

To install MiniKF on AWS, follow this guide.

MiniKF on Google Cloud

In the Google Cloud Marketplace, search for "MiniKF".

Open the GCP Marketplace

Select the MiniKF virtual machine by Arrikto:

d6b423c1911ea85a.png

Click the LAUNCH button and select your project:

7d07439db939b61c.png

In the Configure & Deploy window, choose a name and a zone for your MiniKF instance and leave the default options. Then click on the Deploy button:

7d5f7d17a80a1930.png

Wait for the MiniKF Compute instance to boot up:

5228086caadc44c6.png

SSH to MiniKF

When the MiniKF VM is up, connect and log in by clicking on the SSH button. Follow the on-screen instructions to run the command minikf to see the progress of the deployment of Minikube, Kubeflow, and Rok. This will take a few minutes to complete.

774e83c3e96cf7b3.png

Log in to MiniKF

When installation is complete and all pods are ready, visit the MiniKF dashboard and log in using the MiniKF username and password:

251b0bcdbf6d3c71.png

325ec8340b9f5662.png

Congratulations! You have successfully deployed MiniKF on GCP. You can now create notebooks, write your ML code, run Kubeflow Pipelines, and use Rok for data versioning and reproducibility.

During this section, you will run the OpenVaccine example, a Kaggle competition related to the need to bring the COVID-19 vaccine to mass production. The final model will predict likely degradation rates at each base of an RNA molecule.

Create a notebook server in your Kubeflow cluster

Navigate to the Notebooks link on the Kubeflow central dashboard.

fb32e496c1fe7e6.png

Click on New Server.

f9303c0a182e47f5.png

Specify a name for your notebook server.

a2343f30bc9522ab.png

Make sure you have selected the following Docker image (Note that the image tag may differ):

gcr.io/arrikto/jupyter-kale:f20978e

Click on Launch to create the notebook server.

28c024bcc55cc70a.png

When the notebook server is available, click on Connect to connect to it.

52f1f8234988ceaa.png

Download the data and notebook

A new tab will open up with the JupyterLab landing page. Create a new terminal in JupyterLab.

8427706679170147.png

In the terminal window, run this command to download the notebook and the data that you will use for the remainder of the lab:

git clone https://github.com/kubeflow-kale/kale

The cloned repository contains a series of curated examples with data and annotated notebooks.

In the sidebar, navigate to the folder kale/examples/openvaccine-kaggle-competition/ and open the notebook open-vaccine.ipynb.

8ba8593b7fe2f2ff.png

Explore the ML code of the OpenVaccine example

Run the imports cell to import all the necessary libraries. Note that the code fails because a library is missing:

99570ea5d75ca5e3.png

Normally, you would have to build a new Docker image that includes the newly installed libraries in order to run this notebook as a Kubeflow pipeline. Fortunately, Rok and Kale make sure that any libraries you install during development find their way into your pipeline, thanks to Rok's snapshotting technology and Kale mounting those snapshotted volumes into the pipeline steps.
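The install-in-place pattern the notebook relies on can be sketched as follows. This is an illustrative helper, not code from the notebook; the package name is a stand-in for whatever library the imports cell complains about:

```python
import importlib.util
import subprocess
import sys

def ensure_installed(package: str) -> None:
    """Install a package with pip if it cannot be imported yet."""
    if importlib.util.find_spec(package) is None:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", package]
        )

# "json" ships with Python, so this is a no-op; replace it with the
# library that is actually missing in your notebook.
ensure_installed("json")
```

Because the package lands on the notebook's data volume, the Rok snapshot that Kale mounts into each pipeline step carries it along automatically.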

Run the cell right above to install the missing libraries:

91999df891de5944.png

Restart the notebook kernel by clicking on the Restart icon:

47a5ffafdfffe399.png

Run the imports cell again with the correct libraries installed and watch it succeed.

Convert your notebook to a pipeline in Kubeflow Pipelines

Enable Kale by clicking on the Kubeflow icon in the left pane of the notebook:

83087d281cf37bac.png

Enable Kale by clicking on the slider in the Kale Deployment Panel:

2ff8df8b7ca29ab8.png

Explore the per-cell dependencies within the notebook. See how multiple notebook cells can be part of a single pipeline step, as indicated by color bars on the left of the cells, and how a pipeline step may depend on previous ones, as indicated by depends on labels above the cells. For example, the image below shows multiple cells that are part of the same pipeline step. They have the same magenta color and they depend on a previous pipeline step named "load_data".

741017ab9e1cdb7c.png

Click on the Compile and Run button

da04e681c828717b.png

Now Kale takes over, converting your notebook to a KFP pipeline. Since Kale integrates with Rok to take snapshots of the current notebook's data volume, you can also watch the progress of the snapshot. Rok takes care of data versioning and reproduces the whole environment as it was when you clicked the Compile and Run button. This way, you have a time machine for your data and code, and your pipeline runs in the same environment where you developed your code, without needing to build new Docker images.

de1b88af76df1a9a.png

The pipeline was compiled and uploaded to Kubeflow Pipelines. Now click the link to go to the Kubeflow Pipelines UI and view the run.

e0b467e2e7034b5d.png

The Kubeflow Pipelines UI opens in a new tab. Wait for the run to finish.

6b803e7a3a2eb427.png

4d0c88c7ed10bc4.png

Congratulations! You just ran an end-to-end pipeline in Kubeflow Pipelines, starting from your notebook!

Now that you have run a single pipeline, it's time to optimize your model using hyperparameter tuning. We are going to use Katib, Kubeflow's official hyperparameter tuner, to perform this job. Kale will orchestrate Katib and KFP experiments so that every Katib trial is a pipeline run in Kubeflow Pipelines.

If you go back to your notebook, you will find the following cell at the top, declaring hyperparameters:

a24921a0f79b649c.png
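A Kale pipeline-parameters cell is simply a cell of plain assignments that Kale exposes as pipeline parameters (and, with Katib enabled, as tunable hyperparameters). The names and values below are illustrative, not the exact ones from the OpenVaccine notebook:

```python
# Cell tagged as "pipeline-parameters" in Kale: each top-level assignment
# becomes a pipeline parameter. Names and defaults here are illustrative.
LR = 0.001
EPOCHS = 30
BATCH_SIZE = 64
```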

We also need to declare the metric to optimize the model against, and we want Kale to create a pipeline that outputs native KFP metrics. Here, we are going to use validation_loss as the metric. To do this, go to the end of the notebook and create a new cell.

228fe6d4d5fb5b0c.png

Type:

print(validation_loss)

c2367c69c30ad7a2.png

Then click on the pencil icon to edit this cell and tag it as Pipeline Metrics:

169a6c1ee9f82ded.png

Now, you should have a cell like the following:

6be7cbd006b223bf.png

That's all you need. Now, every time you use Kale to convert this notebook, the resulting pipeline will produce a KFP metric with the value of the validation_loss.
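Putting the two pieces together, the metrics cell is nothing more than a print of a numeric variable; Kale exports each printed name as a KFP metric. A minimal sketch, with a placeholder value since validation_loss is computed by the training code earlier in the real notebook:

```python
# Placeholder value; in the notebook, validation_loss comes from training.
validation_loss = 0.123

# Cell tagged as "Pipeline Metrics": printing the variable is all Kale
# needs to export a KFP metric named validation_loss.
print(validation_loss)
```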

Now, let's enable Katib to run hyperparameter tuning by clicking the toggle on the left pane:

2801b458f330f43.png

Then click on Set up Katib Job to configure Katib:

d42a6bb6e4283e2f.png

Note that Kale is smart enough to auto-detect the hyperparameter tuning parameters and their types from the notebook, since we defined a pipeline-parameters cell. Configure the search space for each parameter (everything is already prefilled for you; this information is stored in the notebook metadata, so it is persistent and portable):

55f3c1b9e72fa63c.png

Also, define a search algorithm, a search objective, and run parameters. Everything is prefilled, so you don't need to change anything.

15b8c7fa1dbe303c.png

Close this dialog and then click on the Compile and Run Katib Job button:

2fceb36be30ef1b4.png

Watch the progress of the Katib experiment:

f3514011876564db.png

Click on View to see the Katib experiment:

3da9118b5e54f4c4.png

This is the Katib UI and here you can see the progress of the experiment you just started:

dcb665ac78737622.png

Actually, this is a brand-new UI we have built for Katib from scratch. We have made various improvements to show much more detailed information about the experiment.

When the Katib experiment is completed, you should see a graph like the following:

da217c3b05496984.png

Note that the plot is interactive so that you can explore more easily how the various configurations affected the pipeline metrics.

50024da777682a70.png

Moreover, you can see the configuration of the best trial and its performance:

edf6932727542eb5.png

If you go to the TRIALS tab you will find a list with all the trials. Hover over one of them and see how it gets highlighted on the plot:

e11b3edccce671b4.png

The best Katib trial is highlighted in this list:

edfec48178609876.png

If you click on the pipeline icon to the right you will be redirected to the corresponding Kubeflow pipeline run:

86d3d619e7bae818.png

Here is the run:

a5208948fff9801e.png

Note that some steps have a recycle icon. This means that these steps were cached. When running hyperparameter tuning, you don't need to re-run all the steps, only those that depend on the hyperparameters. The caching mechanism relies on the snapshotting capabilities of Rok.

If you go to the Config tab, you will find more details about the run. You will also find a link back to the original Katib experiment. This cross-linking between Katib and KFP is possible because the workflow was created by Kale, which orchestrates the various Kubeflow components and adds the annotations and semantics they need to communicate with each other.

621c0beedc3c05d2.png

Restore notebook from a pipeline step

Let's unpack what happened in the previous step. Kale produced multiple pipeline runs, where each one is fed with a different combination of arguments.

Katib is Kubeflow's component to run general purpose hyperparameter tuning jobs. Katib does not know anything about the jobs that it is actually running (called Trials in the Katib jargon). All that it cares about is the search space, the optimization algorithm, and the goal. Katib supports running simple Jobs (that is, Pods) as trials, but Kale implements a shim to have the Trials actually run pipelines in Kubeflow Pipelines, and then collect the metrics from the pipeline runs.

Now that you have run a hyperparameter optimization on your model, you probably want to take the best model and serve it. We are going to use KF Serving, Kubeflow's component for serving models to production.

What we are going to do is select the best Trial of the Katib experiment and restore a notebook out of a snapshot of this pipeline run. We will use Rok to restore the notebook and we will have the model directly in the notebook's memory. Then, we will use an intuitive Kale API to serve this model, without having to create any Docker images or submit new CRs. Kale will make the whole process seamless.

You are already on the best pipeline run, so click on the Model evaluation step and then go to the Visualizations tab. Here is where you see the pipeline artifacts that Kale produced:

51f9f6afce4ba89a.png

The first artifact is a Rok snapshot taken just before the execution of the Model evaluation step. Click on the corresponding link to view the snapshot in the Rok UI:

45f1c43aac41b65.png

Copy the Rok snapshot URL:

ebfa8e83a4297976.png

Navigate to the Notebooks link:

d5787b3c15e282ab.png

Click on New Server:

cd3894e33193968b.png

Paste the Rok URL you copied previously. The notebook info will get autofilled.

60f71818fcc70239.png

Specify a name for your notebook:

a7d08cef5b701982.png

Click Launch to create the notebook server.

3ac75dde9a269727.png

When the notebook server is available, click Connect to connect to it.

bfd47345ad7dc1e2.png

In the background, Kale resumes the notebook's state by importing all the libraries and loading the variables from the previous steps. Note: when you click Connect and the new Jupyter tab opens, wait a few seconds until you see a pop-up saying that Kale has completely restored the notebook.

Serve the model from inside the notebook

Kale has unmarshalled all the necessary data so that the current in-memory state is exactly the same as the one that we would have found at that specific point in time in the pipeline execution. Thus, we now have in our memory the model of the best Katib trial. You should now see the following screen:

65bd8d0144c0155a.png

Create a new cell, type model and run the cell to verify that the model is indeed in memory.

c3dddb2e3f26fcaf.png

Let's now serve this model with KFServing using the Kale API. Go to the end of the notebook, add a cell, type the following command, and then run it:

from kale.common.serveutils import serve

bc7a78721838a660.png

Serving the model is now as easy as running a simple command. In this tutorial, we will also pass a preprocessing function and then serve the model. We first have to define the preprocessing function and the tokenizer (Note: this function is not already in memory, because the pipeline step from which we restored this notebook wasn't using it). Find the Preprocess Data section in your notebook and run the following cells.

ddd0c71ba3822706.png

11e1ff3a9b36544a.png
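As a rough, hypothetical sketch of the pattern: a pure preprocessing function plus the asset (the tokenizer) it needs, which Kale packages together for the transformer. The real process_features and tokenizer in the notebook are different and encode RNA sequences for the OpenVaccine model:

```python
# Hypothetical stand-ins for the notebook's process_features and tokenizer.
def process_features(instances, tokenizer):
    # Map each character of each sequence to an integer id (0 for unknown).
    return [[tokenizer.get(ch, 0) for ch in seq] for seq in instances]

tokenizer = {"A": 1, "C": 2, "G": 3, "U": 4}

print(process_features(["ACGU"], tokenizer))  # [[1, 2, 3, 4]]
```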

Now, go to the end of your notebook again, create a new cell, type the following command, and run it.

kfserver = serve(model, preprocessing_fn=process_features, preprocessing_assets={'tokenizer': tokenizer})

e8f57b2abe60f427.png

Kale recognizes the type of the model, dumps it in the appropriate format, takes a Rok snapshot, and creates an inference service.

After a few seconds, an inference server will be up and running.

1d6b5e55d3832a0d.png

Now, create a new cell, type kfserver and run it to see where the model is served.

15bb2c1b826c428.png

If you click on the model link you will navigate to the brand-new Models UI.

d2ce8daf8f3f567b.png

We have built a new UI for Kubeflow to expose the entire state of KF Serving. Here, you can monitor all the inference services that you deploy, see details, metrics, and logs. This is the page of our model on the Models UI.

7eb88d5a0b2efaea.png

You can see more details on the Details tab.

6e188ead57837520.png

You can also see live metrics on the Metrics tab.

9ada825dfc2a09d0.png

Live logs on the Logs tab.

7086e5f98a7ef20e.png

And finally, the model definition on the YAML tab.

65d467dd7fbfc5b8.png

And this is the homepage of the Models UI, where you can find a summary of all your model servers, create new ones, and delete existing ones.

2c0c8220cfed3b22.png

Let's now go back to the notebook and hit the model to get a prediction. First, we need to create the payload to send to the model: a dictionary with an instances key, the standard way to send data to a TensorFlow Serving server. We will pass an unprocessed dataset that we defined previously in the notebook. Type the following command and run it:

data = json.dumps({"instances": unprocessed_x_public_test})

6ee33007e76b9c52.png
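Self-contained, the request format looks like this; the batch below is a placeholder stand-in for the notebook's unprocessed_x_public_test:

```python
import json

# Placeholder stand-in for the notebook's unprocessed test batch.
unprocessed_batch = [["A", "C", "G", "U"]]

# TF Serving's request format: a JSON object whose "instances" key
# holds a batch of inputs.
data = json.dumps({"instances": unprocessed_batch})
print(data)  # {"instances": [["A", "C", "G", "U"]]}
```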

Now, let's send a request to our model server. Type and run this simple command:

predictions = kfserver.predict(data)

b8f94bcb9c699000.png

If you go back to the model page in the Models UI, you can see live logs of the transformer receiving unprocessed input data and processing it.

26de8342ffd3c07.png

Everything happened automatically, without you having to build any Docker images. Kale detected, parsed, and packaged the input-processing function and its related assets, then created a new transformer, initialized it, and used it to process the raw data.

If you go back to your notebook, you can print the predictions by adding a new cell and running this command:

print(predictions)

d0a5c6c9a3f8f076.png

This may take some time. Eventually you will see something like this:

92f79959bdcb5961.png

Congratulations, you have successfully run an end-to-end data science workflow using Kubeflow (MiniKF), Kale, and Rok!

What's next?

Join the Kubeflow Community:

Further reading