Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

What you'll need

What does a Kubeflow deployment look like?

A Kubeflow deployment is:

Kubeflow gives you the ability to organize loosely-coupled microservices as a single unit and deploy them to a variety of locations, including on a laptop, on-premises, or in the cloud.

This codelab walks you through creating your own Kubeflow deployment using MiniKF, then running a completely automated workflow to discover and train machine learning models starting from raw data. You do all that from inside a Jupyter Notebook.

What you'll build

In this codelab, you will create a mechanism to automate the discovery and training of Machine Learning models, starting from raw data. You will start from a dataset and then call a Kale API to instantiate a process that searches for the most suitable model for the input data. Then, you will automatically perform hyperparameter optimization on this model. Upon completion, your infrastructure will contain:

What you'll learn

This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.

AWS

Make sure you have an AWS account and you are able to launch EC2 instances.

Set up your GCP project

Follow the steps below to create a GCP project or configure your existing GCP project. If you plan to use an existing GCP project, make sure that the project meets the minimum requirements described below. The first step is to open the resource manager in the GCP Console.

Open the GCP resource manager

Create a new project or select an existing project:

99b103929d928576.png

Check the following minimum requirements:

For more help with setting up a GCP project, see the GCP documentation.

After setting up your GCP project, go directly to the instructions for installing MiniKF.

Open your pre-allocated GCP project

To open your pre-allocated GCP project, click the button below to visit the GCP Console and open the Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.

Open the GCP Console

3fdc4329995406a0.png

If the project is not already selected, click Select a project:

e8952c0b96067dea.png

Select your project. You should only have one:

fe25c1925487142.png

Vagrant

Make sure you have installed Vagrant and VirtualBox.

MiniKF on AWS

To install MiniKF on AWS, follow this guide.

MiniKF on Google Cloud

In the Google Cloud Marketplace, search for "MiniKF".

Open the GCP Marketplace

Select the MiniKF virtual machine by Arrikto:

d6b423c1911ea85a.png

Click the LAUNCH button and select your project:

7d07439db939b61c.png

In the Configure & Deploy window, choose a name and a zone for your MiniKF instance and leave the default options. Then click on the Deploy button:

7d5f7d17a80a1930.png

Wait for the MiniKF Compute instance to boot up:

5228086caadc44c6.png

SSH to MiniKF

When the MiniKF VM is up, connect and log in by clicking on the SSH button. Follow the on-screen instructions to run the command minikf to see the progress of the deployment of Minikube, Kubeflow, and Rok. This will take a few minutes to complete.

774e83c3e96cf7b3.png

Log in to MiniKF

When installation is complete and all pods are ready, visit the MiniKF dashboard and log in using the MiniKF username and password:

251b0bcdbf6d3c71.png

325ec8340b9f5662.png

Congratulations! You have successfully deployed MiniKF on GCP. You can now create notebooks, write your ML code, run Kubeflow Pipelines, and use Rok for data versioning and reproducibility.

MiniKF on Vagrant

To install MiniKF on your desktop/laptop using Vagrant, follow this guide.

In this section, you will run the Blue Book for Bulldozers example, based on a Kaggle competition. The goal is to predict the sale price of bulldozers sold at auctions.

Create a notebook server in your Kubeflow cluster

Navigate to the Notebooks link on the Kubeflow central dashboard.

5a702aa9ec05e8a6.png

Click on New Server.

f9303c0a182e47f5.png

Specify a name for your notebook server.

a2343f30bc9522ab.png

Make sure you have selected the following Docker image.

gcr.io/arrikto/jupyter-kale-py36:kubecon21eu-automl-nightly

Click on Launch to create the notebook server.

28c024bcc55cc70a.png

When the notebook server is available, click on Connect to connect to it.

905aa2469758c7ad.png

Download the data and notebook

A new tab will open up with the JupyterLab landing page. Create a new terminal in JupyterLab.

8427706679170147.png

In the terminal window, run this command to download the notebook and the data that you will use for the remainder of the lab:

git clone https://github.com/kubeflow-kale/kale -b kubecon21eu

The cloned repository contains a series of curated examples with data and annotated notebooks.

In the sidebar, navigate to the folder kale/examples/bulldozers-kaggle-competition/ and open the notebook blue-book-bulldozers.ipynb.

50fcd40f9d90e409.png

In this section, you will create an AutoML Workflow starting from your Notebook.

Introduction to AutoML

AutoML is a very broad concept; it includes several Machine Learning techniques, such as Hyperparameter Tuning (HP), Neural Architecture Search (NAS), and Meta-Learning. In this tutorial, when talking about AutoML, we refer to a combination of HP tuning and Meta-Learning.

If you don't know how Kale can simplify HP tuning in Kubeflow, we suggest this tutorial as well (https://arrik.to/democ2p). It is not a prerequisite for completing this tutorial, so keep it in mind and come back to it later.

Meta-Learning is a technique that focuses on "learning to learn". It usually refers to a system that improves the learning of complex tasks by reusing previous experience. More concretely, given just a dataset and a task (e.g., classification or regression) as input, the Meta-Learning system employed by Kale suggests some model architectures that it thinks will perform well. This suggestion system is based on prior knowledge of a wealth of other datasets, and on how closely your input data relates to some of them.

Consequently, you are able to discover and test hundreds of candidate models without breaking a sweat!

High-level overview of the tutorial

This is a high-level overview of the components that the Kale AutoML experiment orchestrates.

It all starts from the notebook, where you call a Kale API providing a dataset and a task. Then Kale takes over; you can sit back and relax, and watch the following process unfold as Kale:

  1. uses Meta-Learning to analyze the input dataset and suggest some model architectures that should perform well
  2. starts KFP pipelines to train these model architectures
  3. collects all the results from the previous step. As soon as all pipelines are done, it retrieves the best-performing one, based on the target metric
  4. starts a new Katib HP tuning experiment on the best model from the previous step, to further optimize its initialization parameters
  5. continually logs models, datasets, and TensorBoard reports to MLMD, using reproducible Rok snapshots. You have complete lineage of your experiment, and everything is persisted in immutable Rok snapshots

Finally, once all of this is done, you can go back to your notebook and, with a single Kale API call, serve the best-performing model or spin up a TensorBoard server to inspect its training logs.

You can do this anytime, since all models and TensorBoard reports are immutable Rok snapshots.

Now that you have a high-level picture of the entire workflow, let's dive into the individual parts.

Prepare your environment

Install the necessary dependencies.

784635aae8e35ed1.png

Normally, to run this notebook as a Kubeflow pipeline, you would have to build a new Docker image that includes the newly installed libraries. Fortunately, Rok and Kale make sure that any libraries you install during development find their way into your pipeline, thanks to Rok's snapshotting technology and to Kale mounting those snapshotted volumes into the pipeline steps.
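The install cell is shown only as a screenshot above; as a rough sketch, it installs the extra AutoML-related libraries with pip from inside the notebook. The package list below is an assumption, so check the actual cell in the provided notebook:

# Notebook cell (sketch): install the extra libraries this example needs.
# The package list is an assumption; the real cell is in the provided notebook.
!pip3 install --user auto-sklearn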

Restart the notebook kernel by clicking on the Restart icon:

f2e5d588ac0e943e.png

Run the imports cell to import all the necessary libraries.

47d475f3de9e7d39.png

Load and prepare your data

Load the dataset:

88c74bf52ac9dac9.png

Our target variable is SalePrice. Let's keep it in a variable:

cfff441b0b12758c.png

Encode the ordinal variables:

38e17c4577f8a431.png

Unfold the dates to engineer more features:

7ad224e84a65b816.png

Split the dataset into train and valid sets:

143589fdbde853ce.png

Extract features and labels into numpy arrays:

66f5732c8a3e2681.png
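If you cannot see the screenshots, here is a rough sketch of what these preparation cells do, assuming the Blue Book for Bulldozers CSV with a SalePrice target, a saledate column, and ordinal columns such as UsageBand and ProductSize; the exact file name, column list, and split used by the notebook may differ:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (the file name is an assumption; use the CSV shipped with the example).
df = pd.read_csv("TrainAndValid.csv", low_memory=False, parse_dates=["saledate"])

# Our target variable is SalePrice; keep its name in a variable.
target = "SalePrice"

# Encode the ordinal variables (the column list here is illustrative).
for col in ["UsageBand", "ProductSize"]:
    df[col] = df[col].astype("category").cat.codes

# Unfold the sale date to engineer more features.
df["saleYear"] = df["saledate"].dt.year
df["saleMonth"] = df["saledate"].dt.month
df["saleDay"] = df["saledate"].dt.day
df = df.drop(columns=["saledate"])

# Split the dataset into train and validation sets.
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)

# Extract features and labels into numpy arrays.
x_train = train_df.drop(columns=[target]).to_numpy()
y_train = train_df[target].to_numpy()
x_valid = valid_df.drop(columns=[target]).to_numpy()
y_valid = valid_df[target].to_numpy()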

Group together the dataset using the Kale Dataset API:

2cc3f0e5ee2f0520.png
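The Dataset cell simply groups the arrays above into a single object that the AutoML API can consume. The class name, import path, and argument names below are hypothetical placeholders; the notebook cell shows the real Kale API:

# Hypothetical sketch of grouping the data for Kale; see the notebook cell for the real API.
from kale.ml import Dataset  # hypothetical import path

dataset = Dataset(
    features_train=x_train, targets_train=y_train,  # hypothetical argument names
    features_test=x_valid, targets_test=y_valid,
)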

Start the AutoML workflow using Kale

Now, we are ready to run our AutoML experiment using Kale.

Create a Katib configuration. This step will ensure that after finding the best configuration, Kale will start a Katib experiment to further optimize the model. In the next section, you will find more information about how we use Katib to perform hyperparameter optimization on the best-performing configuration.

2f196316f94ae69d.png

Run the AutoML experiment:

d86af5b4434559ef.png
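Taken together, the two cells above provide a small Katib search budget and then hand the dataset and task over to Kale. The function and argument names below are hypothetical placeholders that only mirror what this codelab describes (a regression task, an optimization metric, 4 configurations, at most 2 parallel KFP runs, and at most 2 Katib trials); the notebook cells show the actual Kale API:

# Hypothetical sketch of the AutoML cells; names and signatures are assumptions.
from kale.ml import run_automl  # hypothetical import path

katib_config = {
    "max_trial_count": 2,        # at most 2 Katib trials, as configured in this codelab
    "parallel_trial_count": 1,
}

automl_task = run_automl(
    dataset=dataset,                  # the Kale Dataset created earlier
    task="regression",                # predict the bulldozer sale price
    metric="mean_squared_log_error",  # the target metric used to compare runs
    number_of_configurations=4,       # ask auto-sklearn for 4 configurations
    max_parallel_runs=2,              # start at most 2 KFP runs at a time
    katib_config=katib_config,
)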

Kale creates a Kubeflow Pipelines (KFP) experiment. Click on Experiment details to view it:

d5ae27fab5f213c9.png

This will open a page like this one:

6342096e8ee38f55.png

Kale creates a KFP pipeline that orchestrates the AutoML workflow. Click on Run details to view it:

dc8c2ca4c8b57c9a.png

This will open a page like this one:

f3bda1a52d45cc4a.png

You can monitor the experiment by printing a summary of the AutoML task at any point in time. This is what it looks like when all the runs have completed.

fdcc276e5e353611.png

In this section, we will reveal the magic behind the AutoML workflow that we created previously.

Get the different configurations

Let's take a look at the orchestration pipeline that Kale produced. During the sklearn-get-configurations step, Kale asks auto-sklearn to provide AutoML configurations. A configuration is an auto-sklearn "suggestion", composed of a model architecture and a combination of model parameters that auto-sklearn thinks will perform well on the provided dataset.

509a155c85c00ce1.png

In this example, Kale asks auto-sklearn to produce 4 different configurations. You defined this in a notebook cell previously, as shown below:

844f4ec9922d2b48.png

In a production environment, this number would be in the order of tens or hundreds. Kale would ask auto-sklearn to suggest many different configurations based on the provided task and dataset.

Note that some configurations could share the same model architecture but have different initialization parameters.

If you click the step and go to the ML Metadata tab, you will see the configurations that auto-sklearn produced:

9ad2b6beb697b56f.png

Let's click on the second configuration:

1042e841c2ecfdd4.png

We can see the model and the initialization parameters:

70cef4649d4cf862.png

Create a KFP run for each configuration

Auto-sklearn configurations are just "empty" descriptions of a model and its parameters. We need to actually instantiate these models and train them on our dataset.

Kale starts a KFP run for each configuration; that is, it starts 4 pipeline runs. By default, these KFP runs execute in parallel, so this can become a massive task for your cluster. Here, we have configured Kale to start just 2 runs simultaneously at any given point in time. We did this previously in the notebook:

34856e65db245754.png

Kale produces these KFP runs during the sklearn-run-configurations step:

6a396486e265e58.png

Notice the pipeline icon that this step has. This means that it produces one or more KFP runs. If you click on the step and go to the ML Metadata tab, you will find the 4 runs that Kale produces during this step (Kale populates the ML Metadata tab with these links at runtime, as it starts the runs):

c59454c352d4295.png

These pipelines look identical. They just train a different model and/or use different initialization parameters. Let's click on one of them to view the KFP run:

e81f15af366aa963.png

As you can see, the pipeline consists of an sklearn-transformer step that transforms the dataset, an sklearn-estimator step that trains the model, and an sklearn-predictor step that predicts over the test dataset.

During the pipeline runs, Kale logs MLMD (Machine Learning Metadata) artifacts in a persistent way. These artifacts are backed by Rok snapshots, so they are versioned and persistent regardless of the pipelines' lifecycle. This way, you have complete visibility into the inputs and outputs of the pipeline steps. Most importantly, you always know the dataset, the model, and the parameters you used for the training process.

If you click on the sklearn-transformer step and go to the ML Metadata tab, you can see the artifact of a Rok snapshot that contains the input dataset:

7258a9bfc325eb4.png

And a KaleTransformer, the object that transforms the data:

e49da30ba8f9ce21.png

If you go to the ML Metadata tab of the next step, called sklearn-estimator, you can see the artifact of the trained model.

5c493550d7df0040.png

If you go to the Run output of this pipeline, you can view the pipeline Metrics. Here we have one, the mean squared logarithmic error. This metric shows you how well this particular auto-sklearn configuration performed on the dataset.

d21b5656bbb8693f.png
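For reference, this is the same metric you can compute locally with scikit-learn; a minimal sketch with toy values rather than the codelab's actual numbers:

from sklearn.metrics import mean_squared_log_error

# Toy example: compare predicted sale prices against the true ones.
y_true = [24000, 31000, 18000]
y_pred = [26000, 29500, 20000]

msle = mean_squared_log_error(y_true, y_pred)
print(f"mean squared logarithmic error: {msle:.4f}")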

Monitor the KFP runs

Now, let's go back to the orchestration pipeline and go through the next steps. The monitor-kfp-runs step waits for all the pipelines to complete, that is, the 4 KFP runs with the different configurations.

879e4d4b62a30517.png

Get the best configuration

When the runs succeed, Kale gets the one with the best performance. Based on the optimization metric you provided in the notebook, Kale knows whether it needs to look for the highest or the lowest KFP metric value. This happens in the get-best-configuration step:

4339e5ab449609a6.png
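The logic of this step boils down to selecting the run with the minimum or maximum metric value, depending on the direction of the optimization metric. A minimal illustrative sketch, not Kale's actual code:

# Illustrative sketch: pick the best run by its KFP metric value.
run_metrics = {
    "configuration-1": 0.31,  # example mean squared logarithmic error values
    "configuration-2": 0.27,
    "configuration-3": 0.35,
    "configuration-4": 0.29,
}

minimize = True  # MSLE is an error metric, so lower is better
best = min(run_metrics, key=run_metrics.get) if minimize else max(run_metrics, key=run_metrics.get)
print("best configuration:", best)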

Hyperparameter optimization

Kale has found the best configuration with the help of auto-sklearn. Note that this was a "Meta-Learning" suggestion, so no one can stop us from trying to squeeze something more out of this model architecture. We want Kale to run an HP tuning experiment, using Katib, to further explore different sets of initialization parameters, starting from the ones of this configuration. Kale takes care of creating a suitable search space around these initialization parameters.
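To make the idea of a search space built around the best configuration's parameters concrete, here is a minimal illustrative sketch; it is not Kale's actual implementation:

# Illustrative sketch: build a small search range around a best-found parameter value.
def search_range(value, spread=0.5, minimum=1):
    """Return (low, high) bounds centered around the best-found value."""
    low = max(minimum, int(value * (1 - spread)))
    high = int(value * (1 + spread))
    return low, high

# For example, if the best configuration used 100 estimators:
print(search_range(100))  # (50, 150)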

You have already configured the Katib experiment in your notebook. As in the configurations case, the more trials you run, the better your chances of improving on the original result. We expect no more than 2 trials, as you can see in the code snippet below:

2f196316f94ae69d.png

Notice that the run-katib-experiment pipeline step has the Katib logo as an icon. This means that in this step Kale produces a Katib experiment:

503db5bf76c0e5fb.png

If you click this step and go to the ML Metadata tab, you will find the link to the corresponding Katib experiment:

cc61f26438c9cb59.png

Click on the link to go to the Katib UI and view the experiment:

362eeffbf575fd76.png

Now, go to the TRIALS tab to view the 2 different trials and how they performed:

c6c3dbb920a37890.png

Notice that the best-performing trial is highlighted.

Click on the pipeline icon on the right to view the KFP run that corresponds to this Katib trial:

e97e30db14673951.png

Kale implements a shim so that the trials actually run pipelines in Kubeflow Pipelines, and it then collects the metrics from the pipeline runs. This is the KFP run of the Katib trial that performed best:

bdf582769b971ddd.png

Notice that this pipeline is exactly the same as the configuration runs we described previously. This is expected, as we are running the same pipeline and only changing the parameters to perform HP tuning. However, this KFP run has different run parameters from the configuration run. If you go to the Config tab, you will see the run parameters of this pipeline; Katib selected these specific parameters.

ec8cba94b36da09d.png

In this section, you will serve the best-performing model using Kale and KF Serving.

Create a KF Serving inference server

Now that we have found the best configuration and performed hyperparameter optimization on it, we should have an easy way to serve the corresponding trained model. But how can we find it?

As we described previously, Kale logs MLMD artifacts in a persistent way using Rok snapshots. This means that we have a snapshot of the best-performing model, as well as a snapshot of all the other trained models.

To find the best-performing model and serve it, click on the sklearn-estimator step and go to the ML Metadata tab:

caa0899b848d9e60.png

Look at the Outputs and you will see the model. Notice that it has a unique Artifact ID. In our case it is 255, but it will probably be different in your case:

9ac5e1ff41fb608c.png

Copy this Artifact ID, as we are going to need it to serve the model.

Let's go back to the Notebook and find this cell:

617e18aecc8edea7.png

Replace the placeholder with the Artifact ID you just copied and run the cell:

ac84cd4dadd6dc58.png

This triggers Kale to create a new PVC that backs this model, then configure a new KF Serving inference server and apply it. In just a few seconds you will have a model running in Kubeflow!
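Conceptually, the cell you just ran passes the copied Artifact ID to a Kale serving helper. The import path and function name below are hypothetical placeholders; the notebook cell contains the real call:

# Hypothetical sketch; the real Kale serving API is shown in the notebook cell.
from kale.common import serveutils  # hypothetical import path

MODEL_ARTIFACT_ID = 255  # replace with the Artifact ID you copied from MLMD

# Creates a PVC that backs the model, configures a KF Serving inference server,
# and applies it to the cluster.
server = serveutils.serve_artifact(MODEL_ARTIFACT_ID)  # hypothetical function name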

View the inference server

Run the next cell to see where the model is served:

81d8eb508e630a60.png

Click on here to navigate to the Models UI:

b2ac1fbcbf45a012.png

This is the model you just served:

3e1f8ea6d05d015e.png

Run predictions against the model

You have successfully served the best-performing model from inside your notebook! Now, let's run some predictions against it.

Go back to the notebook and run this cell:

f427f948ac3f85bc.png
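Under the hood, the prediction cell sends a request to the inference server. As a hedged sketch, assuming the server follows the standard KF Serving v1 prediction protocol and an in-cluster hostname of the form shown below (the actual model name and namespace come from the inference server you created):

import requests

# Assumed values; use the name and namespace of your own inference server.
MODEL_NAME = "blue-book-bulldozers"
NAMESPACE = "kubeflow-user"
URL = f"http://{MODEL_NAME}.{NAMESPACE}.svc.cluster.local/v1/models/{MODEL_NAME}:predict"

# KF Serving v1 protocol: a JSON payload with an "instances" list of feature rows.
payload = {"instances": x_valid[:2].tolist()}  # rows from the data preparation cells

response = requests.post(URL, json=payload)
print(response.json())  # e.g. {"predictions": [...]}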

Congratulations! You have successfully created a KF Serving inference server that serves the best-performing model, and you have run predictions against it.

In this section, you will create a TensorBoard server to view the logs that the training of the model produced.

Create a TensorBoard Server

Besides training and saving models, Kale also produces TensorBoard reports for each and every pipeline. Kale versions and snapshots these reports using Rok and creates corresponding MLMD artifacts.

Let's go back to the KFP run that trained the best-performing model to view the ML Metadata tab of the sklearn-predictor step:

94669c88e8f7ed66.png

Scroll down to the Outputs to find the TensorboardLogs artifact:

e2676f6fc67bfe26.png

Copy its Artifact ID, as you are going to need it to start the TensorBoard server from inside your notebook.

Go back to the notebook and find this cell:

6cae45d78f1eb781.png

Replace the placeholder with the Artifact ID you just copied and run the cell:

21b423bae8939a5b.png

Kale is now creating a TensorBoard server backed by a Rok PVC. Wait a few minutes for the TensorBoard server to get up and running.

View the TensorBoard logs

Click on the Tensorboard server link to view the TensorBoard server you just created:

226afffc3ad460f9.png

Here are the logs that TensorBoard provides. We see the prediction error for the random forest regression algorithm:

7dc7ac71a8967075.png

Congratulations, you have successfully run an end-to-end AutoML workflow using Kubeflow (MiniKF), Kale, and Rok!

What's next?

Join the Kubeflow Community:

Further reading