Kubeflow is a machine learning toolkit for Kubernetes. The project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
A Kubeflow deployment is:
Kubeflow gives you the ability to organize loosely-coupled microservices as a single unit and deploy them to a variety of locations, including on a laptop, on-premises, or in the cloud.
This codelab walks you through creating your own Kubeflow deployment using MiniKF, then running a completely automated workflow to discover and train machine learning models starting from raw data. You do all that from inside a Jupyter Notebook.
In this codelab, you will create a mechanism to automate the discovery and training of Machine Learning models, starting from raw data. You will start from a dataset, and then call a Kale API to instantiate a process that searches for the most suitable model for the input data. Then, you will automatically perform hyperparameter optimization on this model. Upon completion, your infrastructure will contain:
This is an advanced codelab focused on Kubeflow. For more background and an introduction to the platform, see the Introduction to Kubeflow documentation. Non-relevant concepts and code blocks are glossed over and provided for you to simply copy and paste.
Make sure you have an AWS account and you are able to launch EC2 instances.
Follow the steps below to create a GCP project or configure your existing GCP project. If you plan to use an existing GCP project, make sure that the project meets the minimum requirements described below. The first step is to open the resource manager in the GCP Console.
Create a new project or select an existing project:
Check the following minimum requirements:
For more help with setting up a GCP project, see the GCP documentation.
After setting up your GCP project, go directly to the instructions for installing MiniKF.
To open your pre-allocated GCP project, click the button below to visit the GCP Console and open the Home panel, found in the hamburger menu at the top left. If the screen is empty, click on Yes at the prompt to create a dashboard.
If the project is not already selected, click Select a project:
Select your project. You should only have one:
Make sure you have installed Vagrant and VirtualBox.
To install MiniKF on AWS, follow this guide.
In the Google Cloud Marketplace, search for "MiniKF".
Select the MiniKF virtual machine by Arrikto:
Click the LAUNCH button and select your project:
In the Configure & Deploy window, choose a name and a zone for your MiniKF instance and leave the default options. Then click on the Deploy button:
Wait for the MiniKF Compute instance to boot up:
SSH to MiniKF
When the MiniKF VM is up, connect and log in by clicking on the SSH button. Follow the on-screen instructions and run the command minikf to watch the progress of the deployment of Minikube, Kubeflow, and Rok. This will take a few minutes to complete.
When installation is complete and all pods are ready, visit the MiniKF dashboard and log in using the MiniKF username and password:
Congratulations! You have successfully deployed MiniKF on GCP. You can now create notebooks, write your ML code, run Kubeflow Pipelines, and use Rok for data versioning and reproducibility.
To install MiniKF on your desktop/laptop using Vagrant, follow this guide.
In this section, you will run the Blue Book for Bulldozers example, a Kaggle competition. The goal is to predict the sale price of bulldozers sold at auctions.
Navigate to the Notebooks link on the Kubeflow central dashboard.
Click on New Server.
Specify a name for your notebook server.
Make sure you have selected the following Docker image.
Click on Launch to create the notebook server.
When the notebook server is available, click on Connect to connect to it.
A new tab will open up with the JupyterLab landing page. Create a new terminal in JupyterLab.
In the terminal window, run this command to download the notebook and the data that you will use for the remainder of the lab:
git clone https://github.com/kubeflow-kale/kale -b kubecon21eu
The cloned repository contains a series of curated examples with data and annotated notebooks.
In the sidebar, navigate to the folder
kale/examples/bulldozers-kaggle-competition/ and open the notebook
In this section, you will create an AutoML Workflow starting from your Notebook.
AutoML is a very broad concept; it includes several Machine Learning techniques, such as Hyperparameter Tuning (HP), Neural Architecture Search (NAS), and Meta-Learning. In this tutorial, when talking about AutoML, we refer to a combination of HP Tuning and Meta-Learning.
If you are not familiar with how Kale simplifies HP Tuning in Kubeflow, we also suggest this tutorial (https://arrik.to/democ2p). It is not a prerequisite for completing this tutorial, so just keep it in mind and come back to it later on.
Meta-Learning is a particular technique that focuses on "learning to learn". It usually refers to a system able to improve the learning of complex tasks by reusing previous experience. More concretely, given just a dataset and a task (e.g., classification or regression) as input, the Meta-Learning system employed by Kale suggests some model architectures that it thinks will perform well. This suggestion system is based on prior knowledge of a wealth of other datasets, and on how closely your input data relates to some of them.
Consequently, you are able to discover and test hundreds of candidate models without breaking a sweat!
This is a high-level overview of the components that a Kale AutoML experiment orchestrates.
It all starts from the notebook, where you call a Kale API providing a dataset and task. Then Kale takes over: you can sit back, relax, and watch the following process unfold as Kale:
Finally, once all of this is done, you can go back to your notebook and, with a single Kale API call, you can:
You can do this anytime, since all models and TensorBoard reports are immutable Rok snapshots.
Now that you have a high-level picture of the entire workflow, let's dive into the individual parts.
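The workflow described above can be sketched as plain Python. Note that every function name below is a hypothetical stand-in that mirrors the pipeline steps, not Kale's actual API:

```python
# Hypothetical sketch of the AutoML orchestration flow Kale runs for you.
# None of these names are real Kale APIs; they just mirror the pipeline steps.

def get_configurations(dataset, task, n=4):
    # auto-sklearn suggests n candidate model configurations (made-up values).
    return [{"model": f"model-{i}", "params": {"seed": i}} for i in range(n)]

def run_configuration(config):
    # Each configuration becomes its own KFP run; here we fake a metric value.
    return {"config": config, "metric": 0.5 / (1 + config["params"]["seed"])}

def automl(dataset, task, objective="minimize"):
    configs = get_configurations(dataset, task)        # sklearn-get-configurations
    results = [run_configuration(c) for c in configs]  # sklearn-run-configurations
    pick = min if objective == "minimize" else max     # get-best-configuration
    return pick(results, key=lambda r: r["metric"])    # then Katib tunes this one

best = automl("bulldozers", "regression")
print(best["config"]["model"])
```

The real orchestration pipeline runs each configuration as a separate KFP run and then hands the winner to Katib, as the following sections show.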
Install the necessary dependencies.
Normally, you would need to create a new Docker image that includes the newly installed libraries in order to run this notebook as a Kubeflow pipeline. Fortunately, Rok and Kale ensure that any libraries you install during development find their way into your pipeline, thanks to Rok's snapshotting technology and Kale mounting those snapshotted volumes into the pipeline steps.
Restart the notebook kernel by clicking on the Restart icon:
Run the imports cell to import all the necessary libraries.
Load the dataset:
Our target variable is
SalePrice. Let's keep it in a variable:
Encode the ordinal variables:
Unfold the dates to engineer more features:
Split the dataset into
Extract features and labels into
Group together the dataset using the Kale
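The two preprocessing ideas above can be illustrated with a minimal stdlib sketch (the notebook itself operates on the full dataset with its own helpers; the column names and values below are made up for the example):

```python
from datetime import date

# Ordinal encoding: map each ordered category to an integer rank.
usage_band_order = {"Low": 0, "Medium": 1, "High": 2}  # hypothetical ordinal column

def encode_ordinal(value):
    return usage_band_order[value]

# "Unfolding" a date: derive several features from a single timestamp.
def unfold_date(d):
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "dayofweek": d.weekday(),              # Monday == 0
        "dayofyear": d.timetuple().tm_yday,
    }

row = {"UsageBand": "High", "saledate": date(2011, 11, 16)}
features = {"UsageBand": encode_ordinal(row["UsageBand"]),
            **unfold_date(row["saledate"])}
print(features)
```

Unfolding dates like this gives the model access to seasonality signals (month, day of week) that a raw timestamp would hide.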
Now, we are ready to run our AutoML experiment using Kale.
Create a Katib configuration. This step will ensure that after finding the best configuration, Kale will start a Katib experiment to further optimize the model. In the next section, you will find more information about how we use Katib to perform hyperparameter optimization on the best-performing configuration.
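A Katib configuration is essentially a declarative description of the tuning objective and budget. The snippet below is a hedged sketch of what such a configuration can look like; the exact key names Kale expects may differ from this illustration:

```python
# Hypothetical Katib-style tuning specification; the real notebook cell may use
# different key names (this is an illustration, not Kale's exact schema).
katib_config = {
    "goal": "minimize",             # direction of the objective metric
    "objective_metric": "msle",     # mean squared logarithmic error
    "max_trial_count": 2,           # the trial budget used later in this codelab
    "parallel_trial_count": 1,
    "algorithm": "random",          # Katib search algorithm
}

print(sorted(katib_config))
```

Keeping the trial count low makes the codelab fast; in production you would raise it to explore a larger slice of the search space.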
Run the AutoML experiment:
Kale creates a Kubeflow Pipelines (KFP) experiment. Click on Experiment details to view it:
This will open a page like this one:
Kale creates a KFP pipeline that orchestrates the AutoML workflow. Click on Run details to view it:
This will open a page like this one:
You can monitor the experiment at any point in time by printing a summary of the AutoML task. This is how it looks when all the runs have completed.
In this section, we will reveal the magic behind the AutoML workflow that you created previously.
Let's take a look at the orchestration pipeline that Kale produced. During the sklearn-get-configurations step, Kale asks auto-sklearn to provide AutoML configurations. A configuration is an auto-sklearn "suggestion": a model architecture plus a combination of model parameters that auto-sklearn expects to perform well on the provided dataset.
In this example, Kale asks auto-sklearn to produce 4 different configurations. You defined this in a notebook cell previously, as shown below:
In a production environment, this number would be in the order of tens or hundreds. Kale would ask auto-sklearn to suggest many different configurations based on the provided task and dataset.
Note that some configurations may share the same model architecture but have different initialization parameters.
If you click the step and go to the ML Metadata tab, you will see the configurations that auto-sklearn produced:
Let's click on the second configuration:
We can see the model and the initialization parameters:
Auto-sklearn configurations are just an "empty" description of a model and its parameters. We need to actually instantiate these models and train them on our dataset.
Kale will start a KFP run for each configuration; that is, it will start 4 pipeline runs. By default, these KFP runs execute in parallel, so this can become a massive task for your cluster. Here, we have configured Kale to start just 2 runs simultaneously at any given point in time. We did this previously in the notebook:
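This throttling is the same idea as pushing tasks through a bounded worker pool. A stdlib sketch (the run names and timings are made up; Kale's real scheduler works on KFP runs, not threads):

```python
import concurrent.futures
import threading
import time

in_flight = 0   # how many fake "pipeline runs" are executing right now
peak = 0        # the highest concurrency we ever observed
lock = threading.Lock()

def run_pipeline(name):
    # Stand-in for one configuration's KFP run; tracks observed concurrency.
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    time.sleep(0.05)  # pretend the pipeline takes some time
    with lock:
        in_flight -= 1
    return name

# 4 configuration runs, but at most 2 executing at any given point in time.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    done = list(pool.map(run_pipeline, [f"run-{i}" for i in range(4)]))

print(done, "peak parallelism:", peak)
```

Bounding parallelism trades wall-clock time for a predictable load on the cluster, which matters when each run claims CPU, memory, and volumes.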
Kale produces these KFP runs during the sklearn-run-configurations step:
Notice the pipeline icon that this step has. It means that the step produces one or more KFP runs. If you click on the step and go to the ML Metadata tab, you will find the 4 runs that Kale produces during this step (Kale populates the metadata tab with these links at runtime, as it starts the runs):
These pipelines look identical; they just train a different model and/or use different initialization parameters. Let's click one of them to view the KFP run:
As you can see, the pipeline consists of an sklearn-transformer step that transforms the dataset, an sklearn-estimator step that trains the model, and an sklearn-predictor step that predicts over the test dataset.
During the pipeline runs, Kale logs MLMD (Machine Learning Metadata) artifacts in a persistent way. These artifacts are backed by Rok snapshots, so they are versioned and persistent, regardless of the pipelines' lifecycle. This way, you have complete visibility into the inputs and outputs of the pipeline steps. Most importantly, you can always be aware of the dataset, the model, and the parameters you used for the training process.
If you click on the sklearn-transformer step, and you go to the ML Metadata tab, you can see the artifact of a Rok snapshot that contains the input dataset:
You can also see the KaleTransformer, the object that transforms the data:
If you go to the ML Metadata tab of the next step, called sklearn-estimator, you can see the artifact of the trained model.
If you go to the Run output of this pipeline, you can view the pipeline metrics. Here we have one: the mean squared logarithmic error. This metric shows how well this particular auto-sklearn configuration performed on the dataset.
Now, let's go back to the orchestration pipeline and go through the next steps. The monitor-kfp-runs step waits for all the pipelines to complete, that is, the 4 KFP runs with different configurations.
When the runs succeed, Kale picks the one with the best performance. Based on the optimization metric you provided in the notebook, Kale knows whether to look for the highest or the lowest KFP metric value. This happens in the get-best-configuration step:
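Picking the best run boils down to taking the min or max of the collected metric values, depending on the objective direction. A minimal sketch with made-up run names and metrics:

```python
def best_run(runs, objective="minimize"):
    # Each run carries the KFP metric value collected from its pipeline.
    pick = min if objective == "minimize" else max
    return pick(runs, key=lambda r: r["metric"])

runs = [
    {"run": "config-0", "metric": 0.41},
    {"run": "config-1", "metric": 0.32},  # lowest error: best when minimizing
    {"run": "config-2", "metric": 0.55},
    {"run": "config-3", "metric": 0.47},
]

print(best_run(runs)["run"])  # minimizing an error metric like msle
```

With an error metric such as the mean squared logarithmic error, "best" means lowest; for an accuracy-style metric it would mean highest.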
Kale has found the best configuration with the help of auto-sklearn. Note that this was a "Meta-Learning" suggestion, so no one can stop us from trying to squeeze something more out of this model architecture. We want Kale to run an HP Tuning experiment, using Katib, to further explore different sets of initialization parameters, starting from this configuration's parameters. Kale takes care of creating a suitable search space around these initialization parameters.
You have already configured the Katib experiment in your notebook. As in the configurations case, the more trials you run, the more chances you have to improve the original result. Here we run no more than 2 trials, as you can see in the code snippet below:
Notice that the run-katib-experiment pipeline step has the Katib logo as an icon. This means that in this step Kale produces a Katib experiment:
If you click this step and go to the ML Metadata tab, you will find the link to the corresponding Katib experiment:
Click on the link to go to the Katib UI and view the experiment:
Now, go to the TRIALS tab to view the 2 different trials and how they performed:
Notice that the best-performing trial is highlighted.
Click on the pipeline icon on the right to view the KFP run that corresponds to this Katib trial:
Kale implements a shim to have the Trials actually run pipelines in Kubeflow Pipelines, and then collect the metrics from the pipeline runs. This is the KFP run of the Katib trial that performed best:
Notice that this pipeline is exactly the same as the configuration runs we described previously. This is expected, as we are running the same pipeline and changing the parameters to perform HP tuning. However, this KFP run has different run parameters from the configuration run. If you go to the Config tab, you will see the run parameters of this pipeline. Katib was the one that selected these specific parameters.
In this section, you will serve the best-performing model using Kale and KF Serving.
Now that we have found the best configuration and performed hyperparameter optimization on it, we should have an easy way to serve the corresponding trained model. But how can we find it?
As we described previously, Kale logs MLMD artifacts in a persistent way using Rok snapshots. This means that we have a snapshot of the best-performing model, as well as a snapshot of all the other trained models.
To find the best-performing model and serve it, click on the sklearn-estimator step and go to the ML Metadata tab:
Look at the Outputs and you will see the model. Notice that it has a unique Artifact ID. In our case it is 255, but it will probably be different in yours:
Copy this Artifact ID, as we are going to need it to serve the model.
Let's go back to the Notebook and find this cell:
Replace the placeholder with the Artifact ID you just copied and run the cell:
This will trigger Kale to create a new PVC that backs this model, configure and create a new KF Serving inference server, and apply it. In just a few seconds you will have a model running in Kubeflow!
Run the next cell to see where the model is served:
Click on here to navigate to the Models UI:
This is the model you just served:
You have successfully served the best-performing model from inside your notebook! Now, let's run some predictions against it.
Go back to the notebook and run this cell:
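Under the hood, a KF Serving predictor accepts the standard "instances" JSON payload. Building such a request looks roughly like this (the feature row and the commented-out URL are placeholders, not values from this codelab):

```python
import json

# The v1 prediction protocol wraps input rows in an "instances" list.
payload = {"instances": [[5.0, 2011, 11, 16, 2]]}  # made-up feature row

body = json.dumps(payload)
print(body)

# Posting it would look like this (not executed here; the host is a placeholder):
# requests.post("http://<inference-service>/v1/models/<model>:predict", data=body)
```

The response carries a matching "predictions" list with one entry per input row.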
Congratulations! You have successfully created a KF Serving inference server that serves the best-performing model and run predictions against it.
In this section, you will create a TensorBoard server to view the logs that the training of the model produced.
Besides training and saving models, Kale also produced TensorBoard reports for each and every pipeline. Kale versions and snapshots these reports using Rok and creates corresponding MLMD artifacts.
Let's go back to the KFP run that trained the best-performing model to view the ML Metadata tab of the sklearn-predictor step:
Scroll down to the Outputs to find the TensorboardLogs artifact:
Copy its Artifact ID, as you are going to need it to start the TensorBoard server from inside your notebook.
Go back to the notebook and find this cell:
Replace the placeholder with the Artifact ID you just copied and run the cell:
Wait for a few minutes for the TensorBoard Server to get up and running. Kale is now creating a TensorBoard server that is backed by a Rok PVC.
Click on Tensorboard server to view the Tensorboard server you just created:
Here are the logs that TensorBoard provides. We see the prediction error for the random forest regression algorithm:
Congratulations, you have successfully run an end-to-end AutoML workflow using Kubeflow (MiniKF), Kale, and Rok!
Join the Kubeflow Community: