Building a Simple Tabular Model
Learn how to use pipelines to transform data, build a simple neural network to make predictions, run hyperparameter tuning, log experiments, and explore your data.
Santiago
Machine Learning. I run https://t.co/iZifcK7n47 and write @0xbnomial.
95% of Machine Learning in the industry uses tabular data, yet most people can't stop talking about ChatGPT.
— Santiago (@svpino) March 24, 2023
Let's do something different:
Let's build a simple tabular model together.
Follow along, and you will have accomplished something by the end of it.
This post will show you a few things:
1. How to use pipelines to transform data.
2. How to build a simple neural network to make predictions.
3. How to run hyperparameter tuning.
4. How to log experiments and explore your data.
We'll use the Penguins dataset.
It's a simple problem.
Our goal is to determine the species of a penguin based on the following:
• Culmen length
• Culmen depth
• Flipper length
• Body mass
Here is a Google Colab with all of the code: https://t.co/Vsi0SNptlq
You need the following to run the code:
• keras_tuner: A library that will help us find hyperparameters for our model.
• comet_ml: A library to log our experiments and explore the data.
You can create a free @Cometml account here and grab an API key: https://t.co/Mq8LWBMy0p
The first step is to import every library we need.
More importantly, I initialize @Cometml and set up an experiment. You'll need your API_KEY to move on from here. @Cometml will log everything we do into this experiment.
The second step is to download the data.
I'm using a version of the Penguins dataset that I found online.
I'm also logging the data as part of the Experiment. This will let us look at it without having to write any code.
If you go into @Cometml, you can configure a Data Panel and pick the "penguins.csv" we just logged.
This will let you see the entire dataset. Reorganize, sort, and filter columns visually without writing any code.
Let's get back to the code.
Time to configure a pipeline to transform the dataset.
Pipelines are one of the most important concepts you should learn in Machine Learning. They are bundled steps that get applied to your data.
3 advantages:
• Cleaner code
• More robust
• Easier to deploy in production
Here is what happens in our pipeline:
Numerical values:
1. Impute missing values using the mean of the column.
2. Scale every value.
Categorical values:
1. Impute missing values using the most frequent.
2. Encode them using one-hot encoding.
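Here is a minimal sketch of how a pipeline like this could be written with Scikit-Learn. The column names, the choice of StandardScaler for the scaling step, and the tiny sample frame are assumptions for illustration. They are not taken from the Colab.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names; the dataset you download may use different ones.
numeric_columns = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
categorical_columns = ["island"]

# Numerical values: impute with the column mean, then scale.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),  # assumption: standardization as the scaling step
])

# Categorical values: impute with the most frequent value, then one-hot encode.
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own pipeline; sparse_threshold=0 keeps the output dense.
preprocessor = ColumnTransformer(
    [
        ("numeric", numeric_pipeline, numeric_columns),
        ("categorical", categorical_pipeline, categorical_columns),
    ],
    sparse_threshold=0,
)

# Made-up rows with one missing numeric value and one missing category.
sample = pd.DataFrame({
    "culmen_length_mm": [39.1, 46.5, np.nan, 50.0],
    "culmen_depth_mm": [18.7, 17.9, 15.2, 19.5],
    "flipper_length_mm": [181.0, 192.0, 217.0, 196.0],
    "body_mass_g": [3750.0, 3500.0, 5200.0, 3900.0],
    "island": ["Torgersen", "Biscoe", np.nan, "Biscoe"],
})

transformed = preprocessor.fit_transform(sample)
print(transformed.shape)
```

The result has four scaled numeric columns plus one column per island category, with no missing values left.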
Step 4 is to split the dataset and run the pipeline to transform the data.
Notice that we split the data, then apply the transformation to every set individually.
This is critical to avoid data leakage.
At the end of this step, our data is ready to go.
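A common way to split first and transform second is to fit the transformation on the training set only, then reuse it on the test set. The sketch below shows that pattern on synthetic data. The random data, the 80/20 split, and the single StandardScaler standing in for the full pipeline are assumptions, and the Colab may do this differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four numeric features and three species labels.
rng = np.random.default_rng(42)
X = rng.normal(loc=40.0, scale=5.0, size=(100, 4))
y = rng.integers(0, 3, size=100)

# Split first, so the test set never influences the transformation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Learn the scaling statistics from the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the already-fitted transformation to the test set.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

If the scaler were fitted on the full dataset before splitting, the mean and variance of the test samples would leak into the training features.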
Let's now create a simple neural network to make predictions.
Something important:
There are many ways to solve a problem, but when using tabular data, 99% of the time, you want to look into gradient-boosting algorithms.
But here, I'm using a neural network.
We want to tune the hyperparameters of the model using the Keras Tuner.
That's why you see that instead of using values directly, I'm using the following:
• hp.Int()
• hp.Choice()
This is how I specify the hyperparameters I want to tune.
We have our data ready. We have a model ready.
Time to find good values for the hyperparameters.
Step 6 is about running a tuner to find the best values for the neural network.
When it finishes, the tuner will retain the best model.
Finally, we can use the best model to predict the samples on the test set.
I got 98.5% accuracy.
The final few lines create a data frame with my test data and the predictions.
I then use the experiment to log the results and show the panels.
In @Cometml, I added a new Data Panel to display the results.csv I just logged.
I can now look through the data without having to deal with code.
For example, we can filter the data to see the model's mistakes.
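The same filter is easy to reproduce in code if you prefer. This is a small sketch with pandas. The rows and the column names "species" and "prediction" are invented for illustration and are not the actual results.

```python
import pandas as pd

# Invented rows standing in for the logged results; not real model output.
results = pd.DataFrame({
    "culmen_length_mm": [39.1, 46.5, 49.3, 38.6],
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "prediction": ["Adelie", "Gentoo", "Gentoo", "Adelie"],
})

# Keep only the rows where the prediction disagrees with the true species.
mistakes = results[results["species"] != results["prediction"]]

# Accuracy is the fraction of rows where both columns agree.
accuracy = (results["species"] == results["prediction"]).mean()

print(mistakes)
print(accuracy)
```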
Here is what you need to do next:
1. Create a free @Cometml account: https://t.co/Mq8LWBMy0p
2. Run the code. Change it. Experiment with it. https://t.co/Vsi0SNptlq
3. Read about Scikit-Learn Pipelines and the Keras Tuner.
4. Follow @svpino. More of this is coming!