Mar 24, 2023

Building a Simple Tabular Model

Learn how to use pipelines to transform data, build a simple neural network to make predictions, run hyperparameter tuning, log experiments, and explore your data.

AI MACHINE_LEARNING DATA

Santiago

Machine Learning. I run https://t.co/iZifcK7n47 and write @0xbnomial.

Member of Software Developers

95% of Machine Learning in the industry uses tabular data, yet most people can't stop talking about ChatGPT.

Let's do something different:

Let's build a simple tabular model together.

Follow this, and you will have accomplished something by the end of it:
— Santiago (@svpino) March 24, 2023
This post will show you a few things:

1. How to use pipelines to transform data.
2. How to build a simple neural network to make predictions.
3. How to run hyperparameter tuning.
4. How to log experiments and explore your data.

We'll use the Penguins dataset.

It's a simple problem.

Our goal is to determine the species of a penguin based on the following:

• Culmen length
• Culmen depth
• Flipper length
• Body mass

Here is a Google Colab with all of the code: https://t.co/Vsi0SNptlq

You need the following to run the code:

• keras_tuner: A library that will help us find hyperparameters for our model.

• comet_ml: A library to log our experiments and explore the data.

You can create a free @Cometml account here and grab an API key: https://t.co/Mq8LWBMy0p

The first step is to import every library we need.

More importantly, I initialize @Cometml and set up an experiment. You'll need your API_KEY to move on from here. @Cometml will log everything we do into this experiment.

The second step is to download the data.

I'm using a version of the Penguins dataset that I found online.

I'm also logging the data as part of the Experiment. This will let us look at it without having to write any code.
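As a rough sketch of what initializing an experiment and logging the dataset might look like: the `COMET_API_KEY` environment variable, the project name, and both helper functions are assumptions made for illustration, not names taken from the Colab. `Experiment` and `log_table` are part of the comet_ml SDK.

```python
import os


def create_experiment(project_name="penguins"):
    """Start a Comet experiment (needs network access and a valid API key)."""
    # Imported lazily so the rest of this sketch runs without comet_ml installed.
    from comet_ml import Experiment

    return Experiment(
        api_key=os.environ["COMET_API_KEY"],  # assumed environment variable
        project_name=project_name,            # hypothetical project name
    )


def log_dataset(experiment, table, filename="penguins.csv"):
    """Attach a table (a pandas DataFrame or a list of rows) to the experiment."""
    experiment.log_table(filename, tabular_data=table)
```

Calling `log_dataset(create_experiment(), df)` with the loaded DataFrame would then make the table available under the name "penguins.csv".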
If you go into @Cometml, you can configure a Data Panel and pick the "penguins.csv" we just logged.

This will let you see the entire dataset. Reorganize, sort, and filter columns visually without writing any code.

Let's get back to the code.

Time to configure a pipeline to transform the dataset.

Pipelines are one of the most important concepts you should learn in Machine Learning. A pipeline bundles a sequence of transformation steps and applies them to your data in order.

3 advantages:

• Cleaner code
• More robust
• Easier to deploy in production

Here is what happens in our pipeline:

Numerical values:

1. Impute missing values using the mean of the column.
2. Scale every value.

Categorical values:

1. Impute missing values using the most frequent.
2. Encode them using one-hot encoding.
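The two branches described above could be sketched with Scikit-Learn roughly as follows. The column names, the choice of `StandardScaler` for the scaling step, and the tiny sample frame are assumptions for illustration; the Colab may use different ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; adjust them to match the CSV you load.
NUMERIC = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]
CATEGORICAL = ["island", "sex"]


def build_preprocessor(numeric=NUMERIC, categorical=CATEGORICAL):
    # Numerical columns: fill gaps with the column mean, then scale.
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),  # assumed scaler
    ])
    # Categorical columns: fill gaps with the most frequent value, then one-hot encode.
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(
        [("num", numeric_steps, numeric), ("cat", categorical_steps, categorical)],
        sparse_threshold=0,  # always return a dense array
    )


# A tiny made-up sample with missing values, only to show the pipeline running.
sample = pd.DataFrame({
    "culmen_length_mm": [39.1, np.nan, 46.5, 50.0],
    "culmen_depth_mm": [18.7, 17.4, np.nan, 15.0],
    "flipper_length_mm": [181.0, 186.0, 210.0, np.nan],
    "body_mass_g": [3750.0, 3800.0, 4500.0, 5000.0],
    "island": ["Torgersen", "Biscoe", np.nan, "Biscoe"],
    "sex": ["MALE", "FEMALE", "FEMALE", np.nan],
})
transformed = build_preprocessor().fit_transform(sample)
```

The result has four scaled numeric columns followed by one column per category seen in `island` and `sex`.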
Step 4 is to split the dataset and run the pipeline to transform the data.

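Notice that we split the data first, and only then apply the pipeline to each set separately. The pipeline should learn its statistics (column means, scaling factors, categories) from the training set alone.

This is critical to avoid leaking information from the held-out data into training.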

At the end of this step, our data is ready to go.
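One common way to keep the split leak-free is to fit the pipeline on the training set only and reuse the fitted pipeline on the other sets. The sketch below uses synthetic numbers and a 70/15/15 split, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin features and labels.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
y = rng.integers(0, 3, size=100)

# Split first: 70% train, then the remainder evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

pipeline = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Learn means and scales from the training set only...
X_train_t = pipeline.fit_transform(X_train)
# ...and apply those same learned values to the other sets.
X_val_t = pipeline.transform(X_val)
X_test_t = pipeline.transform(X_test)
```

Calling `fit_transform` on the validation or test set instead would let their statistics influence the transformation, which is the leakage the split is meant to prevent.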
Let's now create a simple neural network to make predictions.

Something important:

There are many ways to solve a problem, but when using tabular data, 99% of the time, you want to look into gradient-boosting algorithms.

But here, I'm using a neural network.

We want to tune the hyperparameters of the model using the Keras Tuner.

That's why you see that instead of using values directly, I'm using the following:

• hp.Int()
• hp.Choice()

This is how I specify the hyperparameters I want to tune.
We have our data ready. We have a model ready.

Time to find good values for the hyperparameters.

Step 6 is about running a tuner to find the best values for the neural network.

When it finishes, the tuner will retain the best model.
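Here is a minimal sketch of a tuning run. It assumes a random search over a single hyperparameter and uses random stand-in data; the thread does not say which tuner class or search settings the Colab uses, so treat every value below as a placeholder.

```python
import tempfile

import keras_tuner as kt
import numpy as np
from tensorflow import keras

N_FEATURES, N_CLASSES = 8, 3  # hypothetical input width; three species


def build_model(hp):
    model = keras.Sequential([
        keras.Input(shape=(N_FEATURES,)),
        keras.layers.Dense(
            hp.Int("units", min_value=4, max_value=16, step=4),
            activation="relu",
        ),
        keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Random stand-in data so the sketch runs on its own.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, N_FEATURES)).astype("float32")
y_train = rng.integers(0, N_CLASSES, size=80)
X_val = rng.normal(size=(20, N_FEATURES)).astype("float32")
y_val = rng.integers(0, N_CLASSES, size=20)

tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=2,                    # kept tiny on purpose
    overwrite=True,
    directory=tempfile.mkdtemp(),    # where trial results are stored
    project_name="penguins_tuning",  # hypothetical name
)
tuner.search(
    X_train, y_train, validation_data=(X_val, y_val), epochs=3, verbose=0
)

# The tuner keeps the best trial so it can be retrieved afterwards.
best_model = tuner.get_best_models(num_models=1)[0]
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
```

With real data you would raise `max_trials` and `epochs`; the values here only keep the run short.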
Finally, we can use the best model to predict the samples on the test set.

I got 98.5% accuracy.

The final few lines create a data frame with my test data and the predictions.

I then use the experiment to log the results and show the panels.
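Building the results table could look roughly like this. The helper name, the column names, the class order, and the made-up probabilities are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed label order; it must match how the species were encoded.
CLASS_NAMES = ["Adelie", "Chinstrap", "Gentoo"]


def build_results(features, y_true, probabilities, class_names=CLASS_NAMES):
    """Combine test features, true labels, and predictions into one table."""
    predicted = np.argmax(probabilities, axis=1)  # most likely class per row
    results = features.reset_index(drop=True).copy()
    results["actual"] = [class_names[i] for i in y_true]
    results["predicted"] = [class_names[i] for i in predicted]
    results["correct"] = results["actual"] == results["predicted"]
    return results


# Made-up values standing in for the test set and the model's output.
features = pd.DataFrame({"body_mass_g": [3750.0, 5000.0, 3800.0]})
y_true = np.array([0, 2, 1])
probabilities = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
])

results = build_results(features, y_true, probabilities)
accuracy = results["correct"].mean()
mistakes = results[~results["correct"]]  # rows the model got wrong
```

With a real model, `probabilities` would come from `best_model.predict(...)` on the transformed test set, and a call such as `experiment.log_table("results.csv", tabular_data=results)` would attach the table to the experiment.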
In @Cometml, I added a new Data Panel to display the results.csv I just logged.

I can now look through the data without having to deal with code.

For example, we can filter the data to see the model's mistakes.

Here is what you need to do next:

1. Create a free @Cometml account: https://t.co/Mq8LWBMy0p

2. Run the code. Change it. Experiment with it. https://t.co/Vsi0SNptlq

3. Read about Scikit-Learn Pipelines and the Keras Tuner.

4. Follow @svpino. More of this is coming!