Mar 19, 2023

Be Careful With Random Splits When Training Machine Learning Models

When training machine learning models, it is important to be careful with random splits if your dataset has any grouping between samples. Learn why random splits won't work and how to use your data to your advantage.

Santiago

Machine Learning. I run https://t.co/iZifcK7n47 and write @0xbnomial.

Machine Learning models are sneaky little bastards that use any available shortcuts to optimize their evaluation metric.

Every tutorial speaks about splitting your data randomly.

But you must be careful with this:
— Santiago (@svpino) March 19, 2023
You don't want to use random splits if your dataset has any grouping between samples.

For example, imagine you have a dataset with pictures of shoes and want to identify their brand.

You collected 10,000 pictures of 1,000 pairs of shoes.

Here is why a random split won't work:

With 10,000 pictures of 1,000 pairs, you have about ten pictures of each pair of shoes.

Splitting the data randomly between a train and a validation set will likely send images of the same shoe to both splits.
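To get a feel for how often this happens, here is a minimal sketch that counts the overlap. It assumes scikit-learn is available and uses a made-up `pair_ids` array (one pair ID per picture, ten pictures per pair) in place of a real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,000 pairs of shoes with 10 pictures each.
n_pairs, pics_per_pair = 1_000, 10
pair_ids = np.repeat(np.arange(n_pairs), pics_per_pair)  # one pair ID per picture

# A plain random split over pictures, ignoring which pair each one shows.
train_ids, val_ids = train_test_split(pair_ids, test_size=0.2, random_state=0)

# Pairs that have pictures on both sides of the split.
shared_pairs = np.intersect1d(train_ids, val_ids)
val_pairs = np.unique(val_ids)
print(f"{len(shared_pairs)} of {len(val_pairs)} validation pairs also appear in training")
```

With ten pictures per pair and an 80/20 split, a pair stays out of the training set only if all ten of its pictures land in validation, so almost every validation pair should show up in training as well.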

That's a big problem.

Randomly splitting the dataset will give the model exactly what it's looking for:

A shortcut.

Imagine one of the shoes is worn out on the left side. The model can memorize that wear pattern from the training images and use it to recognize the same shoe in the validation set, getting the brand right without learning anything about the brand itself.

That's bad.

Here we have a model using the wrong signal to reach the correct conclusion.

This will lead to an inflated validation score. (Too-good-to-be-true results.)
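The effect is easy to reproduce on synthetic data. The sketch below is only an illustration under made-up assumptions: every pair gets a random feature vector standing in for its "look", brands are assigned at random (so there is nothing real to learn), and a 1-nearest-neighbor classifier plays the role of a model that memorizes individual shoes. All names and sizes are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the shoe dataset (all sizes are assumptions).
n_pairs, pics_per_pair, n_features, n_brands = 100, 10, 16, 5
pair_ids = np.repeat(np.arange(n_pairs), pics_per_pair)

# Each pair gets its own "look" (think: the worn-out left side), and each
# picture is that look plus a little noise. Brands are assigned at random,
# so the features carry no real information about the brand.
pair_look = rng.normal(size=(n_pairs, n_features))
X = pair_look[pair_ids] + 0.1 * rng.normal(size=(len(pair_ids), n_features))
y = rng.integers(n_brands, size=n_pairs)[pair_ids]


def knn_accuracy(train_mask):
    """Fit 1-nearest-neighbor on the masked rows and score on the rest."""
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(X[train_mask], y[train_mask])
    return model.score(X[~train_mask], y[~train_mask])


# Leaky split: each picture goes to train or validation independently.
random_mask = rng.random(len(pair_ids)) < 0.8

# Clean split: whole pairs go to train or validation together.
grouped_mask = (rng.random(n_pairs) < 0.8)[pair_ids]

random_acc = knn_accuracy(random_mask)
grouped_acc = knn_accuracy(grouped_mask)
print(f"random split accuracy:  {random_acc:.2f}")
print(f"grouped split accuracy: {grouped_acc:.2f}")
```

Because the brands are random, an honest score should sit near chance (about 1 in 5 here). The random split should report a far higher number, since the classifier only has to find another picture of the same pair in the training set.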

Here you have a leaky validation strategy.

Bottom line:

A dataset with correlation or groupings between individual samples is not a good candidate for random splits.
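One common alternative is to split by group, so that every picture of a given pair lands on the same side. The thread does not name a specific tool; this sketch assumes scikit-learn's `GroupShuffleSplit`, with a hypothetical `pair_ids` array and placeholder features.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical dataset: 1,000 pairs of shoes with 10 pictures each.
n_pairs, pics_per_pair = 1_000, 10
pair_ids = np.repeat(np.arange(n_pairs), pics_per_pair)
X = np.zeros((len(pair_ids), 1))  # placeholder features, one row per picture

# test_size applies to groups (pairs), not to individual pictures.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, groups=pair_ids))

# No pair should appear on both sides of the split.
overlap = np.intersect1d(pair_ids[train_idx], pair_ids[val_idx])
print(f"pairs shared between train and validation: {len(overlap)}")
```

For cross-validation, scikit-learn's `GroupKFold` applies the same idea across folds.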

If information leaks from the training data into the validation data, your validation score will look much better than it should be.

One of Andrew Ng's papers made this mistake.

I wrote about this in last Friday's newsletter: https://t.co/6qoDHK6bs5

For more stories like this, follow @svpino and make sure you subscribe to the newsletter!