Be Careful With Random Splits When Training Machine Learning Models
When training machine learning models, it is important to be careful with random splits if your dataset has any grouping between samples. Learn why random splits won't work and how to use your data to your advantage.
Santiago
Machine Learning. I run https://t.co/iZifcK7n47 and write @0xbnomial.
Machine Learning models are sneaky little bastards that use any available shortcuts to optimize their evaluation metric.
— Santiago (@svpino) March 19, 2023
Every tutorial speaks about splitting your data randomly.
But you must be careful with this:
You don't want to use random splits if your dataset has any grouping between samples.
For example, imagine you have a dataset with pictures of shoes and want to identify their brand.
You collected 10,000 pictures from 1,000 pairs of shoes.
Here is why a random split won't work:
1,000 pairs of shoes and 10,000 pictures. This means that you have many pictures of the same shoes.
Splitting the data randomly between a train and a validation set will likely send images from the same shoe to both splits.
That's a big problem.
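To see how likely this is, here is a quick simulation. It is only a sketch: it assumes the 10,000 pictures are spread evenly (10 per pair) and uses an 80/20 split, neither of which comes from the original example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

NUM_PAIRS = 1_000
PICTURES_PER_PAIR = 10  # assumption: pictures are spread evenly across pairs

# Each picture is represented only by the ID of the pair of shoes it shows.
picture_pair_ids = [
    pair_id for pair_id in range(NUM_PAIRS) for _ in range(PICTURES_PER_PAIR)
]

# Plain random split: shuffle the pictures, then hold out 20% for validation.
shuffled = picture_pair_ids[:]
random.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train_pairs = set(shuffled[:cut])
val_pairs = set(shuffled[cut:])

# Pairs of shoes that ended up with pictures on both sides of the split.
leaked_pairs = train_pairs & val_pairs
leak_ratio = len(leaked_pairs) / len(val_pairs)
print(f"{len(leaked_pairs)} of {len(val_pairs)} validation pairs also appear in training")
```

With 10 pictures per pair, it is very unlikely that all of them land on the same side of the split, so almost every pair in the validation set also shows up in the training set.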
Randomly splitting the dataset will give the model exactly what it's looking for:
A shortcut.
Imagine one of the shoes is worn out on the left side. The model might use this to conclude that images of the same shoe in the validation set belong to the same brand.
That's bad.
Here we have a model using the wrong signal to make the correct conclusion.
This will lead to an inflated validation score. (Too-good-to-be-true results.)
Here you have a leaky validation strategy.
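Here is a toy illustration of that inflated score. Everything in it is hypothetical: the brands are assigned at random, the pair ID stands in for a shoe-specific quirk like the worn-out left side, and the "model" is just a lookup table that memorizes which quirk went with which brand. It is a sketch of the failure mode, not a real classifier.

```python
import random

rng = random.Random(0)
NUM_PAIRS, PICTURES_PER_PAIR, NUM_BRANDS = 1_000, 10, 5  # made-up sizes

# Each picture is (pair_id, brand). The pair_id plays the role of a quirk that
# identifies one specific shoe and says nothing about brands in general.
brand_of_pair = [rng.randrange(NUM_BRANDS) for _ in range(NUM_PAIRS)]
pictures = [
    (pair_id, brand_of_pair[pair_id])
    for pair_id in range(NUM_PAIRS)
    for _ in range(PICTURES_PER_PAIR)
]

def memorizer_accuracy(train, val):
    """Score a 'model' that only memorizes quirk -> brand from the training set."""
    seen = dict(train)
    correct = 0
    for pair_id, brand in val:
        if pair_id in seen:
            guess = seen[pair_id]  # the shortcut: recognize the exact same shoe
        else:
            guess = rng.randrange(NUM_BRANDS)  # never saw this shoe: random guess
        correct += guess == brand
    return correct / len(val)

# Random split over pictures: the same shoe lands on both sides.
shuffled = pictures[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
random_split_acc = memorizer_accuracy(shuffled[:cut], shuffled[cut:])

# Split by pair: the last 20% of pairs are held out entirely.
held_out = set(range(int(0.8 * NUM_PAIRS), NUM_PAIRS))
train = [pic for pic in pictures if pic[0] not in held_out]
val = [pic for pic in pictures if pic[0] in held_out]
group_split_acc = memorizer_accuracy(train, val)

print(f"random split: {random_split_acc:.2f}, split by pair: {group_split_acc:.2f}")
```

The memorizer has learned nothing about brands, yet the random split scores it near perfect. Holding out whole pairs drops it to roughly chance, which is the honest number.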
Bottom line:
A dataset with correlation or groupings between individual samples is not a good candidate for random splits.
If information leaks from the training data into the validation data, your validation score will look much better than it should.
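One way to avoid the leak is to split by group instead of by sample, so every picture of a given pair of shoes lands on the same side. Below is a minimal sketch using only the standard library. The function name `group_split` and its parameters are mine, not part of any library. If you use scikit-learn, `GroupShuffleSplit` and `GroupKFold` in `sklearn.model_selection` cover the same idea.

```python
import random

def group_split(group_ids, val_fraction=0.2, seed=0):
    """Split sample indices so that every group lands entirely in one split."""
    unique_groups = sorted(set(group_ids))
    random.Random(seed).shuffle(unique_groups)

    # Hold out a fraction of the groups, not a fraction of the samples.
    num_val_groups = max(1, round(val_fraction * len(unique_groups)))
    val_groups = set(unique_groups[:num_val_groups])

    train_idx = [i for i, g in enumerate(group_ids) if g not in val_groups]
    val_idx = [i for i, g in enumerate(group_ids) if g in val_groups]
    return train_idx, val_idx

# Example: 1,000 pairs of shoes, assuming 10 pictures of each pair.
group_ids = [pair_id for pair_id in range(1_000) for _ in range(10)]
train_idx, val_idx = group_split(group_ids)

train_groups = {group_ids[i] for i in train_idx}
val_groups = {group_ids[i] for i in val_idx}
print(train_groups.isdisjoint(val_groups))  # no pair of shoes is in both splits
```

Note that the validation set is only approximately `val_fraction` of the samples when groups have different sizes, because the split counts groups rather than pictures.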
One of Andrew Ng's papers made this mistake.
I wrote about this in last Friday's newsletter: https://t.co/6qoDHK6bs5
For more stories like this, follow @svpino and make sure you subscribe to the newsletter!