Jun 13, 2023

How to Train Machine Learning Models with Too Many Images

Training machine learning models when you have too many images is a hard problem. This article looks at how to use better data instead of more data, and why real-life problems are usually the opposite of what is taught in books.

AI MACHINE_LEARNING DATA

Santiago

Machine Learning. I run https://t.co/iZifcK7n47 and write @0xbnomial.

Real-life problems are usually the opposite of what you learn in books.

Toy datasets are always small, but many companies deal with a lot of data, and most engineers have no clue what to do with that.

Here is one of the most exciting developments in the field:
— Santiago (@svpino) June 13, 2023

One of the biggest problems in the industry is how to train machine learning models when you have too many images.

And you can't just use all of them:

• Too much time
• Too much money

If you want to get a raise, you need to figure out how to use better data, not more of it.

Despite what you learned in school, the solution is not to take a random portion of the images to train your model.

There's a better way:

Using embeddings, we can compare visual similarities between images and build a better dataset.
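To make "comparing visual similarities" concrete, here is a minimal Python sketch. It assumes each image has already been turned into an embedding vector by some pretrained vision model (the thread doesn't say which one), and it uses cosine similarity, which is a common choice but not one the thread confirms. The function names and the tiny 3-dimensional vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

def most_similar(query, embeddings):
    """Return (index, similarity) of the stored embedding closest to `query`."""
    scores = [cosine_similarity(query, e) for e in embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical embeddings; real ones usually have hundreds of dimensions.
embeddings = [
    [0.9, 0.1, 0.0],  # visually close to the query below
    [0.0, 0.0, 1.0],  # visually unrelated
]
index, score = most_similar([1.0, 0.05, 0.0], embeddings)
```

Images whose embeddings score close to 1.0 look alike to the model, so a large pile of near-duplicates adds little new information to a training set.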

Here is one example:

The team @superb_hq ran an experiment using the LOCO dataset:

They trained a model using 1,000 random samples and compared it with a model trained on 1,000 curated samples.

The F1-Score of the latter was 14.5% better on average across classes!
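The thread doesn't spell out how the score was averaged. One plausible reading of "on average across classes" is a macro-averaged F1: compute F1 for each class separately, then take the plain mean. The sketch below shows that calculation for the simplified case of one label per sample; the function names and the toy labels are assumptions, not the experiment's actual evaluation code.

```python
def per_class_f1(y_true, y_pred):
    """F1 score for every class that appears in the labels or predictions."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never shows up.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0

# Toy labels, purely illustrative.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)
```

Because every class counts equally in a macro average, a model that does better on rare classes can gain a lot here even if its overall accuracy barely moves.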

Here is a comparison between the random and the curated data.

Notice how the random data has a similar distribution to the overall dataset, but the curated data looks very different:

• It undersamples the majority classes
• It oversamples the minority classes
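If you want to check this on your own data, a quick way is to compare each class's share in the subset with its share in the full dataset. This is only a sketch: the helper names and the "majority"/"minority" labels are hypothetical, and the numbers are not from the experiment.

```python
from collections import Counter

def class_shares(labels):
    """Fraction of samples that belongs to each class."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {c: n / len(labels) for c, n in counts.items()}

def sampling_ratio(full_labels, subset_labels):
    """Share in the subset divided by share in the full dataset, per class.

    A ratio above 1 means the class is oversampled, below 1 undersampled.
    """
    full = class_shares(full_labels)
    subset = class_shares(subset_labels)
    return {c: subset.get(c, 0.0) / share for c, share in full.items()}

# Made-up labels: an 80/20 dataset and a balanced 10-sample subset.
full = ["majority"] * 80 + ["minority"] * 20
curated = ["majority"] * 5 + ["minority"] * 5
ratios = sampling_ratio(full, curated)
```

A random subset should give ratios close to 1 for every class, while a curated subset like the one described above would show ratios below 1 for the big classes and above 1 for the small ones.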

These results are very exciting!

No manual work is involved: we can curate the data automatically.

The tool clusters the embeddings and finds the best samples to build a dataset.
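The thread doesn't describe the tool's algorithm beyond "clusters the embeddings", so take the following as one plausible approach, not the tool's actual method: run k-means on the embeddings and keep the sample closest to each cluster center. Everything here (function names, the plain k-means loop, the 2-dimensional toy points) is an assumption made for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means. Requires k <= len(points)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [nearest_centroid(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    assignments = [nearest_centroid(p, centroids) for p in points]
    return centroids, assignments

def select_representatives(points, k, seed=0):
    """Indices of one sample per non-empty cluster: the member nearest its centroid."""
    centroids, assignments = kmeans(points, k, seed=seed)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            chosen.append(min(
                members,
                key=lambda i: squared_distance(points[i], centroids[c])))
    return sorted(chosen)

# Toy "embeddings": two obvious groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
picked = select_representatives(points, k=2)
```

Picking per cluster instead of at random is what pushes the subset away from the original class distribution: a huge cluster of near-identical images contributes about as many samples as a small, rare one.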

Here is more information: https://t.co/E5pWUr3Sfa