Overfitting: The Most Important Problem in Machine Learning

Machine Learning

Written by Hendrik. Jul 27, 2021

7 min read

What's Going On Here?

This semester, I took introductory courses in Artificial Intelligence and Natural Language Processing. Machine learning came up at every corner, even though I had no prior knowledge of it. Unfortunately, it was only ever explained very vaguely, and we never looked into it in any depth. It's been something of a black box all semester, merely a side note. Still, this whole machine learning thing seems interesting, and it's made me curious to learn more about it.

Pedro Domingos, author of The Master Algorithm, claims that overfitting is the most important problem in machine learning. That's quite a bold claim. We had only discussed overfitting once during the semester, so it didn't seem like that big of a deal. So, what's all the fuss about? What is overfitting, and how can it be avoided?

How are Machines Trained to Learn?
Reacting to New Data by Generalizing
What is Overfitting?
How to Avoid Overfitting

How are Machines Trained to Learn?

One of the three main types of machine learning is called supervised learning. Here, the machine is fed a large set of labeled data (labeled meaning that the correct answer for each example is known). This is how the model is trained: the machine makes a prediction for each example and, because the data is labeled (and thus the outcome is known), it can adjust the model's parameters until its predictions come sufficiently close to the actual results. As a rule of thumb, the more data is used in training, the better the model will be at making predictions.
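To make "fitting the model's parameters" a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the toy data, the straight-line model `y = w * x + b`, and the `train` function with its `learning_rate` and `epochs` settings are all made up for this example and stand in for whatever a real library would do.

```python
# Labeled training data: every input x comes with its known outcome y.
# (Hypothetical toy data that follows y = 2x + 1 exactly.)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # start with an uninformed model
    n = len(xs)
    for _ in range(epochs):
        # Compare the current predictions with the known labels.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Nudge the parameters in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Because the labels are known, the model can measure how wrong it is after every pass and correct itself. On this toy data, the parameters end up very close to 2 and 1.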

Reacting to New Data by Generalizing

The goal of learning is to be able to take new data, where the outcome may not be known, and make accurate predictions about it. In other words, the machine has to be able to generalize based on the data it has seen.

For humans, generalizing comes easily. For example, a car salesman may lie to you in order to make a sale. From this, you might infer that all salesmen lie in order to sell you something. This is generalization, and it is something computers have a hard time doing.

Since the machine is now looking at data it has never seen before, it must react by generalizing. How well it does so determines how successful the model is.
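A common way to check how well a model handles never-before-seen data is to hide part of the labeled data during training and only use it for evaluation afterwards. The sketch below shows one way this could look; the `train_test_split` helper, its `test_fraction` and `seed` parameters, and the toy samples are my own assumptions, not something from the course or the book.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold back a fraction of them for evaluation."""
    shuffled = list(samples)  # copy, so the caller's list stays untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # The model trains on the first part and never sees the held-out part.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled samples: (input, label) pairs.
samples = [(x, 2 * x + 1) for x in range(10)]
train_set, test_set = train_test_split(samples)
print(len(train_set), "training samples,", len(test_set), "held-out samples")
```

The held-out samples play the role of new data: the model's score on them says far more about its ability to generalize than its score on the samples it was trained on.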

What is Overfitting?

Overfitting simply means that the machine is not able to generalize. This can happen when the model has memorized the training data rather than learning the underlying trends. As a result, it sees patterns in the data that are not actually there. It models the training data too closely, details and noise included.

A telltale sign that a model has been overfit is that it makes accurate predictions on the labeled data it was trained on but is unable to do so when shown new data.
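Here is a deliberately extreme sketch of that telltale sign. Everything in it is hypothetical: the task (label a number 1 if it is "large", 0 otherwise), the data, and the two toy models. The memorizer simply stores the training data in a lookup table, while the second model learns a single threshold.

```python
from collections import Counter

# Hypothetical labeled data: numbers labeled 1 ("large") or 0 ("small").
train_data = [(5, 0), (12, 0), (23, 0), (31, 0), (44, 0),
              (58, 1), (67, 1), (71, 1), (85, 1), (93, 1), (96, 1)]
test_data = [(8, 0), (19, 0), (27, 0), (40, 0),
             (55, 1), (62, 1), (78, 1), (90, 1)]


def fit_memorizer(data):
    """Memorize every training example; guess the majority label otherwise."""
    table = dict(data)
    fallback = Counter(label for _, label in data).most_common(1)[0][0]
    return lambda x: table.get(x, fallback)


def fit_threshold(data):
    """Learn one cut-off halfway between the two classes."""
    largest_small = max(x for x, label in data if label == 0)
    smallest_large = min(x for x, label in data if label == 1)
    threshold = (largest_small + smallest_large) / 2
    return lambda x: 1 if x >= threshold else 0


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == label for x, label in data) / len(data)


memorizer = fit_memorizer(train_data)
threshold_model = fit_threshold(train_data)

print("memorizer:", accuracy(memorizer, train_data), accuracy(memorizer, test_data))
print("threshold:", accuracy(threshold_model, train_data), accuracy(threshold_model, test_data))
```

Both models are perfect on the training data, but only the threshold model stays perfect on the unseen numbers. The memorizer drops to 50% there, which is no better than guessing. That gap between training and test accuracy is what overfitting looks like in practice.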

The Fine Line Between Blindness and Hallucination

Models that are learning walk a fine line between being blind and hallucinating. A model that is too restricted will fail to identify any patterns, which makes it blind. A model that is overly complex will recognize patterns in the data that do not exist, which means it is hallucinating.

Pedro Domingos puts it this way:

"A good learner is forever walking the narrow path between blindness and hallucination."

This is the central problem in machine learning: it is difficult to stay on this narrow path.
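To get a feel for this narrow path, I put together a small sketch. Note that the technique is my own choice for illustration and not something Domingos uses here: a k-nearest-neighbours predictor, which answers with the average label of the `k` training points closest to the input. The toy data is hypothetical too. With `k=1` the model trusts every single training point, noise included. With `k` equal to the size of the training set it always answers with the overall average and ignores the input entirely.

```python
# Hypothetical toy data: the true relationship is y = x, but every training
# label is off by +1 or -1 (alternating), which stands in for noise.
train_xs = [float(i) for i in range(10)]
train_ys = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_xs)]

# Unseen inputs with noise-free targets, used to measure generalization.
test_xs = [i + 0.4 for i in range(1, 8)]
test_ys = list(test_xs)


def knn_predict(x, k):
    """Average the labels of the k training points closest to x."""
    nearest = sorted(zip(train_xs, train_ys), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_abs_error(xs, ys, k):
    """Average distance between predictions and targets."""
    return sum(abs(knn_predict(x, k) - y) for x, y in zip(xs, ys)) / len(xs)


results = {}
for k in (1, 3, 10):
    results[k] = (mean_abs_error(train_xs, train_ys, k),
                  mean_abs_error(test_xs, test_ys, k))
    print(f"k={k:2d}  train error={results[k][0]:.2f}  test error={results[k][1]:.2f}")
```

With `k=1`, the training error is exactly zero because the model reproduces the noise perfectly, yet it does worse on the unseen inputs than `k=3` does. That is hallucination. With `k=10`, the model is blind: it gives the same answer for every input and has the largest test error of the three. The middle setting is the one that stays on the path.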

How to Avoid Overfitting

Restricting what the model can learn is the only safe way to avoid overfitting. An example of this would be only allowing it to learn short conjunctive concepts, that is, a handful of conditions joined by "and".

There are a few more ways to reduce the risk of overfitting:

Use more data in training. This way, memorizing nonexistent patterns becomes less likely.
Data augmentation. This means creating modified copies of the existing samples (for example, mirrored images) in order to diversify the training data.
Simplify the model so that it is less complex and thus less prone to overfitting.
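Data augmentation is the easiest of these to show in a few lines. The sketch below assumes a made-up data format, where an image is a list of pixel rows, and a hypothetical helper named `augment_with_flips` that adds a mirrored copy of every image under the same label.

```python
def augment_with_flips(dataset):
    """Return the dataset plus a horizontally mirrored copy of every image."""
    augmented = []
    for image, label in dataset:
        flipped = [row[::-1] for row in image]  # reverse each pixel row
        augmented.append((image, label))
        augmented.append((flipped, label))      # same label, new sample
    return augmented


# Hypothetical 2x3 "image" with made-up pixel values.
tiny_image = [[1, 2, 3],
              [4, 5, 6]]
dataset = [(tiny_image, "cat")]

augmented = augment_with_flips(dataset)
print(len(dataset), "sample(s) before,", len(augmented), "after")
```

This only makes sense when the change does not alter the correct label. A mirrored cat is still a cat, but a mirrored letter or digit may no longer mean the same thing.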

Sources and Further Reading

Thanks for reading my post. I’d love to get feedback from you, so feel free to shoot me a tweet!

- Hendrik
