A while ago, I came across the weirdest Facebook group ever: the Flat Earth Society. Members purportedly believed that the Earth was flat! Needless to say, I joined the group to have fun at the members’ expense. Later I found that most of the members were Round Earthers like any normal person. These Rounders mostly went around trolling the Flat Earthers.
A couple of years passed and I didn’t think about the group at all. But recently, a lot of notifications from the group hit my Facebook news feed. There was a true believer having a huge argument with everyone.
He simply refused to listen to reason. He ignored all data against his claim, dismissing it as fake. And his argument for a flat Earth? Look outside! The Earth appears flat, so it must be so!
This whole episode got me thinking. You see, I’m quite interested in Data Science and Machine Learning, so I wondered: could this be used as an analogy to explain overfitting? And should I write a post explaining overfitting using this analogy?
Excited, I called up one of my data scientist buddies, Vishnu Prasad. If you’ve been following us at all, he’s one of the stars of our Venturesity community. And no, he didn’t pay me money to say that. 😉
Anyway, I called him up and explained my great idea. He didn’t seem that enthused and, in fact, said the analogy didn’t quite work. However, he did encourage me to start writing the post and see where it led me. And here we are.
Let’s get back to the topic at hand. If the Flat Earth example is NOT a good way to explain the perils of overfitting, what is? Vishnu suggested that I explain overfitting using the example of the weather.
In the Northern Hemisphere, where I live, it’s summer in July and August. Assume you look at the weather data from this period. Using that as a predictor of what the weather will be like in December, you roam around the city in December in your shorts and tank top. And you freeze to death from hypothermia!
But then, after some back and forth, we realized that this doesn’t quite explain overfitting. The example just shows that there wasn’t sufficient data to make a suitable prediction.
Vishnu pointed out an excellent example from stats.stackexchange.com: an analysis of US Census data vs. time!
They tried several models: a linear fit, a quadratic fit, and so on. But the best fit to the data was a quartic, as shown in the pic below:
However, as you can see, this prediction is absurd: the model predicts zero population by 2050, and a negative population after that!
That’s overfitting in a nutshell. In other words, your model overfits the data at hand if it works too well on your training data but is a poor predictor on your test data. And in some cases, it could lead to life-or-death mistakes! So don’t overfit!!
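The census example can be sketched in a few lines of Python. To be clear, everything below is my own toy illustration (made-up data, NumPy’s `polyfit`, degrees 1 and 6 instead of a quartic), not the actual Stack Exchange analysis. But the mechanism is the same: the more flexible polynomial hugs the training points more tightly, then extrapolates wildly beyond them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training data: a simple linear trend plus noise,
# standing in for something like population vs. time.
x_train = np.linspace(0.0, 1.0, 15)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)

def train_error(degree):
    """Mean squared error of a degree-`degree` polynomial fit on the training data."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))

err_linear = train_error(1)
err_sextic = train_error(6)

# The wiggly degree-6 polynomial always fits the training data at least
# as well as the straight line, because it contains the line as a special case.
assert err_sextic <= err_linear + 1e-9

# But evaluate both fits beyond the training range, and the flexible model
# typically veers far off the true trend, just like the quartic census fit.
x_future = 1.5
line = np.polyval(np.polyfit(x_train, y_train, 1), x_future)
wiggle = np.polyval(np.polyfit(x_train, y_train, 6), x_future)
print(f"linear fit at x=1.5: {line:.2f}, degree-6 fit at x=1.5: {wiggle:.2f}")
```

The key point is that a lower training error says nothing by itself; the straight line loses on the training data yet is the one you’d trust outside it.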
Coming back to the Flat Earth thing, I couldn’t help but think that if I tweaked it a little, I COULD explain overfitting. I mulled over it for a while and concluded that it may not be the best way to go about it. But it does help highlight a potential pitfall in coming up with a statistical model.
What do I mean by that?
If we look at the Flat Earth guy, he threw out all data that didn’t conform to his worldview. What do we generally throw out of noisy data? Outliers! Normally this doesn’t cause issues, but sometimes it will, just as in the case of the Flat Earth guy. (Although he probably kept only the outliers.)
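To make that concrete, here is a minimal sketch of the usual outlier-trimming rule of thumb: drop anything more than k standard deviations from the mean. The function name, the threshold, and the readings are all my own invention for illustration.

```python
import statistics

def drop_outliers(values, k=2.0):
    """Keep only the points within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= k * sd]

readings = [9.8, 10.1, 9.9, 10.0, 10.2, 42.0]  # one wild reading
print(drop_outliers(readings))  # → [9.8, 10.1, 9.9, 10.0, 10.2]
```

A rule like this is blind: it can’t tell measurement noise from a genuine signal, which is exactly the risk the next example is about.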
Okay, let me come up with a better, less frivolous illustration. Think of the early part of the twentieth century, just before Einstein came up with his theory of General Relativity. Newton’s law of Gravitation explained almost everything about the orbits of the planets around the Sun. But there were discrepancies in the orbit of Mercury: its perihelion drifted slightly faster than the theory predicted!
They could have been explained away as outliers. And for a long time, they were. But eventually, Einstein realized that this was a serious issue and came up with a solution.
And thus, Einstein’s Theory of General Relativity was born.
What’s the lesson learned? Sometimes, outliers in your data are telling you something. Most of the time, it’s just noise, but occasionally, it’s the real deal!