This is the first in a series of blog posts covering basic learning materials on Big Data. This first post focuses on the basic mathematical tools that anyone who wants to work in this domain should know.
Big Data. It’s all the rage now. And for good reason too. It is estimated that humanity generates about 2.5 quintillion bytes of data daily. (That’s 2.5 × 10^18 bytes for the scientifically minded. Or even 2.5 exabytes if that’s more to your taste.) But how to make sense of all this data? It’s an open problem. However, a variety of techniques have been brought to bear on it. One such technique is machine learning.
Machine learning is a vast subject in and of itself, but it has two main paradigms: supervised learning and unsupervised learning.
Supervised learning takes a known set of input data and the corresponding responses, and seeks to build a model that predicts the response for new inputs. Two broad classes of supervised learning problems are regression and classification.
Unsupervised learning is used to find hidden patterns in unstructured or unlabelled data. The most common unsupervised learning technique is cluster analysis.
Prediction. That’s the name of the game. To that end, mathematicians have developed many techniques to help with prediction. One of the most formidable is regression.
Regression models the relationship between one or more independent variables and one dependent variable. There are various types of regression models. One of the simplest (and you may remember this from high school) is simple linear regression, which fits a straight line to the data. We don’t have time to go through regression in detail here, so if you want an in-depth exploration, check out the following link.
Caution: Don’t blindly apply your regression techniques. Otherwise you will end up in the absurd situation outlined in the cartoon!
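For readers who would like to see simple linear regression in action, here is a minimal sketch in Python. It assumes an ordinary least-squares fit of a straight line; the function name fit_line and the numbers are made up purely for illustration and do not come from any real data set.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 61, 63, 70]

slope, intercept = fit_line(hours, scores)
print(slope, intercept)        # 4.5 46.5
print(slope * 6 + intercept)   # predicted score for 6 hours: 73.5
```

Note that the prediction for 6 hours lies just outside the observed range. The further you extrapolate beyond your data, the less you should trust the line.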
Suppose you want to categorize data into two or more groups. The canonical example: identifying which emails are spam and which are not (“ham”, for the cool folks). Suppose your email server gets an email. Based on your settings and the nature of the email, it moves the email into either your inbox or your spam folder. How does it do that?
The mathematical tools that make this sort of classification possible are myriad. Some of the commonly used techniques are decision trees, support vector machines, naive Bayes and logistic regression. Again, for the purposes of this entry, we will not go into detail on any of these methods. However, for more information about each, just follow the links!
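To give a flavor of how one of these techniques works, here is a minimal naive Bayes sketch in Python for the spam/ham example. It is a toy under stated assumptions: messages are split on whitespace, words never seen in training are ignored, and Laplace (add-one) smoothing is used. The function names train and classify and the four training messages are invented for illustration; a real spam filter is far more elaborate.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (text, label) pairs, label being 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior: how common is this label overall?
        score = math.log(label_counts[label] / total_messages)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: skip words never seen in training
            # Laplace (add-one) smoothing avoids taking log(0)
            likelihood = (word_counts[label][word] + 1) / (
                words_in_label + len(vocabulary)
            )
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training messages
training = [
    ("win money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(classify("free money", model))      # spam
print(classify("monday meeting", model))  # ham
```

The "naive" part is the assumption that words occur independently of each other given the label, which is why the per-word log-likelihoods can simply be added up.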
Intuitively, you can think of clustering as another way of categorizing data. But this time, the categories your data falls into may not be defined in advance. Clustering groups data so that the items in each cluster are, in some sense, more similar to each other than to the items in other clusters. For example, at the time of this writing, the ratings of the top ten chess players in the world are as follows: 2860, 2816, 2810, 2809, 2800, 2786, 2783, 2780, 2776 and 2773.
These data roughly fall into three clusters: 2860 in a cluster of its own; 2816, 2810, 2809 and 2800 in another, and the rest in a third. (You might disagree with the number of clusters, or even the actual clustering itself, but that’s an argument for another day!)
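One way to arrive at this particular grouping is to sort the ratings and cut the list at its widest gaps. The Python sketch below does exactly that. It is only one possible approach (for one-dimensional data it behaves like a simple connectivity-based clustering), and the function name cluster_by_gaps is made up for this illustration.

```python
def cluster_by_gaps(values, k):
    """Split 1-D values into k groups by cutting at the k - 1 widest gaps."""
    ordered = sorted(values, reverse=True)
    # gaps[i] is the distance between ordered[i] and ordered[i + 1]
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    # Positions of the k - 1 widest gaps, put back in left-to-right order
    widest = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[: k - 1]
    cut_points = sorted(widest)

    clusters, start = [], 0
    for cut in cut_points:
        clusters.append(ordered[start : cut + 1])
        start = cut + 1
    clusters.append(ordered[start:])
    return clusters


ratings = [2860, 2816, 2810, 2809, 2800, 2786, 2783, 2780, 2776, 2773]
clusters = cluster_by_gaps(ratings, 3)
for group in clusters:
    print(group)
# [2860]
# [2816, 2810, 2809, 2800]
# [2786, 2783, 2780, 2776, 2773]
```

The two widest gaps are 44 points (between 2860 and 2816) and 14 points (between 2800 and 2786), which is why the cuts fall where they do. Asking for a different number of clusters, or using a different algorithm, may well give a different answer.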
There are several classes (or should I say categories 🙂 ) of clustering algorithms available: connectivity models based on a notion of distance, centroid models, distribution models based on statistical distributions and density models.
Each clustering algorithm comes with its own set of advantages and disadvantages. Of course, since this post is a mere introduction, we cannot go into detail about how these algorithms work. Here is a great link to get you started on your journey of exploration.
To use these techniques to their full advantage, a good foundation in statistics and probability is a must. We would go too far afield in this series if we tried to teach you the basics of statistics, but good introductory books on the subject abound. One of the author’s favorites is Statistics, by Freedman, Pisani and Purves.
So that concludes our admittedly very brief foray into some of the mathematical techniques used in the field. We hope you have found this useful. Feel free to explore more.
In our next post in this series, we will talk about some of the essential tools in the data scientist’s toolbox: the R programming language, Matlab, Python, SAS, Julia and others. We hope to see you there. We also need to thank Dawar Dedmari, whose valuable suggestions made this a better post. Any mistakes are the full responsibility of the author.