Machine Learning Chapter 11
Many people believe data science is machine learning and that data scientists mostly build, train, and tweak machine-learning models. In reality, data science is mostly addressing business problems by collecting, understanding, cleaning, and formatting data. But once the data is prepared, you may have a chance to apply machine learning techniques.
Before we can talk about machine learning, we need to talk about models. A model is a specification of a mathematical (or probabilistic) relationship that exists between different variables. For example, televised poker estimates each player’s “win probability” in real time based on a model that takes into account the cards that have been revealed so far and the distribution of cards in the deck.
What is Machine Learning?
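Simply put, machine learning is creating and using models that are learned from data. In other contexts, it might be called predictive modeling or data mining. Typically, the goal is to use existing data to develop models that predict outcomes for new data, such as: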
Predicting whether an email message is spam or not
Predicting whether a credit card transaction is fraudulent
Predicting which advertisement a shopper is most likely to click on
Predicting which football team is going to win the Super Bowl
This book looks at both supervised and unsupervised models. Supervised models learn from a set of data labeled with the correct answers, while unsupervised models work with data that has no such labels.
Overfitting and Underfitting
A common danger in machine learning is overfitting: producing a model that performs well on the data you train it on but generalizes poorly to any new data. The other side of this is underfitting: producing a model that doesn’t perform well even on the training data.
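One common way to check for overfitting is to hold some data back and evaluate the model only on data it never saw during training. The sketch below shows one possible way to do that split; the function name `split_data`, the `test_fraction` parameter, and the fixed `seed` are illustrative assumptions, not something this excerpt specifies.

```python
import random
from typing import List, Tuple, TypeVar

T = TypeVar("T")

def split_data(data: List[T], test_fraction: float,
               seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `data` and split it into (train, test).

    `seed` is a hypothetical parameter added so the split is repeatable.
    """
    rng = random.Random(seed)
    shuffled = data[:]                      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(1000))
train, test = split_data(data, test_fraction=0.25)
print(len(train), len(test))  # 750 250
```

If a model does well on `train` but badly on `test`, that gap is a sign of overfitting. If it does badly on both, it is more likely underfitting.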
Usually, the choice of a model involves a trade-off between precision and recall. Precision measures how accurate the positive predictions are (the fraction of predicted positives that are truly positive), while recall measures what fraction of the actual positives the model identified.
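As a rough illustration, both quantities can be computed from counts of true positives, false positives, and false negatives. The counts below are made up purely for the example.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that were actually positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that the model identified."""
    return tp / (tp + fn)

# Hypothetical spam-filter results (illustrative numbers only):
# 80 spam messages flagged correctly, 20 legitimate messages flagged
# by mistake, and 120 spam messages missed.
tp, fp, fn = 80, 20, 120

print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.4
```

A filter that flags almost nothing can have high precision and poor recall, as here, while one that flags almost everything has the opposite problem.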
The Bias-Variance Trade-off
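Another way of thinking about the overfitting problem is as a trade-off between bias and variance. Both are measures of what would happen if you were to retrain your model many times on different sets of training data (from the same larger population). A model with high bias makes roughly the same error no matter which training set it sees, which typically corresponds to underfitting. A model with high variance gives very different predictions from one training set to the next, which typically corresponds to overfitting.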
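The retraining thought experiment can be simulated directly. The sketch below is only an illustration under assumed conditions: the true relationship (`y = x ** 2` plus Gaussian noise), the two toy models (a constant predictor and a one-nearest-neighbor predictor), and all parameter values are choices made for this example, not models taken from the text.

```python
import random
from statistics import mean, pvariance

def true_f(x: float) -> float:
    """Assumed 'true' relationship, chosen only for this demonstration."""
    return x ** 2

def sample_training_set(rng: random.Random, n: int = 20, noise_sd: float = 0.2):
    """Draw n noisy (x, y) pairs from the same underlying population."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise_sd)) for x in xs]

def fit_constant(train):
    """Very simple model: always predict the mean of the training labels."""
    avg = mean([y for _, y in train])
    return lambda x: avg

def fit_nearest_neighbor(train):
    """Very flexible model: predict the label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def bias_and_variance(fit, x0: float, trials: int = 500, seed: int = 0):
    """Retrain on many training sets and summarize the predictions at x0."""
    rng = random.Random(seed)
    preds = [fit(sample_training_set(rng))(x0) for _ in range(trials)]
    return mean(preds) - true_f(x0), pvariance(preds)

x0 = 0.9
const_bias, const_var = bias_and_variance(fit_constant, x0)
nn_bias, nn_var = bias_and_variance(fit_nearest_neighbor, x0)
print(f"constant model:   bias={const_bias:.3f}  variance={const_var:.4f}")
print(f"nearest neighbor: bias={nn_bias:.3f}  variance={nn_var:.4f}")
```

Under these assumptions the constant model should show the larger bias and the smaller variance, and the nearest-neighbor model the reverse.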
Feature Extraction and Selection
When your data doesn’t have enough features, your model is likely to underfit. And when your data has too many features, it’s easy to overfit.
For example, imagine trying to build a spam filter to predict whether an email is junk or not. Most models won’t know what to do with a raw email, which is just a collection of text, so you’ll have to extract features such as:
Does the email contain the word “Viagra”?
How many times does the letter d appear?
What was the domain of the sender?
The answer to the first question is simply a yes or no, a boolean that is typically encoded as a 1 or 0. The answer to the second is a number. And the answer to the third is a choice from a discrete set of options.
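The three questions above can be turned into a small feature extractor. This is a minimal sketch: the function name `extract_features`, the feature names, and the decision to ignore letter case are assumptions made for illustration.

```python
from typing import Dict, Union

Feature = Union[int, str]

def extract_features(body: str, sender: str) -> Dict[str, Feature]:
    """Turn a raw email into one feature of each of the three kinds."""
    text = body.lower()  # assumption: matching and counting ignore case
    return {
        # yes-or-no feature, encoded as 1 or 0
        "contains_viagra": 1 if "viagra" in text else 0,
        # numeric feature
        "d_count": text.count("d"),
        # categorical feature: everything after the last "@"
        "sender_domain": sender.rsplit("@", 1)[-1].lower(),
    }

features = extract_features("Discounted Viagra, order today!",
                            "deals@example.com")
print(features)
# {'contains_viagra': 1, 'd_count': 4, 'sender_domain': 'example.com'}
```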
Usually, we’ll extract features from our data that fall into one of these three categories. The type of features we have constrains the type of models we can use:
The Naive Bayes classifier is suited to yes-or-no features.
Regression models require numeric features (which could include dummy variables that are 0s and 1s).
Decision trees can deal with numeric or categorical data.
All these models and more will be covered in the following chapters.
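The dummy variables mentioned for regression models are one way to make a categorical feature, such as the sender’s domain, usable by a model that needs numbers. The sketch below shows the idea; the helper name `one_hot` and the list of domains are hypothetical.

```python
from typing import Dict, List

def one_hot(value: str, categories: List[str]) -> Dict[str, int]:
    """Encode a categorical value as 0/1 dummy variables, one per category."""
    return {f"sender_domain={c}": 1 if value == c else 0 for c in categories}

# Hypothetical set of known sender domains
domains = ["example.com", "example.org", "example.net"]

encoded = one_hot("example.org", domains)
print(encoded)
# {'sender_domain=example.com': 0, 'sender_domain=example.org': 1, 'sender_domain=example.net': 0}
```

A value outside the known categories encodes as all zeros here; whether that is acceptable depends on the model and the data.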
This post is an excerpt from our new book, Data Science From Scratch.