**Authors:** Pedro Domingos
**Publication, Year:** Communications of the ACM, October 2012
- Link to Article

**A Few Useful Things to Know About Machine Learning**

- Key Insights
- Basic Definitions
- Learning = Representation + Evaluation + Optimization
- It's Generalization that Counts
- Data Alone May Not Be Enough
- Overfitting Has Many Faces
    - Powerful versus simple
- Intuition Fails in High Dimensions
- Theoretical Guarantees Are Not What They Seem
- Feature Engineering Is the Key
- More Data Beats a Cleverer Algorithm
    - Two Types of Learners
- Learn Many Models, Not Just One
    - Model ensembles
    - Bayesian model averaging
- Simplicity Does Not Imply Accuracy
- Representable Does Not Imply Learnable
- Correlation Does Not Imply Causation

- Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled.
- Machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is difficult to find in textbooks.
- This article summarizes 12 key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

- **Classifier** — a system that inputs (typically) a vector of discrete and/or continuous **feature values** and outputs a single discrete value, **the class**.
- **Learner** — inputs a training set of examples (x_i, y_i), where x_i = (x_{i,1}, . . . , x_{i,d}) is an observed input and y_i is the corresponding output, **and outputs a classifier**.

- **Representation** — how you represent the data so the learner can handle it. In other words, this representation defines the **hypothesis space** of classifiers the learner can potentially output.
- **Evaluation** — (a.k.a. *objective function* or *scoring function*) a function that scores whether a classifier is performing well or badly.
- **Optimization** — the process of searching among the classifiers in the hypothesis space to select the highest-scoring one.

The accompanying table shows common examples of each of these three components. Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!

The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time.

- Don't test and train on the same set of data —> this will lead to overfitting
- Don't do a lot of parameter tuning on test data

The risk of contaminating the test data can be mitigated by using cross-validation on your training set, for example with Leave-One-Out (LOO) or k-fold methods.
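As a sketch of what k-fold splitting does (a hypothetical pure-Python helper, not from the article):

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Split indices 0..n_examples-1 into k disjoint folds.

    Each fold serves once as the held-out test set while the
    remaining k-1 folds form the training set, so every example
    is used for testing exactly once and the test data never
    leaks into training.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test_fold))
    return splits

# Leave-One-Out (LOO) is just the special case k = n_examples.
splits = k_fold_indices(10, k=5)
```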

- Machine learning systems (learners) must employ some domain knowledge in order to be useful
- The way learners work is that they take that small amount of domain knowledge as input and amplify it into a lot of output knowledge. Like any amplifier, the more we put in (more domain knowledge), the more we get out.

The most useful learners in this regard are those that do not just have assumptions hardwired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning.

If we don't have the necessary data we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. **This problem is called overfitting**.

Understanding overfitting through **bias** and **variance** can be helpful. This is visualized below on the dart board.

| Bias | Variance |
|---|---|
| The tendency to consistently learn the wrong thing | The tendency to learn random things irrespective of the real signal |

… beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one. Figure 2 illustrates this. Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes’s false assumption that the frontier is linear!

**Cross-validation can be helpful to avoid overfitting but it does not solve all problems.**

Other types of overfitting mitigation techniques:

- **Adding a regularization term to the evaluation function.** This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit.
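As an illustrative sketch (my own, not from the article): L2, ridge-style regularization on a one-feature least-squares fit has a closed form that shrinks the learned weight toward zero as the penalty grows:

```python
def fit_weight(xs, ys, lam=0.0):
    """Least-squares slope through the origin with an L2 penalty.

    Minimizes sum((y - w*x)^2) + lam * w^2, whose closed form is
    w = sum(x*y) / (sum(x^2) + lam). A larger lam gives a smaller
    |w|: a 'simpler' model with less room to overfit.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_plain = fit_weight(xs, ys)          # unregularized fit
w_ridge = fit_weight(xs, ys, lam=10)  # penalized fit, shrunk toward 0
```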

- **Perform a statistical significance test** (like chi-square) **before adding new structure.** This can be helpful in deciding whether the distribution of the class is really different with and without the new structure.
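A minimal sketch of such a test (the helper is my own), assuming a 2x2 contingency table of class counts with vs. without the proposed new structure:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]], e.g. class counts with vs. without a
    proposed new rule or split. Compare against the critical
    value 3.84 (1 degree of freedom, alpha = 0.05).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        stat += (obs - expected) ** 2 / expected
    return stat

# Class balance barely changes across the split: not significant,
# so the new structure is probably fitting noise.
stat = chi_square_2x2(20, 20, 22, 18)
significant = stat > 3.84
```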

It is easy to avoid overfitting (variance) by falling into the opposite error of under-fitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best.

**Common misconception —> Overfitting comes from noise.**

- While certain types of noise in your data can aggravate overfitting, noise is not required for a classifier to have a severe overfitting problem.

**Multiple testing** — this is building on the problem of conducting multiple statistical tests and increasing your chances of a Type I error.

For example, a mutual fund that beats the market 10 years in a row looks very impressive, until you realize that, if there are 1,000 funds and each has a 50% chance of beating the market on any given year, it is quite likely that one will succeed all 10 times just by luck.
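The arithmetic behind this example can be checked directly:

```python
# Probability that at least one of 1,000 funds beats the market
# 10 years in a row, if every fund is a pure coin flip each year.
p_one_fund = 0.5 ** 10                           # one fund wins all 10 years
p_at_least_one = 1 - (1 - p_one_fund) ** 1000    # at least one of 1,000 does
# About 0.62: a decade-long "winner" is more likely than not
# to exist by luck alone.
```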

After overfitting, the biggest problem in machine learning is the **curse of dimensionality**.

Generally speaking, our human intuitions about dimensionality do not extend past three physical dimensions. As a result of this limitation, those intuitions may be useless (at best) or misleading (at worst).

Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class. But in fact their benefits may be outweighed by the curse of dimensionality.

For example, more features mean more dimensions:

- This yields a much larger instance space, and a larger potential solution space to search, so a fixed amount of training data covers a vanishingly small fraction of the inputs
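A quick simulation (my own sketch, not from the article) shows distance concentration, one standard symptom of the curse: as dimensionality grows, all pairwise distances between random points bunch together, so "nearest" neighbors stop being meaningfully nearer than anything else.

```python
import random

def distance_spread(dim, n_points=200, seed=0):
    """Relative spread (std/mean) of pairwise Euclidean distances
    between random points in the unit hypercube. As dim grows,
    the spread shrinks: every point looks roughly equally far
    from every other point.
    """
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            d = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])) ** 0.5
            dists.append(d)
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return (var ** 0.5) / mean

low = distance_spread(2)     # noticeable spread in 2 dimensions
high = distance_spread(100)  # distances concentrate in 100 dimensions
```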

Fortunately, there is an effect that partly counteracts the curse, which might be called the “blessing of non-uniformity.”

In most applications examples are not spread uniformly throughout the instance space, but are concentrated on or near a lower dimensional manifold.

This portion discusses probabilistic guarantees on the likelihood that a particular classifier is bad, based on how much training data is used.

This provides some type of "bound" on how much data is needed.

The author suggests that we should take these with a grain of salt, however, because they tend to be particularly "loose" and also to keep in mind that interesting problems may require an exponentially larger amount of data in order to search the hypothesis space.

This is explained a bit more clearly in the original text but I will simply end with this point because I think it's the most valuable...

Further, we have to be careful about what a bound like this means. For instance, it does not say that, if your learner returned a hypothesis consistent with a particular training set, then this hypothesis probably generalizes well. What it says is that, given a large enough training set, with high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis.

The bound also says nothing about how to select a good hypothesis space. It only tells us that, if the hypothesis space contains the true classifier, then the probability that the learner outputs a bad classifier decreases with training set size. If we shrink the hypothesis space, the bound improves, but the chances that it contains the true classifier shrink also.

Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed to output the correct classifier.

In practice, we are seldom in the asymptotic regime (also known as “asymptopia”).

- Which features are used is the most important factor in any learner's success
- Even if the raw data is not in a learner-friendly form, you can usually construct features from it that are
- Thus, much of machine learning work is in constructing, cleaning, preparing, and wrangling data before putting it through a learner

So there is ultimately no replacement for the smarts you put into feature engineering.
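As a classic illustration (mine, not the article's): XOR cannot be separated by any linear rule on the raw features, but one engineered product feature makes a linear rule sufficient:

```python
# XOR is not linearly separable on the raw features (x1, x2),
# but adding one engineered feature x1*x2 makes a linear rule work.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def linear_rule(x1, x2):
    x3 = x1 * x2                                # engineered feature
    score = 1.0 * x1 + 1.0 * x2 - 2.0 * x3 - 0.5
    return 1 if score > 0 else 0

all_correct = all(linear_rule(*x) == y for x, y in data)  # True
```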

Clever algorithms are valuable, however, it is often more pragmatic to simply gather more data.

As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)

This, however, raises the issue of **scalability**. There is lots and lots of data, but no one has the time to process it all.

This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn.

Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same thing. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations.

All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of “nearby.”
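The most literal version of "nearby" is 1-nearest-neighbor, which can be sketched in a few lines (my own toy implementation):

```python
def nearest_neighbor_predict(train, query):
    """1-nearest-neighbor: label the query with the class of the
    closest training example under Euclidean distance. 'Nearby'
    here literally means smallest distance; other learners differ
    mainly in how they define this notion of nearness.
    """
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    closest = min(train, key=lambda ex: dist(ex[0], query))
    return closest[1]

train = [((0.0, 0.0), "neg"), ((1.0, 1.0), "pos")]
label = nearest_neighbor_predict(train, (0.9, 0.8))  # "pos"
```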

A good rule of thumb is to try simple learners first:

- Naïve Bayes before logistic regression
- *k*-nearest neighbor before support vector machines

Two reasons why:

- They are easier to use
- It is much clearer how they are working

- This kind of information is extremely useful when trying to figure out how things are actually working. If more complex learners are necessary you will be confident that is the case and have a better idea of how exactly to build one based on what you've learned with the simpler models

**Fixed-size learners** — those whose representation has a fixed size, like linear classifiers. They can only take advantage of so much data.

- Notice how the accuracy of naïve Bayes asymptotes at around 70% in [Figure 2]

**Variable-size learners** — those whose representation can grow with the data, like decision trees.

- Sometimes called nonparametric learners, but this is somewhat unfortunate, since they usually wind up learning many more parameters than parametric ones
- Can in principle learn any function given sufficient data, but in practice they may not, because of limitations of the algorithm (for example, greedy search falls into local optima) or computational cost
- Also, because of the curse of dimensionality, no existing amount of data may be enough

For these reasons, clever algorithms—those that make the most of the data and computing resources available—often pay off in the end, provided you are willing to put in the effort. There is no sharp frontier between designing learners and learning classifiers; rather, any given piece of knowledge could be encoded in the learner or learned from data. So machine learning projects often wind up having a significant component of learner design, and practitioners need to have some expertise in it.

We have begun to find out that simply mixing models together is often the best way to improve performance.

Three simple versions are:

- **Bagging** — generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias.
- **Boosting** — give weights to the training examples, and vary them so that each new classifier focuses on the examples the previous ones tended to get wrong.
- **Stacking** — the outputs of individual classifiers become the inputs of a "higher-level" learner that figures out how best to combine them.

Many other techniques exist, and the trend is toward larger and larger ensembles.
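Bagging, the simplest of the three, can be sketched as follows (a toy implementation with a 1-NN base learner; the helper names are my own):

```python
import random

def bagging_predict(train, query, learn, n_models=25, seed=0):
    """Bagging sketch: learn one classifier per bootstrap resample
    of the training set, then combine predictions by majority
    vote. `learn` is any base learner mapping a training set to a
    predict(query) -> label function.
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in range(len(train))]
        predict = learn(sample)
        votes.append(predict(query))
    return max(set(votes), key=votes.count)

def learn_1nn(train):
    """A simple high-variance base learner: 1-nearest-neighbor."""
    def predict(query):
        def sq_dist(ex):
            return sum((a - b) ** 2 for a, b in zip(ex[0], query))
        return min(train, key=sq_dist)[1]
    return predict

train = [((0.0,), "a"), ((0.1,), "a"), ((1.0,), "b"), ((1.1,), "b")]
label = bagging_predict(train, (0.05,), learn_1nn)  # "a"
```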

Model ensembles should not be confused with Bayesian model averaging (BMA)—**the theoretically optimal approach to learning**.

Predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space.

- These are weighted by how well the classifiers explain the training data and how much we believe in them *a priori*.

Ensembles **change the hypothesis space** (for example, to a linear combination of the individual classifiers), while the **BMA approach assigns weights to classifiers within the original hypothesis space**.

BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it.

A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.

This whole section is dedicated to pointing out that simpler ML models are not necessarily more accurate.

**However, from a practical perspective, simpler models are often more valuable because we can more easily understand and learn from them.**

Just because the data can be represented in one form does not mean that it can be learned in that form.

It can be valuable to try and represent the data in different ways for different learners to see if they perform better/worse.

Therefore the key question is not “Can it be represented?” to which the answer is often trivial, but “Can it be learned?” And it pays to try different learners (and possibly combine them).

The point in the header above is by now well worn.

That being said, strong correlational findings may help researchers discover different/new areas of research for testing causality.

Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but

there are two practical points for machine learners. First, whether or not we call them "causal," we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so.

Notes by Matthew R. DeVerna