Pedro Domingos (2012)
- Learning = representation + evaluation + optimization
- Representation: the learner's representation determines the hypothesis space, i.e. the set of classifiers it can possibly learn
- Evaluation: objective function is needed to distinguish good classifiers from bad ones
- Optimization: optimization technique is key to learner efficiency + determines which classifier we end up with in case of multiple optima
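A minimal Python sketch (my own illustration, not from the paper) of how the three components line up for one concrete learner, logistic regression trained by gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, X):
    # Representation: a linear score through a sigmoid fixes the set of
    # classifiers we can possibly learn.
    return sigmoid(X @ w)

def log_loss(w, X, y):
    # Evaluation: the objective that tells good weight vectors from bad ones.
    p = predict(w, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, lr=0.1, steps=1000):
    # Optimization: gradient descent on the log loss searches the space of
    # classifiers; this loss is convex, so there is a single optimum.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (X.T @ (predict(w, X) - y)) / len(y)
    return w
```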
- Generalization counts: generalize beyond the train set
- Watch out for contamination of test data
- Solution: cross-validation (hold out a different validation fold each time; see the sketch after this list)
- Train error as surrogate for test error is dangerous, but a local optimum may be good enough!
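A hand-rolled k-fold cross-validation sketch; `model_factory` is a hypothetical stand-in for any learner with `fit`/`predict`:

```python
import numpy as np

def cross_val_accuracy(model_factory, X, y, k=5, seed=0):
    # Shuffle once, split into k folds, hold out a different fold each pass.
    idx = np.random.RandomState(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = model_factory()  # fresh model per fold: no test contamination
        model.fit(X[train], y[train])
        scores.append(np.mean(model.predict(X[val]) == y[val]))
    return float(np.mean(scores))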
- Data alone is not enough: no matter how much data you have, the learner still needs knowledge or assumptions beyond it in order to generalize
- No free lunch: no learner > random for all possible functions
- Why is ML still successful, then?
- Real world functions are not drawn uniformly from all possible functions
- Induction is a far greater knowledge lever than deduction: it requires much less input knowledge to produce useful results
- Choosing a representation: consider which kinds of knowledge are easily expressed in it
- Overfitting has many faces: e.g. 100% train accuracy with 50% test accuracy, instead of 75%/75% (see the sketch after this list)
- Bias: consistently learn the same wrong thing (linear learner for non-linear phenomena)
- Variance: tendency to learn random things irrespective of the real signal (e.g. decision trees suffer from this)
- Strong false assumptions can beat weak true ones: a learner with weak assumptions needs more data to avoid overfitting
- Cross-validation, regularization terms and statistical significance tests help
- Easy to avoid overfitting by underfitting (bias)
- Overfitting is only partially due to noise
- Correct for multiple hypothesis testing (e.g. Bonferroni correction)
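A quick scikit-learn sketch (my illustration) of the faces of overfitting: an unpruned tree memorizes noisy training data yet generalizes worse here than a depth-limited one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           flip_y=0.2,  # 20% label noise
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = grow until leaves are pure (high variance)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f} "
          f"test={tree.score(X_te, y_te):.2f}")
```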
- Intuition fails in high-D:
- Curse of dimensionality: more features need exponentially more data, so more features/hyperparameters is not always better!
- Blessing of non-uniformity: in most applications, examples are concentrated on or near a lower-D manifold --> take advantage of this implicitly with the learner, or explicitly with a dimensionality-reduction algorithm
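A sketch of exploiting non-uniformity with PCA as the explicit dimensionality-reduction step (synthetic data, my illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))        # data really lives in 3-D
X = latent @ rng.normal(size=(3, 50))      # ...linearly embedded in 50-D
X += 0.01 * rng.normal(size=X.shape)       # plus a little ambient noise

pca = PCA(n_components=0.99).fit(X)        # keep 99% of the variance
print(pca.n_components_)                   # -> ~3: the intrinsic dimension, not 50
```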
- Theoretical guarantees are not what they seem:
- The main role of such guarantees is not serving as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design
- "Asymptopia": bound on number of examples for good generalization
- Feature engineering is key: building features from raw data
- Domain-specific: incorporate specialist knowledge
- Automated feature engineering: generate a large number of candidate features and select the best --> difficult, since features are not (always) independent: ones that look irrelevant in isolation may matter in combination
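A small pandas sketch (hypothetical columns, my illustration) of domain-specific feature engineering on raw timestamps:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2023-01-02 09:15", "2023-01-07 23:40", "2023-01-09 12:05"])})

# Raw datetimes are nearly opaque to most learners; derived features encode
# the domain knowledge that behavior varies by hour and weekday.
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5
print(df)
```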
- More data > cleverer algorithm: the quickest path to success is often simply more data
- To some extent, all learners do the same thing: group nearby examples into the same class; they mainly differ in what "nearby" means
- 2 types of learners:
- Parametric (e.g. linear regression): can only take advantage of so much data
- Non-parametric (e.g. decision trees): learn any function given enough data, but limitations on cost, etc. + curse of dimensionality!
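A sketch contrasting the two (synthetic data, my illustration): a linear, parametric model plateaus on a nonlinear target, while a tree, being non-parametric, keeps improving as data grows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def sample(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    return X, np.sin(X).ravel() + 0.1 * rng.normal(size=n)

X_te, y_te = sample(2000)
for n in (50, 500, 5000):
    X_tr, y_tr = sample(n)
    lin = LinearRegression().fit(X_tr, y_tr)      # fixed-size hypothesis: stalls
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_tr, y_tr)  # grows with data
    print(f"n={n}: linear R^2={lin.score(X_te, y_te):.2f} "
          f"tree R^2={tree.score(X_te, y_te):.2f}")
```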
- Learn many models, not just one: instead of trying many learners and selecting one, combine many and average
- Bagging: learn classifiers on random bootstrap subsets of the training data, combine results by voting (sketch after this list)
- Greatly reduces variance while increasing bias only slightly
- Boosting: each new classifier focuses on what the last one got wrong
- Stacking: individual classifiers output into a higher-level learner that combines them
- Trend: larger and larger ensembles
- Bayesian model averaging: predict by averaging the predictions of all classifiers in the hypothesis space, weighted by how probable each one is given the training data
- Ensembles = changing hypothesis space, Bayesian model averaging = weighting parts of the hypothesis space
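A hand-rolled bagging sketch (my illustration; assumes integer class labels and scikit-learn trees as base learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap: draw with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])  # (n_estimators, n_samples)
    # Majority vote per example; averaging many high-variance trees cuts variance.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```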
- Simplicity does not imply accuracy
- Occam's razor: select hypothesis with fewest assumptions
- Cannot be literally true --> think of no free lunch, and of ensembles, which enlarge the model yet improve accuracy!
- However: smaller hypothesis spaces allow for better generalization
- Simpler hypotheses should still be preferred, but because simplicity is a virtue in its own right, not because it implies accuracy
- Representable does not imply learnable
- Correlation does not imply causation, but it can be a sign of causality!
- Experimental data is needed to prove causation (including control group, etc.)
- Machine learning: usually observational data