# A few useful things to know about machine learning

Pedro Domingos (2012)

## Key points

  • Learning = representation + evaluation + optimization
    • Representation: the choice of representation determines the hypothesis space, i.e. the set of classifiers the learner can possibly learn
    • Evaluation: an objective function is needed to distinguish good classifiers from bad ones
    • Optimization: the optimization technique is key to learner efficiency, and determines which classifier we end up with when there are multiple optima
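
The three-part decomposition can be sketched with a toy learner. The snippet below uses logistic regression fit by gradient descent purely as an illustration; the function names and the toy data are my own, not from the paper:

```python
import math

def predict(w, b, x):
    """Representation: a linear score w*x + b squashed by a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def log_loss(w, b, data):
    """Evaluation: average negative log-likelihood of the labels."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(w, b, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def fit(data, lr=0.5, steps=500):
    """Optimization: plain gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in data:
            err = predict(w, b, x) - y  # gradient of log loss w.r.t. the score
            gw += err * x
            gb += err
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

# Hypothetical toy data: label flips from 0 to 1 as x crosses 0.
data = [(-2, 0), (-1, 0), (1, 1), (2, 1)]
w, b = fit(data)
```

Here `predict` is the representation, `log_loss` the evaluation function, and `fit` the optimizer; swapping any one of the three yields a different learner.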
  • Generalization counts: generalize beyond the train set
    • Watch out for contamination of test data
    • Solution: cross-validation (hold out different validation set each time)
    • Train error as surrogate for test error is dangerous, but a local optimum may be good enough!
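
The hold-out-a-different-fold idea can be sketched generically; `train_fn` and `score_fn` below are hypothetical callables standing in for any learner and any metric:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 once, then split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(train_fn, score_fn, data, k=5):
    """Hold out a different fold each round, train on the rest,
    and average the held-out scores."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        held = set(fold)
        model = train_fn([data[i] for i in range(len(data)) if i not in held])
        scores.append(score_fn(model, [data[i] for i in fold]))
    return sum(scores) / len(scores)
```

Each example is scored exactly once by a model that never saw it, which is what keeps training data from contaminating the evaluation.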
  • Data alone is not enough: no matter how much data you have, the learner must combine it with knowledge or assumptions that go beyond it
    • No free lunch: no learner > random for all possible functions
    • Why is ML still successful, then?
      • Real world functions are not drawn uniformly from all possible functions
      • Induction >> deduction, requiring much less input knowledge to produce useful results
    • Choosing a representation: consider which kinds of knowledge are easily expressed in it
  • Overfitting has many faces: 100% train accuracy, 50% test accuracy instead of 75%/75%
    • Bias: consistently learn the same wrong thing (linear learner for non-linear phenomena)
    • Variance: tendency to learn random things irrespective of the real signal (e.g. decision trees suffer from this)
    • Strong false assumptions can beat weak true ones, because a learner with weak assumptions needs more data to reach the same accuracy
    • Cross-validation, regularization terms and statistical significance tests help
    • It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias)
    • Overfitting is only partially due to noise
    • Correct for multiple hypothesis testing (e.g. with a Bonferroni correction)
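
As a concrete sketch of the Bonferroni idea: when m hypotheses are tested at once, the per-test significance threshold is divided by m. This is the standard correction, not something specific to the paper:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i only if p_i < alpha / m, where m is the number
    of hypotheses tested; this keeps the family-wise error rate <= alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With three tests the per-test threshold drops to 0.05 / 3 ~ 0.0167,
# so only the first p-value survives the correction.
print(bonferroni([0.001, 0.02, 0.04]))  # -> [True, False, False]
```

Without the correction, testing enough hypotheses at alpha = 0.05 makes spurious "significant" findings almost inevitable.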
  • Intuition fails in high-D:
    • Curse of dimensionality: more features require exponentially more data to cover the input space, so adding features is not always better!
    • Blessing of non-uniformity: for a certain application, most examples are near a lower-D manifold --> take advantage of this with learner, or use an algorithm to explicitly reduce dimensions
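
A quick way to see why intuition fails in high dimensions is distance concentration: for points drawn uniformly in a hypercube, the gap between the nearest and farthest point shrinks relative to the distances themselves as the dimension grows, so "nearest" neighbors stop being meaningfully nearer than anything else. A minimal sketch (the parameters are arbitrary):

```python
import random

def distance_spread(dim, n_points=200, seed=0):
    """(max - min) / min over distances from the origin for points
    drawn uniformly in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    dists = [sum(rng.random() ** 2 for _ in range(dim)) ** 0.5
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)
```

With a few hundred points the spread ratio is typically large in 2-D but falls well below 1 by 1000-D.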
  • Theoretical guarantees are not what they seem:
    • The main role of such guarantees is not serving as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design
    • "Asymptopia": bounds on the number of examples needed for good generalization are asymptotic and usually far too loose to be useful in practice
  • Feature engineering is key: building features from raw data
    • Domain-specific: incorporate specialist knowledge
    • Automated feature engineering: generate large number of candidates, select best --> difficult, since features are not (always) independent
  • More data > cleverer algorithm: quickest path to success is data
    • To a first approximation, all learners do the same thing: they group nearby examples into the same class
    • 2 types of learners:
      • Parametric (e.g. linear regression): can only take advantage of so much data
      • Non-parametric (e.g. decision trees): learn any function given enough data, but limitations on cost, etc. + curse of dimensionality!
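
The contrast between the two learner types can be sketched on a deliberately non-linear target (x squared): the linear model's capacity is fixed, while 1-nearest-neighbor, a simple non-parametric learner, can fit any function given enough data. The example setup is my own, not from the paper:

```python
def fit_linear(data):
    """Parametric: least-squares line y = a*x + b; its capacity is
    fixed no matter how much data arrives."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def fit_1nn(data):
    """Non-parametric: 1-nearest-neighbor; capacity grows with the
    data, so it can approximate any function given enough examples."""
    return lambda q: min(data, key=lambda p: abs(p[0] - q))[1]

# Hypothetical target: y = x**2, which no straight line can express.
train = [(x / 10, (x / 10) ** 2) for x in range(-20, 21)]
mse = lambda f: sum((f(x) - y) ** 2 for x, y in train) / len(train)
```

The linear model plateaus no matter how much data arrives; 1-NN keeps improving, at the cost of storing every example and suffering the curse of dimensionality.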
  • Learn many models, not just one: instead of trying many learners and selecting one, combine many and average
    • Bagging: learn classifiers on random subsets, combine results by voting
      • Greatly reduce variance, while increasing bias only slightly
    • Boosting: each new classifier focuses on what the last one got wrong
    • Stacking: individual classifiers output into a higher-level learner that combines them
    • Trend: larger and larger ensembles
    • Bayesian model averaging: predict by averaging predictions of all classifiers in hypothesis space, weighted based on belief
    • Ensembles = changing hypothesis space, Bayesian model averaging = weighting parts of the hypothesis space
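
A minimal bagging sketch, assuming decision stumps as the high-variance base learner; every detail here (the stump, the toy data, the parameters) is illustrative, not from the paper:

```python
import random

def stump(data):
    """Base learner: a one-threshold decision stump (simple, high-variance)."""
    best = None
    for t in sorted({x for x, _ in data}):
        for sign in (1, -1):
            correct = sum(1 for x, y in data
                          if (1 if sign * (x - t) > 0 else 0) == y)
            if best is None or correct > best[0]:
                best = (correct, t, sign)
    _, t, sign = best
    return lambda x: 1 if sign * (x - t) > 0 else 0

def bag(data, n_models=101, seed=0):
    """Bagging: fit each stump on a bootstrap resample of the data,
    then predict by majority vote over the ensemble."""
    rng = random.Random(seed)
    models = [stump([rng.choice(data) for _ in data]) for _ in range(n_models)]
    return lambda x: round(sum(m(x) for m in models) / n_models)

# Hypothetical toy data: label is 1 exactly when x is positive.
data = [(-3, 0), (-2, 0), (-1, 0), (1, 1), (2, 1), (3, 1)]
ensemble = bag(data)
```

Individual stumps vary wildly from resample to resample; the vote averages that variance away while barely changing the bias.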
  • Simplicity does not imply accuracy
    • Occam's razor: select hypothesis with fewest assumptions
      • Cannot be true --> think of no free lunch and ensembles!
      • However: smaller hypothesis spaces allow for better generalization
      • Simpler hypotheses should still be preferred, but because simplicity is a virtue in its own right, not because it implies accuracy
  • Representable does not imply learnable: even if the true function is in the hypothesis space, limited data, time, and memory may keep the learner from finding it
  • Correlation doesn't imply causation: but can be a sign of causality!
    • Experimental data is needed to prove causation (including control group, etc.)
    • Machine learning: usually observational data