O slideshow foi denunciado.
Utilizamos seu perfil e dados de atividades no LinkedIn para personalizar e exibir anúncios mais relevantes. Altere suas preferências de anúncios quando desejar.

AI/ML as an empirical science

908 visualizações

Publicada em

A discussion of the nature of AI/ML as an empirical science. Covering concepts in the field, how to position ourselves, how to plan for research, what are empirical methods in AI/ML, and how to build up a theory of AI.

Publicada em: Ciências
  • Seja o primeiro a comentar

AI/ML as an empirical science

  1. 1. 25/06/2020 1 Empirical observations of empirical AI research Concepts ● Positioning ● Planning ● Empirical methods ● Theorization
  2. 2. 25/06/2020 2 Empirical AI Domain expert Rigorous AI
  3. 3. Empirical != Ad-hoc Empirical != Applied research 25/06/2020 3 Origin: Greek empeirikos ("experienced") Source: QuestionPro
  4. 4. http://people.eku.edu/ritchisong/554notes2.html HEAVIER-THAN-AIR FLIGHT: OBSERVATION
  5. 5. Sources: http://aero.konelek.com/aerodynamics/aerodynamic-analysis-and-design http://www.foolishsailor.com/Sail-Trim-For-Cruisers-work-in-progress/Sail-Aerodynamics.html Enabling factors • Aerodynamics • Powerful engines • Light materials • Advances in control • Established safety practices HEAVIER-THAN-AIR FLIGHT: THEORY
  6. 6. WHY EMPIRICAL? Unless theoretically proven, we need realistic validation.  Physics theories are great, but they are based on assumptions  need experimental validation Toy theories don’t scale  have zero impact  Winter AI.  Hence, the theories have gone down a bit, e.g., down the Chomsky hierarchy in NLP, and down to surface pattern recognition in CV. Real difficult problems lead to new theoretical set ups. Empirical observations offer insights, suggests problem formulation, induction and abduction Push the field forward. AI now strongly driven by industry. 25/06/2020 6
  7. 7. EMPIRICAL RESEARCH != IGNORANCE OF THEORY Theory guides the search. It quantifies the goals. Making sure what we obtained is correct. Current AI theories:  Optimization. Mostly non-convex. Loss landscape of deep nets.  Learnability and generalization. Learning dynamics, convergence and stability.  Probabilistic inference. Quantification of uncertainty. Algorithmic assurance.  First-order logic.  Constraint satisfaction. 25/06/2020 7
  8. 8. IS EMPIRICAL ML/AI DIFFICULT? Yes! This is the real world. It is messy and must be difficult. Robotics Healthcare Drug design 25/06/2020 8 No! The “effective” solution space is small and constrained. Machine translation Self-driving cars Drug design
  9. 9. TWO RESEARCH STYLES Methodological: One method, many application scenarios. The method will fail at some point. Applied: One app, many competing methods. Some methods are intrinsically better than others. E.g., CNN for image processing. 25/06/2020 9
  10. 10. HOW TO POSITION YOURSELF 25/06/2020 10 “[…] the dynamics of the game will evolve. In the long run, the right way of playing football is to position yourself intelligently and to wait for the ball to come to you. You’ll need to run up and down a bit, either to respond to how the play is evolving or to get out of the way of the scrum when it looks like it might flatten you.” (Neil Lawrence, 7/2015, now with Amazon) http://inverseprobability.com/2015/07/12/Thoughts-on-ICML-2015/
  11. 11. “Then ask yourself, what is your unique position? What are your strengths and advantages that people do not have? Can you move faster than others? It may be by having access to data, access to expertise in the neighborhood, or borrowing angles outside the field. Sometimes digging up old ideas is highly beneficial, too. Alternatively, just calm down, and do boring-but-important stuffs. Important problems are like the goal areas in ball games. The ball will surely come.” https://letdataspeak.blogspot.com/2016/12/making-dent-in-machine-learning-or-how.html 25/06/2020 11
  12. 12. WISDOM OF THE ELDERS Max Planck  Science advanced by one death at a time Geoff Hinton:  When you have a cool idea but everyone says it is rubbish, then it is a really good idea.  The next breakthrough will be made by some grad students Richard Hamming:  After 7 years you will run out of ideas. Better change to a totally new field and restart.  If you solve one problem, another will come. Better to solve all problems for next year, even if we don’t know what they are. http://paulgraham.com/hamming.html
  13. 13. STRATEGIES TO SELECT RESEARCH AREAS Ask what intelligence means, and what human brain would do Set an intelligence level (bacteria, insect, fish, mouse, dog, monkey, human) Check solvable unsolved problems. There are LOTS of them!!! Look for impact Look for being difference. Fill the gap. Be unique. Be recognizable. Riding the early wave. 25/06/2020 13 Ask what a physicist would do:  Simplest laws/tricks that would do 99% of the job  Minimizing some free-energy  Looking for invariance, symmetry
  14. 14. INSIGHTS FROM NEUROSCIENCE Brain is the only known intelligence system Can provide inspirations for AI Can validate AI claims Hassabis, Demis, et al. "Neuroscience-inspired artificial intelligence." Neuron95.2 (2017): 245-258. Schmidhuber, Jürgen. "Deep learning in neural networks: An overview." Neural networks 61 (2015): 85-117. Indeed, many AI concepts have had originated from neuroscience  However, the influence is only at the initial step  AI doesn’t have the constraints that neuroscience does  AI can do things that are impossible within neuroscience
  15. 15. NEUROSCIENCE CONCEPTS Memory Reinforcement learning Control Attention Planning Imagination Symbolic manipulation Emotion Consciousness Transfer learning Continual learning Spatial memory Social interaction Developmental Catastrophic forgetting Nature versus nurture
  16. 16. 25/06/2020 16 #REF: A Cognitive Architecture Approach to Interactive Task Learning, John E. Laird, University of Michigan
  17. 17. 25/06/2020 17 #REF: A Cognitive Architecture Approach to Interactive Task Learning, John E. Laird, University of Michigan Knowledge graphs Planning, RL NTM Attention, RL RLUnsupervised CNN Episodic memory Where is the processor?
  18. 18. THINKING & LEARNING, FAST AND SLOW Thinking Fast: reactive, primitive, associative, animal like Slow: deliberative, reasoning, logical. Learning Fast: often memory-based, one-shot learning style, fast-weights, e.g., k-NN Slow: often model- based, iterative, slow-weights, e.g., backgrop 25/06/2020 18
  19. 19. UNSOLVED PROBLEMS (1) Learn to organize and remember ultra-long sequences Learn to generate arbitrary objects, with zero supports Reasoning about object, relation, causality, self and other agents Imagine scenarios, act on the world and learn from the feedbacks Continual learning, never-ending, across tasks, domains, representations Learn by socializing Learn just by observing and self-prediction Organizing and reasoning about (common- sense) knowledge Grounding symbols to sensor signals Automated discovery of physical laws Solve genetics, neuroscience and healthcare Automate physical sciences Automate software engineering
  20. 20. UNSOLVED PROBLEMS (2) Click here 25/06/2020 20
  21. 21. LECUN’S LIST (2018) Deep Learning on new domains (beyond multi-dimensional arrays) Graphs, structured data... Marrying deep learning and (logical) reasoning Replacing symbols by vectors and logic by algebra Self-supervised learning of world models Dealing with uncertainty in high-dimensional continuous spaces Learning hierarchical representations of control space Instantiating complex/abstract action plans into simpler ones Theory! Compilers for differentiable programming.
  22. 22. A TYPICAL EMPIRICAL PROCESS Hypothesis generation Mostly by our brain But AI can help the process too! Hypothesis testing by experiments Two types: Discovery: it is there, awaiting to be found. Mostly theories of nature. Invention: not there, but can be made. Mostly engineering. 25/06/2020 22
  23. 23. HOW TO PLAN AN EMPIRICAL RESEARCH Get the right people. Smart. Self-motivated. Can read maths. Can code. Can do dirty data cleaning. Choose a right topic. Impactful ones at the right time. Looks for current failures. Set the targets. Explore theoretical options. Breaking into solvable parts  Before: 3 years of PhD, one problem.  Now: every 3-6 months to meet a conference deadline (Feb-March, May-June, Aug-Sept, Nov-Dec) Collect data. LOTS of them. Clean up. Code it up. In DL, it means using your favourite programming frameworks (e.g.,PyTorch, TensorFlow) Evaluate the options on some small data, then big data, then a variety of data Repeat 25/06/2020 23 Question: When should we stop evaluating?
  24. 24. HOW TO BE PRODUCTIVE AND BE HAPPY Maximize your quantity and quality of enlightenment. Fail fast. Fail smart. Start small. Scale latter. Use simplest possible method to confirm or reject a hypothesis If the hypothesis is partially confirmed, refine the hypothesis, or improve the method, and repeat. Time spent per experiment should be limited to an hour or similar range. Winning code seems to run for 2 weeks. Productivity = knowledge unit gained / time unit. 25/06/2020 24
  25. 25. ML TALES “Machine learning has very high expectations in terms of  very rapidly producing a lot of successful work and  exerting influence over the firehose of everyone else’s rapidly produced work.  Ilya Sutskever has over 50,000 Google Scholar citations,  None of the four most recent Fields Medal winners has over 5,000 citations.” https://veronikach.com/how-i-fail/how-i-fail-ian-goodfellow-phd14-computer-science/ 25/06/2020 25
  26. 26. WHAT INVOLVED IN AN AI/ML PAPER? Something wrong, doesn’t work, or need to be done (if not then the cost of inaction is too high), or things overlooked by previous work.  Must be related to intelligence (learning, reasoning, etc). A high-level model of the problem (e.g., graphical models, RNN, memory, GAN, etc). A math problem resulting from optimizing the model (e.g., convergence, stability, generalization etc). A math solution (e.g., a theorem, an end-to-end solution) An engineering solution for scaling, speeding up, reducing data needed, etc. Lots of experiments (synthetic and realistic) to:  Validate that the solutions are correct  Validate that the solutions are right  Show evidence that the problems are worth solving.  Hint of the future impact, works or follow-up. 25/06/2020 26
  27. 27. CONSTANT FAILURE AS PART OF WORKFLOW Failure is a rule, not exception See Ian Goodfellow’s case in https://veronikach.com/how-i- fail/how-i-fail-ian-goodfellow-phd14-computer-science/  “Failure of specific ideas is just a constant part of my workflow.” Those not rejected are not trying hard enough  Hinton: models not overfitting are too small  Management wisdom: Staff are always incompetent But don’t come back to your boss/supervisor every single time with a failure report! 25/06/2020 27
  28. 28. WHAT CAN WE LEARN FROM FAILURE? The idea is fundamental wrong, or Hyper-parameters, esp. for deep networks, are wrong, or There are software bugs, or The expectation is wrong. It is important to know when, where and why things don’t add up.  Too good results are usually wrong.  Too bad results are usually wrong. 25/06/2020 28
  29. 29. CHOOSING HIGH IMPACT PROBLEMS Impact to the population (e.g., 100M people or more – Andrew Ng) Start a new paradigm – a new trend Ride the wave No need to have a competitive mind – “Mine beats yours” 25/06/2020 29
  30. 30. CHOOSING IMPORTANT BUT LESS SHINY PROBLEMS Computer vision & NLP are currently shinny areas. Dominated by industry. You can’t compete well without good data, computing resources, incentives and high concentration of talent. Some of the best lie in the intersection of areas. Esp. those with some domain knowledge (which you can learn fast).  Biomedicine, e.g., drug design, genomics, clinical data.  Sustainability, e.g., Earth health, ocean, crop management, land use, climate changes, energy, waste management.  Social good, law, developing countries, access to healthcare and education, clean water, transportation.  Value-alignment, ethics, safety, assurance.  Accelerating sciences, e.g., molecule design/exploration, inverse materials design, quantum worlds. 25/06/2020 30
  31. 31. RICHARD HAMMING … AGAIN "The average scientist [....] spends almost all his time working on problems which he believes will not be important and he also doesn't believe that they will lead to important problems." 25/06/2020 31
  32. 32. HOW TO BE PRODUCTIVE AND BE HAPPY Maximize your quantity and quality of enlightenment. Fail fast. Fail smart. Start small. Scale latter. Use simplest possible method to confirm or reject a hypothesis If the hypothesis is partially confirmed, refine the hypothesis, or improve the method, and repeat. Time spent per experiment should be limited to an hour or similar range. Winning code seems to run for 2 weeks. Productivity = knowledge unit gained / time unit. 25/06/2020 32
  33. 33. PATTERNS REPEATED EVERYWHERE Lots of problems can be formulated in similar ways We often need an intermediate representation, e.g., graphs that can be seen in many fields. Solving the generic problems are often faster and generalizable. See Sutton’s remarks. Many domains can be solved at the same time. After all, we have just one brain that does everything. Methods that survive need to be simple, generic and scalable. PCA, CNN, RNN, SGD are examples. 25/06/2020 33
  34. 34. METHODS VERSUS CONCEPTS Methods come and go, but concepts stay  Decision trees, kernels  Template matching, data geometry  Random forests, gradient boosting  Smoothing, variance reduction  Neural nets  Differentiable programming, function reuse, credit assignment “Machine learning is the new algorithms”  https://nlpers.blogspot.com.au/2014/10/machine-learning-is-new- algorithms.html
  35. 35. PARADIGM, FRAMEWORK, ALGORITHMS Paradigms are major shifts in each science, happening not so often. Could be centuries, decades. Frameworks are specific design within a paradigm (e.g., differentiable programming) Algorithms are specific realization of framework, aiming at efficiency and scalability (e.g., quick sort). 25/06/2020 35
  36. 36. BIG INNOVATION = SYNERGY OF SMALL IMPROVEMENTS Case-study: CRF (2001):  Conditional MRF? Really?  Log-linear parameterization? Since early 1970s.  Maximum entropy? Since 1957.  Forward-backward algorithm? Special case of belief-propagation. Since 1988.  Large-scale convex optimization? Since early 1990s. So why the big deal?  Right time: lot of data, big models, computers are strong enough, ML starts taking off.  Nice combination of many components, each of which was not very well-known in the ML community.  Importantly, the empirical results are fantastic compared to HMM and MEMM. 25/06/2020 36
  37. 37. TREND IN AI/ML: HOMOGENIZATION CNN: convolution RNN: translation Transformer: attention ResNet: iterative refinement FiLM: gated iterative refinement R. Sutton: learning + search D. Knuth: search + sort 25/06/2020 37
  38. 38. RULES OF INNOVATION Filling the missing blocks – axis of innovation  E.g., single  multiple; flat  hierarchy; Combinatorial innovation  E.g., MRF + MaxEnt -> CRF Three words rule:  E.g., PCA, SVM, RNN, CNN, GCN, ICA, HMM, ANN, SNE, LLE, MAP, MRF, CRF, DRF, IBP, HDP, MDP, GMM, GBM, GAN, VAE, NSM, NTM, DNC, DCA, TDA, CCA, LPP. Timing  Too early works are not appreciated.  Too late works are not appreciated. 25/06/2020 38
  39. 39. ACQUIRING A GOOD TASTE Like music, film and fashion, taste is important in research. Andrej Karpathy:  “When you pitch a potential problem to your adviser you’ll […] sense the excitement in their eyes as they contemplate the uncharted territory ripe for exploration. In that split second a lot happens: an evaluation of the problem’s importance, difficulty, its sexiness, its historical context [...]”.  “A lot of the problems I was excited about at the time were in retrospect poorly conceived, intractable, or irrelevant.” A tricky part:  “Even if you think your taste is impeccable, that doesn't matter one bit; to publish papers and earn a Ph.D., you need to do work that resonates with senior researchers in your field.”  A.k.a., the feeling of in-group belonging! 25/06/2020 39
  40. 40. WHAT IS A GOOD PROBLEM? BY ANDREJ KARPATHY, 2016 A fertile ground: a chain of papers. In short, a research program. Think ahead. Pay attention to the field – how it is moving forward.  Latter: Ride the wave. Plays to your adviser’s interests and strengths: Oh, well! More support, promotion, connection, better quality advice, sense of alignment, sense of group contribution. Be ambitious: […] thinking 10x forces you out of the box, to confront the real limitations of an approach, to think from first principles, to change the strategy completely, to innovate. Ambitious but with an attack: important problems != great projects. Hamming: having an attack makes a problem important. 25/06/2020 40
  41. 41. TASTE AND RESEARCH PERSONALITY Scientist – get into the fundamental of things, curious on how things work, without worrying about the application  Mostly single method/problem for a long time. Many possible applications can result. Engineer – want to solve real world problems  Single application for a long time. Many possible methods that apply.
  42. 42. BUILDING GRAND THEORIES More like meta-theories or paradigms (e.g., parallel distributed processing – PDP, AGI) They inspire research programs (e.g., Representation learning, deep learning, machine reasoning) Mostly not harmed by refutation (e.g., Neural networks didn’t work in the 1990s) Have sceptical critics as well as enthusiastic supporters (e.g., symbolic vs sub- symbolic). Why theory?  Those who built theory usually have a better scientific career.
  43. 43. PARADIGM, THEORY AND MODEL Paradigm is like meta-theory, within it there are many competing theories A theory concerns the nature of processes in a particular domain A model is an account of specific tasks and the behaviour associated with them.  Hence, theories normally originate with thought about the [intelligence] mechanisms, and models from attempts to provide specific accounts of empirical data.
  44. 44. WHAT INVOLVED IN AN AI/ML PAPER? Something wrong, doesn’t work, or need to be done (if not then the cost of inaction is too high), or things overlooked by previous work.  Must be related to intelligence (learning, reasoning, etc). A high-level model of the problem (e.g., graphical models, RNN, memory, GAN, etc). A math problem resulting from optimizing the model (e.g., convergence, stability, generalization etc). A math solution (e.g., a theorem, an end-to-end solution) An engineering solution for scaling, speeding up, reducing data needed, etc. Lots of experiments (synthetic and realistic) to:  Validate that the solutions are correct  Validate that the solutions are right  Show evidence that the problems are worth solving.  Hint of the future impact, works or follow-up. 25/06/2020 44
  45. 45. MORE ON AXES OF INVENTION Unstructured  structured (vector & set  seq, tree, grid, graph) Low dim  High dim Dense  Sparse Uninformative prior  informative prior 25/06/2020 45 Homogeneous  Heterogeneous Single  Ensemble One scale  Nesting Static  Dynamic, phase transition Diverse  Universal Non-human  with human;
  46. 46. CASE-STUDY: BERT Non-contextual embedding  Contextual embedding Leverage on lots of engineering advances Engineering feat, NOT a conceptual breakthrough Related to pseudo-likelihood (invented in 1975!) and Markov random fields (invented in the 1970s or even earlier)  Whole sentence language model was done in the 1990s. Can be beaten quickly (e.g., by XLNet, just in a year). 25/06/2020 46
  47. 47. RIDING THE EARLY WAVES 25/06/2020 47 Neural network, cycle 1: 1985-1995  Backprop SVM:1995-2005  Nice theory, strong results for moderate data Ensemble methods: 1995-2005  Nice theory, strong results for moderate data Probabilistic graphical models: 1992-2002  Nice theory, open up new thinking, difficult in practice Topic models & Bayesian non-parametrics: 2002-2012  Nice theory, indirect applications Feature learning: 2005-2015  Little theory, toy results, great excitements Neural network, cycle 2: 2012-present  Differentiable programming, backprop + adaptive SGD  Little theory, strong results for large data For PhD students: be an expert when the trend is at its peak!
  48. 48. MY OWN JOURNEY In 2004 I bet my PhD thesis on “Conditional random fields”. It died when I finished in 2008. In 2007 I bet my theoretical research area on Deep Learning, back then it was Restricted Boltzmann Machines. RBM died around 2012. But the field moves on and now reaches its peak. In 2012 I bet my applied research area on Biomedicine. This is super-hot now. In 2017 I am betting on “learning to reason”  cognitive architectures.
  49. 49. THE CURSE OF TEXT BOOKS Truyen’slaw: “whenever a universally accepted textbook is published, the field is saturated.” kernels (2002), probabilistic graphical models (2006), structured output learning (2007), learning to rank (2009), Bayesian non-parametrics (2012), and deep learning (2016). decision tree (1984), neural nets, cycle 1 (1995), information retrieval (1998), reinforcement learning (1998), multi-agent system (2001), ensemble methods (2001),
  50. 50. HOW MUCH COMPUTING POWER? The larger the better. It gives you freedom.  Sutton (“the bitter lesson”): generic methods + lots of compute win!  See the history of speech recognition, machine translation. However, for faster discovery and learning, better maximize the number of reasonably good GPUs  more people, more parallel experiments. On the other hand, little compute forces us to be smart. There are lots of important problems with lean data. See “You and your research” by Richard Hamming. Our brain uses very little energy (somewhere like 10 Watts). There must be clever ways to arrange computation. 25/06/2020 50
  51. 51. “SO YOU WANT TO BE A RESEARCH SCIENTIST” BY VINCENT VANHOUCKE, NOV 2018 Research is about ill-posed questions with multiple (or no) answers You will spend your entire career working on things that don’t work Your work will probably be obsolete the minute you publish it With infinite freedom comes infinite responsibility Paradoxically, much of research is about risk management You will need to retool often You’ll have to subject yourself to intense scrutiny Your entire career will largely be measured by one number You won’t work a day in your life 25/06/2020 51https://medium.com/s/story/so-you-want-to-be-a-research-scientist-363c075d3d4c
  52. 52. WHAT TO OPTIMIZE? Vincent Vanhoucke:  “Measuring progress in units of learning, as opposed to units of solving, is one of the key paradigm shifts one has to undergo to be effective in a research setting.” Meaning?  Read/think/write more, code less.  Give priority to depth over breath.  If code, choose fast prototyping.  If run experiments, make it small and fail fast, and learn from there.  After acquiring certain empirical experience, don’t just keep doing that. The return will be diminishing. 25/06/2020 52

×