Machine Learning in Big Data


Hadoop Summit 2015


  1. Machine Learning in Big Data: Look forward or be left behind. V. William Porto, Hadoop Summit, June 2015.
  2. Machine Learning: keeping ahead of the curve. Three basic tenets for success in today’s world. Prediction: you need to learn and use what you’ve learned. Optimization: the world is a dynamic place. Automation: because people don’t scale well. (RedPoint Global Inc. 2015 Confidential)
  3. Machine Learning: why bother? “If you have always done it that way, it is probably wrong.” - Charles Kettering
  4. Machine Learning: what really is it all about? Learning vs. instruction: humans learn instinctively, computers not so much. Intelligent systems: memory, prediction (modeling), assessment, feedback, adaptation.
  5. Data Modeling: what, why, how. Regression: what happened in the past. Prediction: what will happen in the future. “Prediction is very difficult, especially if it’s about the future.” - Niels Bohr
  6. Data Modeling: what, why, how. Choices, choices: the wide world of data modeling. Supervised models: you have historical data and known correlated outputs (truth). Unsupervised models: you have historical data, but may not have (or trust) associated outputs.
  7. Supervised vs. Unsupervised Models
  8. Linear Models. Major assumption: the world is linear. Pros: the math is easy; fast execution. Cons: the real world isn’t really linear; not all errors are equal; easy to generate misleading results.
  9. Decision Trees. Major assumption: the world is discrete. Pros: easy to understand; fast execution; no linearity assumptions. Cons: lots of ‘human time’ to create; bias in unbalanced trees; some concepts need very large trees.
  10. Non-Linear Models. Major assumption: data is representative. Pros: ‘universal’ modeling tools; fast execution; no linearity assumptions. Cons: lots of parameters, many techniques; training can be slow; difficult to explain and understand. (Figures: Artificial Neural Network, Bayesian Network.)
  11. Clustering/Segmentation. Basic question: which one describes the data the best? (Figure: raw data.)
  12. Clustering/Segmentation: group think. Collaborative filtering; relationship matrix.
  13. Clustering/Segmentation with Statistics. Statistical techniques: K-Means, Vector Quantization. Pros: relatively simple; statistically backed results. Cons: assumptions about the data distribution; how many clusters really are there? (Figures: K-Means Clustering, Vector Quantization.)
  14. Clustering/Segmentation: data driven. Feature maps. Pros: lets the data speak for itself; useful boundary relationships. Cons: slow to train. (Figure: Customer Demographics.)
  15. Model Selection: how to choose? Basic model type (prediction or segmentation): inputs + correlated outputs, or inputs only? Basic questions: which one to use for my problem? Which parameters? Is this the best choice? Could I do better, and how?
  16. Optimization: making the best choices. Standard (old-school) techniques: PCA, Partial Least Squares, etc. Pros: because the math is easy! Cons: lots of (usually incorrect) assumptions; new data = start from scratch.
  17. Optimization: is that the only way?
  18. Optimization: evolving better solutions. Simulated evolution. Pros: fast, efficient search; always have a solution; arbitrary ‘evaluation’ functions; can start with existing solution(s). Cons: CPU time + memory, but that’s why we have distributed processing!
  19. Optimization: evolving models. What does a ‘solution’ look like? Model type; parameters; data (training + testing). Variation: alter model type, parameters. Assessment: how well does the model work? Selection: survival of the fittest.
  20. Evolutionary Optimization in a Hadoop Environment. Challenges: data partitioning; distributed computation; communication. MapReduce.
  21. Optimization in a Hadoop Environment: what really works. MapReduce: algorithmic task partitioning; iterative tasks vs. fully compartmented tasks; aggregation/distribution tasks; communication / synchronization costs.
  22. ML in a Hadoop Environment: single algorithm architecture. Multi-core machine (per Chu and Kim, et al. 2006, Stanford NLPG). (Diagram labels: ML Algorithm Engine, Master, four Mappers, Data, Reducer; input, map (split data), intermediate data, reduce, query info, result.)
  23. Machine Learning in a Hadoop Environment. ML algorithms: Locally Weighted Linear Regression; K-Means; Nearest Neighbor (KNN); Feed-forward Multi-layer Neural Network (MLP); Principal Component Analysis (PCA); Support Vector Machine (SVM).
  24. Machine Learning in a Hadoop Environment: example. Hadoop multi-core tests (per Chu and Kim, et al. 2006, Stanford NLPG). (Plot: speed increase vs. number of processors.)
  25. ML in a Hadoop Environment: evolutionary optimization architecture. (Diagram labels: Initial (seed) Population, Coordinator, Master (Variation), Offspring Partitions, Map, Reducers; map stage (evaluation), 1st reduction stage (local selection), 2nd reduction stage (global selection), Nth generation solutions.)
  26. Machine Learning: Hadoop, MPI, GPU? Analyze the algorithmic bottlenecks. Use Hadoop / MapReduce if: large number of features; relatively few inter-process communication steps; e.g., on-line training. Use MPI, GPUs if: large number of training samples; e.g., batch training.
  27. Optimization: don’t stop now. Adaptation: update models regularly; drop old data, retrain. Model with different time scales: daily, weekly, seasonal, yearly, multi-year. Automate the process!
  28. A Word about RedPoint Global. Launched 2006. Founded and staffed by industry veterans. Headquarters: Wellesley, Massachusetts. Offices in US, UK, Australia, Philippines. Global customer base. Serves most major industries. Magic Quadrant: Data Quality; Multichannel Campaign Management; Integrated Marketing Management.
  29. Time for Q&A. For more information contact: Bill Porto, RedPoint Global Inc., 36 Washington St., Suite 120, Wellesley Hills, MA 02481, vwporto@redpoint.net
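
Slides 19 and 25 describe evolutionary optimization as variation, assessment, and selection, with evaluation in a map stage followed by local and then global selection. The slides show no code, so the following is only a minimal single-process sketch of that loop under stated assumptions: the "model" is a line y = a*x + b, the score is mean squared error, and the names `assess`, `vary`, and `evolve` are hypothetical. The list partitions stand in for Hadoop map and reduce tasks; nothing here is distributed.

```python
import random

def assess(params, data):
    """Assessment: mean squared error of y = a*x + b (lower is better)."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

def vary(params, rng, step=0.5):
    """Variation: perturb each parameter with Gaussian noise (assumed operator)."""
    return tuple(p + rng.gauss(0.0, step) for p in params)

def evolve(data, seed_population, generations=200, offspring_per_parent=4,
           partitions=2, seed=0):
    rng = random.Random(seed)
    population = list(seed_population)
    size = len(population)
    for _ in range(generations):
        # Master (variation): parents compete alongside their offspring.
        candidates = population + [vary(p, rng) for p in population
                                   for _ in range(offspring_per_parent)]
        # Map stage (evaluation): each partition is scored independently.
        parts = [candidates[i::partitions] for i in range(partitions)]
        scored = [[(assess(c, data), c) for c in part] for part in parts]
        # 1st reduction stage (local selection): keep the best of each partition.
        local = [sorted(part)[:size] for part in scored]
        # 2nd reduction stage (global selection): keep the best overall.
        merged = sorted(s for part in local for s in part)
        population = [c for _, c in merged[:size]]
    return population[0]

# Toy data drawn from y = 2x + 1; the seed population starts at (0, 0).
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
best = evolve(data, [(0.0, 0.0)] * 4)
print(best, assess(best, data))
```

Because the parents stay in the candidate pool, the best score never gets worse from one generation to the next, which matches the "always have a solution" point on slide 18.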
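
Slide 22 shows a master splitting data across mappers and a reducer combining their intermediate results. One way to read that diagram, assuming the summation-form idea from the cited Chu and Kim et al. work, is that each mapper computes partial sums over its split and the reducer adds them up. The sketch below applies this to ordinary (unweighted) least squares for a line y = a*x + b; the function names are hypothetical and Python's built-in `map` and `functools.reduce` stand in for Hadoop tasks.

```python
from functools import reduce

def mapper(chunk):
    """Map: partial sums over one data split of (x, y) pairs."""
    sxx = sum(x * x for x, _ in chunk)
    sx = sum(x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    sy = sum(y for _, y in chunk)
    return (sxx, sx, sxy, sy, len(chunk))

def reducer(left, right):
    """Reduce: add two tuples of partial sums element-wise."""
    return tuple(l + r for l, r in zip(left, right))

def fit_line(data, n_mappers=4):
    # Master: split the data, one chunk per mapper.
    chunks = [data[i::n_mappers] for i in range(n_mappers)]
    sxx, sx, sxy, sy, n = reduce(reducer, map(mapper, chunks))
    # Solve the 2x2 normal equations with the aggregated sums.
    det = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b

# Toy data drawn from y = 3x - 2.
points = [(x, 3 * x - 2) for x in range(8)]
slope, intercept = fit_line(points)
print(slope, intercept)
```

Only the five partial sums cross the mapper/reducer boundary, so the communication cost does not grow with the number of training samples per split.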
