Prepare your data for machine learning

If one thing is crucial in building ML models, it is data preparation: the process of transforming raw data into a state where machine learning algorithms can be run to uncover insights and make predictions. Data preparation involves analysis and depends on the nature of the problem and the particular algorithms. Because it relies on knowledge and experience, it cannot be fully automated, which makes the data scientist the key to success.
ML is trendy, and Microsoft already has more than 10 services to support it. We will focus on tools such as Azure ML Workbench and Python for data preparation, review some common tricks for approaching data, and experiment in Azure ML Studio.

Published in: Software

  1. Prepare Your Data for Machine Learning • Ivelin Andreev
  2. Sponsors • Gold Sponsors • Innovation Sponsor • Bronze Sponsors • PASS
  3. About me • Software Architect @ o 16+ years professional experience • Microsoft Azure MVP • External Expert Eurostars-Eureka, InnoFund Denmark • External Expert Horizon 2020 • Business Interests o Web Development, SOA, Integration o IoT, Machine Learning, Computer Intelligence o Security & Performance Optimization • Contact o ivelin.andreev@icb.bg o www.linkedin.com/in/ivelin o www.slideshare.net/ivoandreev
  4. Agenda • Microsoft Tools for ML • The Data Science Process (Step by Step) • Data Preparation • DEMO
  5. ML tools in the Microsoft World • Data preparation • Building models • Consuming models
  6. Machine Learning and Microsoft • Azure ML: integrated, end-to-end data science and advanced analytics • Highlights o Built on open source technologies (Jupyter Notebook, Spark, Python, Docker) o Execute experiments in isolated environments o GPU-enabled VMs • Microsoft ML related services/tools o Deprecated (now called “Machine Learning Service”, preview): Azure ML Workbench, Azure ML Experimentation Service, Azure ML Model Management Service o Maintained and improved: Azure ML Studio, Visual Studio Code Tools for AI, Data Science VM, Microsoft Cognitive Services, LUIS.ai, Azure Databricks, Libraries for Apache Spark (MMLSpark), Cognitive Toolkit (CNTK), ML Services for SQL Server (R, Python), Azure Batch AI Training
  7. Azure ML Workbench • Desktop application (Windows, macOS) with built-in Jupyter Notebook services and Git integration • End-to-end process support o Model development and experimentation (Python) o Powerful inspectors for data analysis o Data transformations by example o Model history and deployment • Easy to use, but resource hungry • Note: replaced in the Sept 24, 2018 release to make way for an improved architecture (refer to the Azure ML SDK for Python, or Azure Databricks for big datasets)
  8. Azure ML Studio • Visual workspace to build, test and deploy ML solutions • Highlights o Cross-browser drag and drop, no programming o Rich set of modules o Fits beginners and advanced users o Unlimited extensibility (R Script, Python Script) o Enterprise grade cloud service (SLA 99.95%) o ML REST web services consumption o Jupyter Notebook o Azure AI Gallery (9000+ samples) • At what price? o Free plan available (10GB storage, 2 web services, 1000 requests/month) o $10 seat/month + $1 experiment/hour o Recommended: $100/month (unlimited storage, 10 web services, 100K requests)
  9. Azure Data Science VM • Pre-configured cloud environment for AI & Data Science • Highlights o Fully operational environment o 50+ tools for DEV, ML, BigData, Data management o Windows and Linux (Ubuntu/CentOS) o Updated every few months o On-demand elastic capacity o GPU optimized VMs for deep learning o Up to 4 NVIDIA K80 or V100 GPUs o Up to 128 vCPU, up to 6,144 GiB RAM • At what price? o From $11.76/month to $14,314/month
  10. Azure ML Service (preview) • Cloud-based environment to develop, train, test, deploy, manage, and track ML models • Highlights o Model management o Distributed deep learning o Version control and reproducibility o Hybrid deployment (Local, Cloud, Edge) o Automated modeling and tuning (algorithm and parameters) o Latest open source technologies (TensorFlow, PyTorch, Jupyter, Docker) o Scale up or out with large GPU-enabled clusters in the cloud • At what price? o From $23.51/month to $29,143.94/month
  11. The purpose of ML modelling is to: • Generate predictions • Understand true relations
  12. Machine Learning Challenges • Asking the right questions o Typically 1 Model = 1 Question • Requires training data o Real-world data is messy (wrong or missing data) o Feature engineering transforms raw data into predictive features o Feature extraction (e.g. IP address -> population density) o Feature selection keeps the informative features • Overfitting model o “Kicks ass” while training, fails badly on real predictions • Model validation o “Sense” how well the model works on new data
  13. The Data Scientist Job • Appealing o 64% believe they are working in this century’s “sexiest” job • In demand o 90% are contacted at least once a month with a job offer o 50% weekly, 30% several times a week, 35% have <2 years of experience • The dark side… o All models are wrong, some are useful o 80% of the time is data preparation o Real life, not academic problems o Non-linear hypothesis testing o No full automation • No one cares how you do it • Presentation is the key
  14. Iterative ML Process
  15. Data Understanding (Titanic Dataset) • Mosaic plot o Categorical distribution o Visualizes the relation between X and Y o Strong relation = Y-splits are far apart o Conclusion: Women have a higher survival rate • Box plot o Continuous distribution of a numeric variable o IQR = middle 50% o Identify outliers as values outside [Q1 - 1.5 IQR; Q3 + 1.5 IQR] o Conclusion: Passengers with high fares have a higher survival rate • Scatter plot o How much one variable determines another o Conclusion: Infants and men aged 25-45 have a higher survival rate
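
The box-plot outlier rule from this slide can be sketched in a few lines of standard-library Python. This is only an illustration: the fare values are made up rather than taken from the Titanic dataset, the function names are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is one of several common quartile conventions.

```python
import statistics

def iqr_fences(values):
    """Return the (lower, upper) fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # spread of the middle 50% of the data
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]

# Made-up fares for illustration only.
fares = [7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 35.5, 512.33]
print(find_outliers(fares))
```

Whether a flagged value is an error or a genuine extreme (such as a very expensive ticket) still has to be decided by looking at the data.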
  16. Data Preprocessing • Make features usable o Numerical o Categorical (e.g. week day) o PCA dimensionality reduction o Dummy variables • Handle missing data • Normalize data o Standard range of numerical scale (e.g. from [-1000;1000] -> [0;1] or [-1;1]) o The value range influences the importance of a feature compared to the others
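
As a minimal sketch of two of the preprocessing steps above, the following standard-library Python rescales a numeric feature to a target range (min-max normalization) and turns a categorical feature into dummy variables. The function names and the `is_` column prefix are assumptions made for this example; in practice a library such as pandas or scikit-learn would typically do this work.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

def one_hot(values):
    """Encode a categorical feature as one 0/1 dummy variable per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

print(min_max_scale([-1000, 0, 1000]))          # scaled to [0;1]
print(min_max_scale([-1000, 0, 1000], -1, 1))   # scaled to [-1;1]
print(one_hot(["Mon", "Tue", "Mon"]))
```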
  17. Feature Engineering • Increase predictive power by creating features from raw data o Features closely related to the target (predict default -> debt / balance ratio) o Easier interpretation (Date to Year/Month/Day/Hour) o Lag features to “look back” before the date (1, 2,… N days ago) o Rolling aggregates: smoothing over a time window o Categorical features: identify discrete features • Check the Azure Team Data Science Process https://docs.microsoft.com/en-gb/azure/machine-learning/team-data-science-process/create-features
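
Three of the feature ideas above (date parts, lag features, rolling aggregates) can be sketched with plain Python lists. The helper names are hypothetical and the series is made up; positions without enough history are filled with `None`, which is one possible convention among several.

```python
from datetime import datetime

def date_parts(timestamp):
    """Split an ISO timestamp into Year/Month/Day/Hour features."""
    dt = datetime.fromisoformat(timestamp)
    return {"year": dt.year, "month": dt.month, "day": dt.day, "hour": dt.hour}

def lag(series, k):
    """Value observed k steps earlier; the first k positions have no history."""
    return [None] * k + series[:len(series) - k]

def rolling_mean(series, window):
    """Mean over the current value and the window - 1 values before it."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = series[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

daily_sales = [10, 20, 30, 40]  # made-up values
print(date_parts("2018-09-24T13:45:00"))
print(lag(daily_sales, 1))
print(rolling_mean(daily_sales, 2))
```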
  18. Digital Media Feature Engineering • Note: All information is encoded in the digital media • Images o Step 1: Colour statistics, EXIF metadata, edges, shapes o Step 2: Extract knowledge into a fixed set of numeric characteristics • Text o Step 1: Bag of words, N-grams, term frequency, topic modelling, stemming; named entity recognition (e.g. Wikipedia) o Step 2: Extract knowledge into a fixed set of numeric characteristics
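
For the text case, the two steps can be illustrated with N-grams and term frequency: first tokenize the text, then map it onto a fixed-length numeric vector. This is a simplified sketch using only the standard library; the tokenizer regex, the function names and the vocabulary are assumptions for the example, and stemming, topic modelling and entity recognition are left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequency_vector(text, vocabulary):
    """Step 2: map a text to a fixed-length vector of relative term frequencies."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts[term] / total for term in vocabulary]

vocabulary = ["the", "cat", "fish"]  # hypothetical fixed vocabulary
print(ngrams(tokenize("The cat sat down"), 2))
print(term_frequency_vector("the cat the dog", vocabulary))
```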
  19. Feature Selection: select the most predictive features • ML handles thousands of parameters, but not all parameters are equal • Adding features: a common approach to increase accuracy • Poor performance: correlated features could lead to poor model performance • Overfitting: learning relations in too much detail may lead to overfitting
  20. Selecting Good Features • Motivation o Not only prediction but also identification of predictive features o Computational costs are related to the number of features o Limit external sensors and data sources • Approach o Trying all combinations of features? (That would be infeasible) • Methods o Forward selection & backward elimination o Filter: independent from the ML algorithm o Embedded: built-in search for predictive features in the ML algorithm o Wrapper: measure feature usefulness while ML training
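
Forward selection can be sketched as a greedy loop: start with no features and repeatedly add the one that improves the score most, stopping when nothing helps. The `score` callback below is a hypothetical stand-in; in a real wrapper method it would train and validate a model on the given feature subset. The usefulness numbers are invented purely to make the example run.

```python
def forward_selection(features, score, max_features=None):
    """Greedily add the feature that improves score(subset) the most."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every candidate subset that adds exactly one more feature.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Invented usefulness values; a real score would come from model validation.
usefulness = {"age": 0.30, "fare": 0.20, "ticket_id": 0.01}

def toy_score(subset):
    # Reward useful features, charge a small cost per feature used.
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)

print(forward_selection(list(usefulness), toy_score))
```

The greedy search evaluates at most a quadratic number of subsets instead of all combinations, at the price of possibly missing features that are only useful together.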
  21. Tuning Model Parameters • Model parameters control inner behaviour o The more sophisticated the algorithm, the more parameters o e.g. Locally Deep SVM with kernel: kernel type, kernel coefficient • How does parameter tuning work? 1. Choose a metric for evaluation (AUC for classification, R2 for regression, etc.) 2. Select the parameters to optimize 3. Define a grid as the Cartesian product of the parameter value arrays 4. For each combination, cross-validate on the training set 5. Select the parameters with the best evaluation • Note: Expected improvement is 3%-8%
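
The five tuning steps can be sketched as a small grid search. This is an assumption-laden illustration, not how Azure ML Studio implements it: `evaluate` is a hypothetical callback that would train on the `train` indices and return the chosen metric on the `valid` indices, and the parameter names and toy metric are invented so that the example runs without a model.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train, valid) index lists for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, valid

def grid_search(param_grid, evaluate, n_samples, k=5):
    """Try every parameter combination and keep the best mean CV score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # Step 3: the grid is the Cartesian product of the value arrays.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Step 4: cross-validate this combination on the training set.
        scores = [evaluate(params, train, valid)
                  for train, valid in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        # Step 5: remember the combination with the best evaluation.
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

def toy_evaluate(params, train, valid):
    # Stand-in metric that ignores the data and peaks at c=1, coefficient=0.1.
    return -(abs(params["c"] - 1) + abs(params["coefficient"] - 0.1))

grid = {"c": [0.1, 1, 10], "coefficient": [0.01, 0.1]}
print(grid_search(grid, toy_evaluate, n_samples=20, k=5))
```

The number of model fits grows as the product of the array lengths times k, which is why only a few parameters are usually selected for optimization.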
  22. False Alarms • False alarms have serious impact o Degraded confidence in the system o Loss of revenue o Loss of brand image
  23. Performance Metrics • Regression model o Root Mean Squared Error (RMSE) o Coefficient of Determination, R2 ∈ [0;1] • Multi-class classification model o Confusion matrix • Binary classification model o Accuracy based on correct answers o Area under ROC curve (AUC) o Threshold o Precision = TP / (TP + FP) o Recall = TP / (TP + FN) o Cost-Balanced (F1)
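
The formulas on this slide translate directly into code. The sketch below uses the standard definitions (F1 as the harmonic mean of precision and recall, R2 as 1 minus the ratio of residual to total sum of squares); the function names and the example labels are made up. Note that with this definition R2 can drop below 0 when a model predicts worse than the mean of the targets.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = their harmonic mean."""
    tp, fp, fn, _tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def rmse(y_true, y_pred):
    """Root mean squared error of a regression model."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

labels = [1, 1, 1, 0, 0, 0, 0, 0]       # made-up ground truth
predictions = [1, 1, 0, 1, 0, 0, 0, 0]  # made-up model output
print(precision_recall_f1(labels, predictions))
```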
  24. Handling Imbalanced Data • Imbalanced: far more examples of one class than of the others (e.g. a minority class of 0.001%) • Errors are not the same o Prediction of the minority class (failures) is more important o Asymmetric cost (a false negative can cost more than a false positive) • Compromised performance of standard ML algorithms o For a 1% minority class, an accuracy of 99% does not mean a useful model o The PR-curve is better for imbalanced data • Oversampling o SMOTE allows better learning o Generates examples by combining features of the target with features of its neighbours
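
The idea behind SMOTE, generating a new minority example between an existing one and one of its neighbours, can be sketched as follows. This is a simplified illustration rather than the reference SMOTE algorithm: the function name, the parameters `k` and `seed`, and the sample points are assumptions, and it needs at least two minority points. A maintained implementation (for example in the imbalanced-learn package) would normally be used instead.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Made-up minority-class samples with two numeric features each.
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
for point in smote_like(minority_points, n_new=3):
    print(point)
```

Because each synthetic point is an interpolation, it always stays inside the region spanned by real minority examples, which is what distinguishes this from simply duplicating rows.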
  25. Takeaways • Team Data Science Process https://azure.microsoft.com/en-gb/documentation/learning-paths/data-science-process/ • ML in the Microsoft World https://docs.microsoft.com/en-us/azure/machine-learning/ • Python for AI https://wiki.python.org/moin/PythonForArtificialIntelligence • Data Science Blog https://data-flair.training/blogs/category/machine-learning/ • Starter Books
  26. Azure ML Studio • Azure ML Workbench