
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, Delta Lake, and Emerging Tools

Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.

From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach in which data plays a central role in our everyday lives.

As the volume and variety of data garnered from myriad sources continue to grow at an astronomical scale, and as cloud computing offers cheap compute and storage at scale, data platforms must keep pace: they have to process, analyze, and visualize data at scale and speed, and with ease. This requires paradigm shifts in how data is processed and stored, and in the programming frameworks that give developers access to these platforms.

In this talk, we will survey some emerging technologies that address the challenges of data at scale: how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they help future data scientists get started quickly.

In particular, we will examine in detail two open-source tools: MLflow (for managing the machine learning development lifecycle) and Delta Lake (for reliable storage of structured and unstructured data).

We will also touch on other emerging tools, such as Koalas, which lets data scientists do exploratory data analysis at scale in a language and framework they already know, as well as emerging data + AI trends in 2021.

You will come away understanding the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open-source tools are at your disposal for doing data science and machine learning at scale.



  1. The Future? of Data Science/ML at Scale: A Look at Lakehouse, Delta Lake & MLflow. Jules S. Damji, Databricks, Inc. 2/10/2021 @ UCB iSchool. http://dbricks.co/ucbi-webinar @2twitme
  2. Talk Outline § What is the data problem? § What impedes advanced analytics? § Past & present data architectures: what do they tell us about the future? § What is the "Lakehouse" paradigm? § Delta Lake & MLflow
  3. About Databricks: a cloud data platform for analytics, engineering, and data science. Runs a fleet of millions of VMs to process exabytes of data per day, for >7,000 customers.
  4. The biggest challenges with data today: data quality, staleness, and data volume and scale. How to grapple with data beyond 2020 . . .
  5. Data Analyst Survey: 60% reported data quality as their top challenge; 86% of analysts had to use stale data, with 41% using data that is more than two months old; 90% regularly had unreliable data sources over the last 12 months.
  6. Data Scientist Survey (bar chart; the values shown are 75%, 51%, and 42%, with the category labels rendered only in the slide graphic).
  7. Getting high-quality, timely data is hard… but it's partly a problem of our own making!
  8. The Evolution of Data Management
  9. 1980s: Data Warehouses § ETL data directly from operational database systems § Purpose-built for SQL analytics & BI: schemas, indexes, caching, etc. § Powerful management features such as ACID transactions and time travel. (Diagram: Operational Data → ETL → Data Warehouses → BI/Reports)
  10. 2010s: New Problems for Data Warehouses § Could not support rapidly growing unstructured and semi-structured data: time series, logs, images, documents, etc. § High cost to store large datasets § No support for data science & ML
  11. 2010s: Data Lakes § Low-cost storage to hold all raw data (e.g., Amazon S3, HDFS) ▪ $12/TB/month for the S3 infrequent-access tier! § ETL jobs then load specific data into warehouses, possibly for further ELT § Directly readable by ML libraries (e.g., TensorFlow, PyTorch) thanks to open file formats. (Diagram: structured, semi-structured & unstructured data lands in the data lake; data preparation/ETL feeds real-time databases and data warehouses for BI and reports; data science and machine learning read the lake directly.)
  12. Problems with Today's Data Lakes: cheap to store all the data, but the system architecture is much more complex! Data reliability suffers: § Multiple storage systems with different semantics, SQL dialects, etc. § Extra ETL steps that can go wrong. Timeliness suffers and cost is high: § Extra ETL steps before data is available in data warehouses § Continuous ETL, duplicated storage
  13. Lakehouse Vision § Data lake storage for all data (structured, semi-structured & unstructured) § A single platform for every use case: BI, reports, streaming analytics, data science, machine learning § Management features (transactions, versioning, etc.)
  14. Lakehouse Systems: implement data warehouse management and performance features on top of directly accessible data in open formats § Cheap storage in open formats for all data § Direct access to data files (SQL, BI, data science, ML) § A management & performance layer (ETL) above data lake storage. Can we get state-of-the-art performance & governance features with this design?
  15. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Declarative access for data science & ML
  16. Metadata Layers for Data Lakes § Track which files are part of a table version to offer rich management features like ACID transactions ▪ Clients can then access the underlying files at high speed ▪ Optimistic concurrency § Implemented in multiple systems, e.g., Delta Lake. (Diagram: a client application asks the metadata layer which files are part of table v1 and then reads those files, f1 and f2, directly from the data lake.)
  17. Example with Delta Lake: the _delta_log directory tracks which files are part of each version of the "events" table (e.g., v2 = file1, file2, file3). Query: delete all events data about customer #17. file1.parquet and file3.parquet are rewritten as file1b.parquet and file3b.parquet, and a new log file is added atomically: v3 = file1b, file2, file3b. Clients now always read a consistent table version: a client reading v2 of the log sees file1, file2, file3 (no delete), while a client reading v3 sees file1b, file2, file3b (customer #17's data deleted). See our VLDB 2020 paper for details.
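     The consistent-version behavior is easy to see from PySpark. A minimal sketch, assuming a Spark session configured with Delta Lake support and a hypothetical table path /data/events (versionAsOf is Delta Lake's time-travel read option):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Read the latest committed snapshot of the table.
        latest = spark.read.format("delta").load("/data/events")

        # Time travel: read the table exactly as it was at version 2 of the log.
        v2 = spark.read.format("delta").option("versionAsOf", 2).load("/data/events")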
  18. Other Management Features with Delta Lake § Streaming I/O: treat a table as a stream of changes, removing the need for message buses like Kafka § INSERT, UPSERT, DELETE & MERGE § Time travel to an old table version § Schema enforcement & evolution § Expectations for data quality. Slide examples (SQL constraints and a streaming read; see also the sketch below):
      CREATE TABLE orders (
        product_id INTEGER NOT NULL,
        quantity INTEGER CHECK(quantity > 0),
        list_price DECIMAL CHECK(list_price > 0),
        discount_price DECIMAL CHECK(discount_price > 0 AND discount_price <= list_price)
      );

      spark.readStream.format("delta").table("events")
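     A sketch of the DML features above using the Delta Lake Python API. The table name "events", the key column "event_id", and the "updates" DataFrame are illustrative; DeltaTable.delete and DeltaTable.merge are part of the Delta Lake package:

        from delta.tables import DeltaTable

        events = DeltaTable.forName(spark, "events")  # assumes an existing Delta table

        # DELETE: remove all rows for one customer; only the affected files are rewritten.
        events.delete("customer_id = 17")

        # MERGE (upsert): update matching rows, insert the rest.
        (events.alias("t")
               .merge(updates.alias("u"), "t.event_id = u.event_id")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())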
  19. Adoption § Already >50% of the Databricks I/O workload (exabytes/day) § Broad industry support (partner logos grouped by: ingest from, query from, store data in)
  20. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Optimized access for data science & ML
  21. Lakehouse Engine Optimizations: optimizations over directly accessible file storage can enable high SQL performance § Caching hot data in RAM/SSD, possibly transcoded § Data layout within files to cluster co-accessed data (e.g., sorting or multi-dimensional clustering) § Auxiliary data structures like statistics and indexes § Vectorized execution engines for modern CPUs. Together these minimize I/O for cold data, which is the dominant cost, and match data warehouses on hot data. New query engines such as Databricks Delta Engine use these ideas.
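     A hedged sketch of two of these techniques as exposed through Delta Lake on Databricks; the table and column names are illustrative, and OPTIMIZE ... ZORDER BY and CACHE SELECT are Databricks SQL commands:

        # Multi-dimensional clustering: compact the table and Z-order it by
        # commonly filtered columns so co-accessed data lands in the same files.
        spark.sql("OPTIMIZE events ZORDER BY (date, event_type)")

        # Pre-warm the SSD cache for hot data (Databricks Delta cache).
        spark.sql("CACHE SELECT * FROM events WHERE date >= '2021-01-01'")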
  22. Example: Databricks Delta Engine, a vectorized engine for Spark SQL that uses SSD caching, multi-dimensional clustering, and zone maps over Parquet files. (Bar chart: TPC-DS 30 TB Power Test duration in seconds for DW1 through DW4 and Delta Engine (on-demand); the values shown are 2996, 7143, 5793, 37283, and 3302, with the system-to-bar mapping visible only in the graphic.)
  23. Key Technologies Enabling Lakehouse 1. Metadata layers for data lakes: add transactions, versioning & more 2. New query engine designs: great SQL performance on data lake storage systems and file formats 3. Optimized access for data science & ML
  24. DS/ML/DL over a Lakehouse § ML frameworks already support reading Parquet, ORC, etc. § New declarative interfaces for I/O enable further optimization § Example: the Spark DataFrame API compiles to relational algebra. The user program builds a lazily evaluated query plan (PROJECT(NULL → 0) over PROJECT(date, zip, …) over SELECT(kind = "buyer") on users), which the client library then executes optimally using caches, statistics, indexes, etc.:
      users = spark.table("users")
      buyers = users[users.kind == "buyer"]
      train_set = buyers["date", "zip", "price"].fillna(0)
      model.fit(train_set)
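     Because the plan is lazy, none of the lines above touch data until model.fit runs. A small sketch of inspecting what Spark will execute (DataFrame.explain is a standard Spark API):

        # Print the parsed, analyzed, optimized, and physical plans
        # before any data is actually read.
        train_set.explain(True)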
  25. Summary: Lakehouse systems combine the benefits of data warehouses & lakes while simplifying enterprise data architectures. We believe they'll take over in industry, as most enterprise data is already in lakes. (Diagram: structured, semi & unstructured data serving BI, reports, data science, and machine learning from one platform.)
  26. Summary (continued): the result is simplified data architectures that improve both reliability & freshness.
  27. Machine Learning Development is Complex
  28. Traditional Software vs. Machine Learning. Traditional software: § Goal: meet a functional specification § Quality depends only on code § Typically pick one software stack with fewer libraries and tools § Limited deployment environments. Machine learning: § Goal: optimize a metric (e.g., accuracy); constantly experiment to improve it § Quality depends on input data and tuning parameters § Over time data changes and models drift § Compare + combine many libraries and models § Diverse deployment environments
  29. But Building ML Applications is Complex. Raw data flows through data prep, training (tuning μ, λ, θ at scale), and deployment, with data engineers, ML engineers, and application developers all involved, plus model exchange and governance. ▪ Continuous, iterative process ▪ Dependent on reliable data ▪ Constantly update data & metrics ▪ Many teams and systems involved
  30. MLflow: An Open-Source ML Platform, usable from any language and with any ML library, spanning raw data, data prep, training, and deployment. Components: Tracking (experiment management), Projects (reproducible runs), Models (model packaging and deployment), Model Registry (model management).
  31. Key Concepts in MLflow Tracking § Parameters: key-value inputs to your code § Metrics: numeric values (can update over time) § Tags and notes: information about a run § Artifacts: files, data, and models § Source: what code ran? § Version: what version of the code? § Run: an instance of code executed by MLflow § Experiment: a collection of runs, {Run, …, Run}
  32. Model Development with MLflow is Simple! Track parameters, metrics, artifacts, output files & code version, then search using the UI ($ mlflow ui) or the API:
      import mlflow

      # load_text, extract_ngrams, train_model, and compute_accuracy are the
      # slide's illustrative helpers.
      data = load_text(file_name=file)
      ngrams = extract_ngrams(data, N=n)
      model = train_model(ngrams, learning_rate=lr)
      score = compute_accuracy(model)

      with mlflow.start_run():
          mlflow.log_param("data_file", file)
          mlflow.log_param("n", n)
          mlflow.log_param("learn_rate", lr)
          mlflow.log_metric("score", score)
          mlflow.sklearn.log_model(model, "model")
  33. Tracking for ML Experiments: easily track parameters, metrics, and artifacts through built-in integrations with popular ML libraries.
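     For example, a minimal autologging sketch with the scikit-learn integration. mlflow.sklearn.autolog() is MLflow's API; the dataset and model below are illustrative:

        import mlflow
        import mlflow.sklearn
        from sklearn.datasets import load_diabetes
        from sklearn.ensemble import RandomForestRegressor

        mlflow.sklearn.autolog()  # auto-log params, metrics, and the fitted model

        X, y = load_diabetes(return_X_y=True)
        with mlflow.start_run():
            RandomForestRegressor(n_estimators=100, max_depth=6).fit(X, y)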
  34. MLflow Components § Tracking: record and query experiments: code, data, config, and results § Projects: package data science code in a format that enables reproducible runs on any platform § Models: deploy machine learning models in diverse serving environments § Model Registry: store, annotate, and manage models in a central repository. mlflow.org | github.com/mlflow | twitter.com/MLflow | databricks.com/mlflow
  35. MLflow Projects: a project spec ties together code, data, config, and dependencies to enable both local and remote execution (see the sketch below).
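     A sketch of launching a reproducible run from Python. mlflow.projects.run() is the MLflow Projects API; the GitHub URI and the "alpha" parameter come from the public mlflow-example project:

        import mlflow

        submitted = mlflow.projects.run(
            uri="https://github.com/mlflow/mlflow-example",
            entry_point="main",
            parameters={"alpha": 0.5},
        )
        print(submitted.run_id)  # the tracking run created for this execution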
  36. MLflow Components § Tracking: record and query experiments: code, data, config, and results § Projects: package data science code in a format that enables reproducible runs on any platform § Models: deploy machine learning models in diverse serving environments § Model Registry: store, annotate, and manage models in a central repository. mlflow.org | github.com/mlflow | twitter.com/MLflow | databricks.com/mlflow
  37. MLflow Models: a standard format for ML models. A saved model can carry multiple "flavors" (Flavor 1, Flavor 2, …) that bridge ML frameworks on one side and inference code, batch & stream scoring, and serving tools on the other.
  38. Model Flavors Example:
      model = mlflow.pyfunc.load_model(model_uri)
      model.predict(input_df)  # input_df is a pandas DataFrame
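     A fuller, hedged version of the snippet above; the registry URI and input columns are illustrative, and pyfunc provides a flavor-independent predict() over a pandas DataFrame:

        import mlflow.pyfunc
        import pandas as pd

        # Load whatever flavor backs this model through the generic pyfunc interface.
        model = mlflow.pyfunc.load_model("models:/WeatherForecastModel/production")
        preds = model.predict(pd.DataFrame({"temp_c": [21.0], "humidity": [0.4]}))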
  39. MLflow Components § Tracking: record and query experiments: code, data, config, and results § Projects: package data science code in a format that enables reproducible runs on any platform § Models: deploy machine learning models in diverse serving environments § Model Registry: store, annotate, and manage models in a central repository. mlflow.org | github.com/mlflow | twitter.com/MLflow | databricks.com/mlflow
  40. The Model Management Problem: when you work in a large organization with many models and many data teams, management becomes a major challenge • Where can I find the best version of this model? • How was this model trained? • How can I track docs for each model? • How can I review models? • How can I integrate with CI/CD? (Diagram: model developer, reviewer, and model user, all facing these questions.)
  41. VISION: Centralized and collaborative model lifecycle management. A tracking server records parameters, metrics, artifacts, models, and metadata; the Model Registry moves models through staging, production, and archived stages for data scientists and deployment engineers, feeding automated jobs, REST serving, downstream users, and reviewers plus CI/CD tools.
  42. MLflow Model Registry • A repository of named, versioned models with controlled access • Track each model's stage: none, staging, production, or archived • Easily inspect a specific version and its run info • Easily load a specific version • Provides model description, lineage, and activity
  43. Model Registry Workflow API (model developer → registry → downstream users, automated jobs, REST serving, reviewers & CI/CD tools):
      # Register a logged model, or log and register in one step.
      mlflow.register_model(model_uri, "WeatherForecastModel")
      mlflow.sklearn.log_model(model, artifact_path="sklearn_model",
                               registered_model_name="WeatherForecastModel")

      # Promote a specific version to production.
      client = mlflow.tracking.MlflowClient()
      client.transition_model_version_stage(name="WeatherForecastModel",
                                            version=5, stage="Production")

      # Load and use the current production version.
      model_uri = "models:/{model_name}/production".format(
          model_name="WeatherForecastModel")
      model_prod = mlflow.sklearn.load_model(model_uri)
      model_prod.predict(data)
  44. Model Registry: Webhooks (just launched). Databricks webhooks allow setting callbacks on registry events, such as stage transitions, to run CI/CD tools. Example events: VERSION_REGISTERED (MyModel, v2); TRANSITION_REQUEST (MyModel, v2, Staging→Production); TAG_ADDED (MyModel, v2, BacktestPassed).
  45. MLflow Model Registry Recap • Central repository: uniquely named registered models for discovery across data teams • Model registry workflow: UI and API for registry operations • Model versioning: multiple versions of a model in different stages • Model stages: transitions across none, staging, production, and archived • CI/CD integration: easily load a specific version for testing and inspection, with webhooks for event notifications • Model lineage: model description, lineage, and activity
  46. Summary § Lakehouse systems combine the benefits of data warehouses & lakes while simplifying enterprise data architectures § This simplified architecture, with Delta Lake as the Lakehouse storage layer and MLflow for the ML lifecycle, helps scale advanced analytics workloads § Other tools include Koalas (scalable EDA)
  47. Learn More § Download and learn Delta Lake at delta.io § Download and learn MLflow at mlflow.org § Download and learn Koalas at the Koalas GitHub repository
  48. Resources & Fun to Read § Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics § What is Lakehouse and why § What is Delta Lake and why § What is MLflow and why § We don't need data scientists, we need data engineers § Data Science is different now
  49. Thank you! Q & A. jules@databricks.com @2twitme https://www.linkedin.com/in/dmatrix/
