Spark Summit Keynote by Suren Nathan


  1. Data Profiling and Pipeline Processing with Spark – A Journey • Suren Nathan, Synchronoss
  2. Who am I • Sr. Director, Big Data Platform and Analytics Framework at Synchronoss • CTO at Razorsight (acquired by Synchronoss) • Worked in analytics and decision support systems for more than 15 years • Passionate about solving business problems by leveraging the latest technology
  3. Synchronoss provides Personal Cloud and Activation Platforms to Tier One operators, MSOs, and enterprises around the globe
  4. Synchronoss Integrated Cloud Products • Mobile Content Transfer • Personal Cloud • Device Activation • Cloud Account Provisioning • On-Boarded Welcome
  5. Synchronoss Connects Operators to their Customers • Online and Device: ACTIVATION • Back-up, Sync and Share: CLOUD • Internet of Things: Integrated Life
  6. Big Data @ Synchronoss, sample numbers at one Tier 1 customer: • 30M registered users • 14M monthly active users • 8M daily active users • Up to 215TB of ingest per day • 62PB of content stored • 50 billion user content files • Ingest of 1PB per week • 4+ star rated apps
  7. What do we do? • Big Data Analytics Platform Group • Implement a scalable big data technology platform to help deliver consistent analytics • Platform deployed in a private cloud and AWS
  8. Data Pipeline Process: Ingest Data -> Profile Data -> Parse Data -> Transform Data -> Enrich Data -> Aggregate Data -> Perform Analysis, Load Index Store, Feed EDW
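The stage chain above can be sketched as a sequence of composable functions. This is an illustrative sketch only: the comma-separated record format, the stage bodies, and the `run_pipeline` helper are assumptions, not the production implementation.

```python
# Illustrative sketch of the pipeline stages named above; each stage is a
# plain function, and the comma-separated record format is an assumption.

def ingest(raw):       # drop empty input lines
    return [line for line in raw if line.strip()]

def profile(recs):     # in the real pipeline this emits metrics as a side output
    return recs

def parse(recs):       # split delimited records into fields
    return [r.split(",") for r in recs]

def transform(recs):   # normalize whitespace in every field
    return [[f.strip() for f in r] for r in recs]

def enrich(recs):      # append a derived attribute (hypothetical)
    return [r + ["enriched"] for r in recs]

def aggregate(recs):   # final roll-up fed to analysis / index store / EDW
    return {"rows": len(recs)}

def run_pipeline(raw):
    data = raw
    for stage in (ingest, profile, parse, transform, enrich, aggregate):
        data = stage(data)
    return data

result = run_pipeline(["a, b", "", "c, d"])
```

Keeping each stage a pure function over records is what lets the same chain run as batch or stream workloads later in the talk.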
  9. Our Data Pipeline Journey
  10. Data Pipeline – V1: process-centric ETL (Source Data -> Staging -> ETL -> EDW) • Multiple custom ETLs separated from the data layer • SMP architecture, not distributed • Long-running batch workloads • Contention and bottlenecks with increased data volume • No support for unstructured data • Cannot retain historical data • $$$ • >1 year • Inflexible
  11. Data Pipeline – V2: MPP appliance (Source Data -> Staging -> ETL -> EDW) • ETLs closer to the data • High performance, but expensive • Batch workloads, with reduced latencies • Unable to handle unstructured data • Prohibitive storage costs • $$$$ • 6 months+ • Still inflexible
  12. Data Pipeline – V3: option skipped • Did not foresee a huge improvement • Batch workloads only • Slow performance with MapReduce • Lack of resources and a skills gap • Lack of consistency • Too many tools • $$ • 1 year+ • Risky
  13. Data Pipeline – V4 (Source Data) • ETLs closer to the data • Batch and stream workloads • Superior performance • Abstracted features via a framework • Components and standards • Multiple language support • Simplified design • $ • <1 month • Highly flexible
  14. Data Profiling • "Put all the data in the lake, man." • "What's in these data sets?" • "More data is better. Work with the population and not a sample." -- Data Scientists
  15. Why Data Profiling? • Find out what is in the data • Get metrics on data quality • Assess the risk involved in creating business rules • Discover metadata, including value patterns and distributions • Understand data challenges early to avoid delays and cost overruns • Improve the ability to search the data
  16. Business Challenge • Analysts spend 80-90% of their time on data munging • Current approaches require multiple manual touch points and processes • Lost opportunity due to lengthy project time frames
  17. Typical Scenario • Data is too large to view in Excel or Notepad • Data has to be loaded into a database for profiling • Data cannot be loaded into a database unless the fields are known • File formats are not right and specifications are incorrect • Distribution, space, multiple touch points, moving files here and there • Too many dependencies, wasted time
  18. What do we need? Speed, Agility & Automation
  19. Data Profiler Requirements • Profile data from the data lake • Validate and preview data • Review statistics • Create metadata • Create a downstream schema
  20. Spark to the Rescue • Data files become RDDs (columns C1 … Cn) • Check the types • Check the values • Calculate metrics • Generate metadata • Dynamically built execution graph (map -> map) • Built-in transformations (unique, get first, etc.) • In-memory execution provides speed
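A minimal sketch of the per-column type and value checks described above, in plain Python rather than Spark (in Spark the same per-column logic would run as map transformations over RDDs). The column layout, sample rows, and the notion of a "dominant" type are illustrative assumptions.

```python
# Hypothetical sketch of the per-column checks: infer each field's type,
# count missing values, and count distinct values.
from collections import Counter

def infer_type(value):
    """Classify a raw string field as int, float, or string."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "string"

def profile_column(values):
    """Profile one column: dominant type, missing count, distinct count."""
    types = Counter(infer_type(v) for v in values if v != "")
    return {
        "dominant_type": types.most_common(1)[0][0] if types else "unknown",
        "missing": sum(1 for v in values if v == ""),
        "distinct": len(set(values)),
    }

rows = [["1", "2.5", "a"], ["2", "x", "b"], ["", "3.1", "a"]]
columns = list(zip(*rows))                 # column-wise view of the file
profile = [profile_column(col) for col in columns]
```

The dominant-type tally is also how mixed columns (like the `"x"` among floats above) surface as data quality issues rather than load failures.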
  21. Execution Flow and Software Stack (architecture diagram) • Razorsight application level: Data Profiler UI, web server, metadata repository, Spark Data Profiler • System application level: MapR cluster, Spark, Spark monitoring UI, MapR UI • OS/file system level: MapR FS (M7), NFS • Hardware infrastructure level • The data lake is the repository location for data
  22. Univariate Statistics, outputs for numeric and non-numeric values • Histograms • Count of missing values • Count of non-missing values • Mean • Variance • Standard deviation • Minimum • Maximum • Range • Mode • Median • Q1 value • Q3 value • Interquartile range • Skewness • Kurtosis
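The numeric metrics in this list reduce to standard one-column computations, sketched below. Population (biased) moment formulas and the "exclusive" quantile method are assumptions on our part; the production definitions may differ.

```python
# Sketch of the univariate metrics above for one numeric column.
import statistics

def univariate(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n       # population variance
    sd = var ** 0.5
    q1, median, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    m3 = sum((x - mean) ** 3 for x in values) / n        # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n        # fourth central moment
    return {
        "count": n,
        "mean": mean, "variance": var, "stddev": sd,
        "min": min(values), "max": max(values),
        "range": max(values) - min(values),
        "mode": statistics.mode(values),
        "median": median, "q1": q1, "q3": q3, "iqr": q3 - q1,
        "skewness": m3 / sd ** 3 if sd else 0.0,         # standardized moments
        "kurtosis": m4 / var ** 2 if var else 0.0,
    }

stats = univariate([1, 2, 3, 4, 5])
```

Every metric here is a sum or a sorted-order lookup, which is exactly what makes the whole set cheap to compute distributed across partitions.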
  23. Data Profiler Web Application
  24. Metadata and DDL
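Turning profiled metadata into a downstream schema might look like the sketch below. The table name, column tuples, and SQL type mapping are all hypothetical; the real profiler's DDL dialect is not shown in the slides.

```python
# Hypothetical sketch: generate a CREATE TABLE statement from profiled
# metadata. TYPE_MAP and the example columns are illustrative assumptions.

TYPE_MAP = {"int": "BIGINT", "float": "DOUBLE", "string": "VARCHAR(255)"}

def to_ddl(table, metadata):
    """metadata: list of (column_name, inferred_type, nullable) tuples."""
    cols = ",\n  ".join(
        f"{name} {TYPE_MAP[typ]}" + ("" if nullable else " NOT NULL")
        for name, typ, nullable in metadata
    )
    return f"CREATE TABLE {table} (\n  {cols}\n);"

ddl = to_ddl("usage_events", [
    ("user_id", "int", False),
    ("bytes_ingested", "float", True),
    ("device_model", "string", True),
])
```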
  25. Advantages • Source data stays in the data lake • All profiling is done in the data lake • No manual movement of data • Profile a sample or the full data set • Integrated creation of metadata for transformation and enrichment • Send clean data to downstream processes
  26. Results • Improved data analysis time from weeks to hours • Average improvement of 80% in the data pipeline process • Identified data quality issues well ahead of time • Empowered business analysts to perform the work
  27. Framework Layers • Layer 1, Infrastructure: Data Ingestion (structured | unstructured | batch | streaming; SFTP, NDM, network path, social media, stream, email) and Data Lake (secure repository) • Layer 2, Data Management: Data Preparation (data health | cleansing | pruning | transformation | univariate analysis) • Layer 3, Modeling: Data Analytics (descriptive | predictive | bivariate | multivariate) • Layer 4, Integration: Data Services (RESTful | SOA) • Layer 5, Business Insight and Actions: Data Visualization (dashboards | ad hoc queries | KPIs | alerts)
  28. Framework Components • Ingestion: multiple source channels, batch/real time, data validation, compression/encryption • Profiling: data health check, summary statistics, scrubbing/cleansing, metadata creation • Parsing: fixed width, delimited, mapping • Transformation: enrichment, truncation, imputation, aggregation • Integration: batch, RESTful, database • Web Portal: metadata configuration, tracking, alerts, dashboard
  29. Framework Architecture (diagram) • Processing components: Data Ingestion, Data Partitioner, Data Parser & Transformer, Data Aggregator, Data Profiler, Data Prep Engine, Bivariate Engine, SQL Engine, Elastic Search Loader, DB Loader, Data Reconciliation • Data storage layer: Synchronoss Data Lake, Elastic Search, MySQL meta-data repository • Orchestration layer: XDF Web UI, Data Beacon • External data sources • Legend: control flow, data flow
  30. Framework Technology Stack • Hardware infrastructure level: MapR cluster, ElasticSearch cluster, UI/control cluster • OS/file system level: Unix/Linux, MapR FS (M7), NFS • System application level: Apache Spark, Hadoop, Sqoop, Oozie, Apache Drill, HUE, ElasticSearch engine, Tomcat, ActiveMQ, Spring Integration, Angular, REST
  31. What's Next? • Bivariate analysis, outputs for numeric values (by target value for each variable): record count, row count percent, average, variance, standard deviation, skewness, kurtosis, minimum, maximum • Correlation outputs: Pearson's correlation coefficient, Spearman's correlation coefficient, covariance, correlation matrix • Multicollinearity: variable clustering, regression coefficients, dendrogram, hierarchical cluster analysis (HCA), variance inflation factor (VIF)
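The planned Pearson and Spearman outputs reduce to the textbook formulas, sketched below. The naive ranking here has no tie handling, an assumption made for brevity; a real implementation would average tied ranks.

```python
# Sketch of two of the planned bivariate outputs: Pearson and Spearman
# correlation for a pair of numeric columns.

def pearson(xs, ys):
    """Pearson's r: covariance over the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on ranks (no tie handling)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0.0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))
```

Spearman being rank-based is what lets it flag monotone but non-linear relationships that Pearson underestimates.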
  32. Lessons • Let business value drive technology adoption • Plan incremental updates • Pay attention to hidden costs • Simplify • Implement framework-based development • Leverage the existing skillset to scale
  33. Simplify
  34. THANK YOU. Suren.nathan@Synchronoss.com

Editor's Notes

  • How can data be used?
    Does the data conform to the structure, standards, or patterns?
    c. Challenges of joins and integration
    d. Identify key candidates, foreign-key candidates, and functional dependencies
    e. Identify enrichment rules for better search by assigning data to a category