Data + Science = DataScience
Presented by:
eRic Choo
Scaling up with Cisco Big Data
Big Data Products-Solutions Stack
Infrastructure - Servers, Storage, Data Protection & Retention Solutions
Business Intelligence
Data Mining & Business Analytics
Big Data Virtualization & Systems Integration
What you will be hearing
WHAT
AND WHY?
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 at
UC Berkeley’s AMPLab
• Fully open sourced in 2010 – now
a Top Level Project at the Apache
Software Foundation
• Fast Growing Community
What is Spark?
Spark
Streaming
batches of X seconds
IoT live data stream
processed results
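The "batches of X seconds" idea can be sketched in plain Python: incoming events are grouped by the time window they fall into, and each group is then processed as one small batch. This is an illustration only, not the Spark Streaming API; the function name `micro_batches`, the sample sensor events and the 2-second interval are assumptions made for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into windows of batch_seconds."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Integer division maps every timestamp to the index of its window.
        windows[int(timestamp // batch_seconds)].append(value)
    # Return the non-empty batches in time order.
    return [windows[index] for index in sorted(windows)]


# Hypothetical sensor readings: (seconds since start, temperature).
events = [(0.5, 20.0), (1.2, 22.0), (2.1, 30.0), (3.9, 32.0), (4.0, 25.0)]

# Process each batch as a unit, here by averaging the readings in it.
batches = micro_batches(events, batch_seconds=2)
results = [sum(batch) / len(batch) for batch in batches]
print(results)
```

A smaller batch interval gives fresher results at the cost of more, smaller processing runs.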
Data Science cycle: Understand, Explore, Model, Assess
Hadoop
• MapReduce is powerful, but hard
• Spark aims to be both powerful and easy for processing
• How does it do it?
– A more generalized form of MapReduce
– Elements transformed in parallel
– In-memory caching
– Supports Python & Scala, along with Java
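The map and reduce steps that Spark generalizes can be sketched as a word count in plain Python. This is a single-machine illustration, not Spark or Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample lines are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: turn every line into (word, 1) pairs, one element at a time."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


lines = ["spark is fast", "hadoop is reliable", "spark runs on hadoop"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
print(word_counts)
```

Because each line is mapped independently of the others, the map step is the part that a cluster can spread across many machines.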
What is Spark? An Execution Engine on Top of Hadoop
[Diagram: Input, Map, Reduce, Output]
Spark advantages for the end user
Faster Development & Data Pipelining
• Simple, easy-to-understand programming
abstraction with an interactive shell
• APIs for Java, Python and Scala
• Enables reuse of code across batch,
interactive and streaming applications
e.g. calling machine learning library
routines in Spark SQL
In-Memory Performance
• General-purpose execution graphs
• In-memory pipelining to achieve maximum
performance without persisting
intermediate results to disk
Popular use cases include ETL, Machine Learning and Real-time Analytics
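In-memory pipelining can be pictured with Python generators, which hand each record from one stage to the next without storing the intermediate results anywhere. This is a sketch of the idea only and says nothing about how Spark's engine is implemented; the stage names, the sample records and the threshold are hypothetical.

```python
def parse(lines):
    """Stage 1: split each CSV line into (sensor, reading)."""
    for line in lines:
        sensor, reading = line.split(",")
        yield sensor, float(reading)


def keep_alarms(records, threshold):
    """Stage 2: keep only readings above the threshold."""
    for sensor, reading in records:
        if reading > threshold:
            yield sensor, reading


def to_message(records):
    """Stage 3: format the surviving records."""
    for sensor, reading in records:
        yield f"{sensor}: {reading:.1f}"


raw = ["pump-1,71.5", "pump-2,98.2", "pump-3,64.0", "pump-4,101.7"]

# The stages are chained lazily: no stage materializes its full output,
# so nothing is written out between steps.
pipeline = to_message(keep_alarms(parse(raw), threshold=90.0))
alarms = list(pipeline)
print(alarms)
```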
Easy to Develop Applications – Example
2-5x less code
Hadoop with Speed Advantages - Example
Logistic regression in Hadoop
MapReduce and Hadoop with Spark
Hadoop MR
Hadoop w/ Spark
Up to 10x faster on disk,
100x faster in memory
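Logistic regression is an iterative algorithm: every training pass reads the same data set again, which is why keeping that data in memory matters. The pure-Python sketch below shows the repeated scan; it is not the benchmark code behind the slide, and the training points, learning rate and iteration count are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(points, labels, learning_rate=0.5, iterations=200):
    """Batch gradient descent for one-feature logistic regression."""
    weight, bias = 0.0, 0.0
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        # Every iteration scans the full training set again.
        for x, y in zip(points, labels):
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / len(points)
        bias -= learning_rate * grad_b / len(points)
    return weight, bias


# Hypothetical training set held in memory: feature value and 0/1 label.
points = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]

weight, bias = train(points, labels)
predictions = [1 if sigmoid(weight * x + bias) >= 0.5 else 0 for x in points]
print(weight, bias, predictions)
```

With 200 iterations the data set is read 200 times, so reading it from disk on every pass would dominate the running time.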
MAPR SUPPORT FOR SPARK
MapR –Integration and Support of Apache Spark Stack
[Diagram: MapR distribution stack]
APACHE HADOOP AND OSS ECOSYSTEM
EXECUTION ENGINES
• Batch: Spark, Cascading, Pig, MapReduce v1 & v2, Tez
• Streaming: Spark Streaming, Storm
• NoSQL & Search: HBase, Solr
• ML, Graph: Mahout, MLLib, GraphX
• SQL: Hive, Impala, Spark SQL, Drill
• YARN
DATA GOVERNANCE AND OPERATIONS
• Security: Sentry
• Workflow & Data Governance: Oozie
• Provisioning & Coordination: Juju, Sahara, ZooKeeper
• Data Integration & Access: Sqoop, Flume, HttpFS, Hue
Data Platform: MapR-FS, MapR-DB
Management
Spark Stack Offers Variety of Functionality…
Spark SQL (SQL), Spark Streaming (Streaming), MLlib (Machine learning), GraphX (Graph computation)
Spark (General execution engine)
Mesos, Hadoop YARN
Distributed File System (HDFS, MapR-FS, S3, …)
Spark on MapR Advantages
• High Performance: World-record performance on disk coupled with in-memory processing advantages
• Enterprise-grade Applications: Industry-leading enterprise-grade features for the Spark stack
• 24/7 Best-in-class Global Support: Strategic partnership with Databricks to ensure enterprise support for the entire stack
• Operational DataStore + Spark: MapR-DB + Spark on one Hadoop cluster allows for real-time as-it-happens analytics
SPARK USE CASES
Cisco: Security Intelligence Operations
Sensor data lands in MapR
Spark Streaming on MapR for
first check on known threats
Data next processed on GraphX
and Mahout
Additional SQL querying done
via Spark SQL and Impala
Complex
Data Pipelining
without MapReduce
Industry Leading Ad-Targeting Platform:
Real-time Decisions
High performance analytics
over MapR-DB
Load from MapR-DB table into
RDD to augment scoring
Results stored back in MapR-DB
for other applications
Real-time Analytics
over NoSQL
Addressing Health
Care Regulations
Patient information in MapR-DB
combined with clinical records to
compute re-admittance
probability
Process uses Spark with
transactional data in MapR-DB
Deploy home health services to
prevent re-admittance
Real-time Analytics
over NoSQL
Streaming Use Cases
• Manufacturing & Internet of Things: Real-time, adaptive analysis of machine data (e.g.,
sensors, control parameters, alarms, notifications, maintenance logs, and imaging results)
from industrial systems (e.g., equipment, plant, fleet) for visibility into asset health, proactive
maintenance planning, and optimized operations.
• Fraud Management: Real-time analysis of business communication and accounting
transactions to detect unusual activities.
• Marketing & Sales: Analysis of customer engagement and conversion, powering real-time
recommendations while customers are still on the site or in the store.
• Customer Service & Billing: Analysis of contact center interactions, enabling accurate
remote troubleshooting before expensive field technicians are dispatched.
• Information Technology: Log processing to detect unusual events occurring in stream(s) of
data, so that IT can take remedial action before service quality degrades.
Real-time Analytics
over Streaming
SPARK DEMO
DATA SCIENCE
Data Science
• What is Data Science?
– Extraction of knowledge from data
using math, statistics and information
theory (probability models, machine learning,
etc.)
Source: Wikipedia
Data Analytics/Science Development Cycle
Challenges
• Data Science knowledge
required
• Multiple models for testing
• Multiple ways of tuning testing
data
• Multiple iterations of testing
• Stabilizing results
Benefits of Automation
• Data Science knowledge built into
platform
• Automated testing of multiple
models
• Selection of most accurate models
• Reduced iterative testing time
• Effective use of Data Science
Resources
• Higher productivity and lower
cost
Basic Data Science Categories
• Supervised Learning • Unsupervised Learning
Supervised Learning
• Labelling of data according to a labelled training set
• Example
– I know that it will rain when
• Sky is dark
• More moisture in the air
• It is near the rainy season
– Question:
• Will it rain in the current weather?
• Type of algorithms
– Naive Bayes
– Linear Regression
– Decision Trees
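The rain example can be turned into a tiny Naive Bayes classifier. The sketch below is a plain-Python illustration; the labelled observations are invented for the example, and the helper names `train_naive_bayes` and `predict` are assumptions rather than any library's API.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count how often each feature value occurs for each label."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(index, label)][value] += 1
    return label_counts, value_counts


def predict(model, row):
    """Return the label with the highest (smoothed) posterior score."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(label)
        for index, value in enumerate(row):
            # Laplace smoothing; each feature here has two possible values.
            score *= (value_counts[(index, label)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled observations: (sky, moisture, season) -> outcome.
rows = [
    ("dark", "high", "rainy"),
    ("dark", "high", "dry"),
    ("dark", "low", "rainy"),
    ("clear", "low", "dry"),
    ("clear", "high", "dry"),
    ("clear", "low", "rainy"),
]
labels = ["rain", "rain", "rain", "no rain", "no rain", "no rain"]

model = train_naive_bayes(rows, labels)
today = predict(model, ("dark", "high", "rainy"))
other_day = predict(model, ("clear", "low", "dry"))
print(today, other_day)
```

The labelled training set is what makes this supervised: the classifier only learns the link between the conditions and rain because every past observation comes with its outcome.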
Unsupervised Learning
• Example:
– I have a set of data collected regarding weather
– I have several other sets of data that are not
related to the weather, e.g. forest fire data from
a nearby region.
– Is there any relation between the data sets?
• Type of algorithms
– K-means
– Fuzzy Clustering
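As an illustration of clustering without labels, here is a minimal k-means sketch in plain Python. The 2-D points and the starting centroids are hypothetical, and a production implementation would also handle initialization and convergence checks more carefully.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, centroid in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroid)  # keep an empty cluster's centroid
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabelled measurements forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No point carries a label; the groups emerge only from the distances between the points.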
DATA SCIENCE USE CASES
TRAFFIC ANALYTICS
Traffic Analytics
TEXT ANALYTICS
Text Analytics
CLUSTER DOCUMENTS
Hadoop
Text Documents
MAHOUT (Data Science Tool)
MapReduce
Text Analytics
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
Hadoop
Text Documents
MAHOUT (Data Science Tool)
MapReduce
Categorizing into Topics/Stories
CLUSTER 1 CLUSTER 2
CLUSTER 3 CLUSTER 4
CONSTRUCT STORIES
TOP TERMS CL 1
Technology
3D Printing
Steve Jobs
Sports Wear
…
TOP TERMS CL 2
United Nations
Dogs
Camera
Internet of Things
…
CATEGORY : INNOVATION CATEGORY : SECURITY
Term Document Matrix
[Figure: term-document matrix of Word1 to Word11 against FILE 1 to FILE 7]
Cluster 1: FILE 1, FILE 6
Cluster 2: FILE 2, FILE 5
Cluster 3: FILE 3, FILE 4
Cluster 4: FILE 7
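A term-document matrix simply counts how often each term occurs in each document, and documents with similar counts end up in the same cluster. The plain-Python sketch below is not Mahout code; the miniature documents are invented, and the matrix is stored with one row per document for convenience.

```python
import math
from collections import Counter

# Hypothetical miniature documents standing in for the files.
documents = {
    "FILE 1": "printing technology printing design",
    "FILE 2": "camera security network",
    "FILE 3": "technology design printing",
    "FILE 4": "security camera camera",
}

# Vocabulary: every distinct term, in a fixed order (the matrix columns).
vocabulary = sorted({term for text in documents.values() for term in text.split()})

# One row of term counts per document.
matrix = {}
for name, text in documents.items():
    counts = Counter(text.split())
    matrix[name] = [counts[term] for term in vocabulary]


def cosine(a, b):
    """Cosine similarity between two term-count rows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


print(vocabulary)
for name, row in matrix.items():
    print(name, row)
similar = cosine(matrix["FILE 1"], matrix["FILE 3"])
unrelated = cosine(matrix["FILE 1"], matrix["FILE 2"])
print(similar, unrelated)
```

A clustering algorithm such as k-means can then group the rows, so that documents sharing many terms land in the same cluster.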
SENTIMENT ANALYTICS
Sentiment Analysis
TWITTER (DATA IN JSON FORMAT)
Field Value
For Country United States
By individual State
Analyze Tweets
Objective: To find out the level of happiness of each state in the USA
Sentiment Dictionary
AFINN-111
Sentiment Score Computation
San Francisco
Los Angeles
New York
Chicago
Boston
San Diego
Score at tweet level for CA
Summing up the tweet level scores for each state
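The scoring step can be sketched as a dictionary lookup followed by a sum per state. The word scores below are made-up stand-ins in the style of a sentiment dictionary and are not taken from AFINN-111; the tweets and the helper name `tweet_score` are likewise hypothetical.

```python
from collections import defaultdict

# Toy word scores in the style of a sentiment dictionary. These values are
# made up for the example and are NOT taken from AFINN-111.
word_scores = {"love": 3, "great": 3, "happy": 3, "bad": -3, "hate": -3, "awful": -3}

# Hypothetical tweets already reduced to (state, text).
tweets = [
    ("CA", "I love this great weather"),
    ("CA", "Traffic is awful today"),
    ("NY", "I hate the bad traffic"),
    ("NY", "Happy to be home"),
]


def tweet_score(text):
    """Score one tweet: sum the scores of the words found in the dictionary."""
    return sum(word_scores.get(word, 0) for word in text.lower().split())


# Sum the tweet-level scores for each state.
state_scores = defaultdict(int)
for state, text in tweets:
    state_scores[state] += tweet_score(text)

print(dict(state_scores))
```

Words missing from the dictionary count as zero, so only words with a known sentiment move a state's total.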
Results in Sentiment Analysis
Happy States
Unhappy States
Results in an example of Simple Visualization
Results in an example of Complex Visualization
Decision with Analytics Support
DECISION
Hadoop
Social Media
Data
Text Analytics
Data Science
SQL (Structured Query Language)
DATA SCIENCE AUTOMATION
DEMO
Data Science Automation
DataRobot is a platform that lets data scientists automate the entire model life
cycle, a process that is otherwise highly serial and time consuming. This life cycle
includes:
1. Pre-processing and feature engineering
2. Algorithm identification to build predictive model(s)
3. Training, testing, and validating of models
4. Building of deployment scripts for model deployment to provide business
insight
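Steps 2 and 3 can be pictured as a loop that evaluates several candidate models on held-out data and keeps the most accurate one. The sketch below is a generic illustration of that loop and does not use or describe DataRobot's API; the candidate "models" are simple threshold rules, and the validation data is invented.

```python
def make_threshold_model(threshold):
    """A trivial candidate model: predict 1 when the feature exceeds threshold."""
    return lambda x: 1 if x > threshold else 0


def accuracy(model, features, labels):
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for x, y in zip(features, labels) if model(x) == y)
    return correct / len(labels)


# Hypothetical validation data held out from training.
features = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]

# Candidate models to evaluate automatically.
candidates = {f"threshold={t}": make_threshold_model(t) for t in (0.2, 0.5, 0.7)}

# Score every candidate and keep the most accurate one.
scores = {name: accuracy(model, features, labels) for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Automating this loop is what removes the manual, one-model-at-a-time iteration listed under the challenges earlier.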
CISCO – MAPR DATA
ANALYTICS USE CASE
Quantium captures new niche in data analytics market
MapR Distribution for Apache Hadoop and
Cisco UCS cut query time by 92 percent,
improve accuracy of results
“With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our
competitors for the foreseeable future.”
- Alex Shaw, Head of Technology Operations, Quantium
https://marketplace.cisco.com/catalog/products/3344
Hosted on Cisco infrastructure, MapR
Distribution for Hadoop meets Quantium’s strict
requirements
To meet its challenges, Quantium assembled a team of data scientists from across the business. The team created a set of requirements
and evaluated the available software and hardware solutions on the market.
“Decisions about the new platform would affect Quantium’s business for years to come, so we invested a significant amount of time
and money in the selection process,”
- Alex Shaw, Quantium’s Head of Technology Operations
“The POC demonstrated that MapR performs better than the competition. The
MapR file system gives us maximum control over how we store information within
the data volumes and has good security features.”
- Alex Shaw, Quantium’s Head of Technology Operations
• Quantium realized that a big data solution was needed, not only because
of the data volume but also the heavy analytical requirements.
• While the team chose Hadoop as the big
data software solution, they still needed
to choose the best distribution from
among the top-tier Hadoop vendors (see
figure 1).
• The first stage of the process, a thorough
analysis of features and benefits,
narrowed the field to MapR and one
other competitor.
• Performance of new platform exceeds targets
• Unique business model outpaces competitors
• Greater innovation, shorter time to market
“Having access to external data sets to combine with
our clients’ data distances us from everybody else in
this space,”
“We have a lot of smart people who have been
hamstrung by technology and its ability to implement
their ideas. Now they have improved ways of executing
analytics which opens up the ability to create new and
innovative solutions for our clients”
- Alex Shaw, Quantium’s Head of Technology Operations
• Scaling to accommodate business growth
• Multi-tenancy model safeguards client information
“MapR incorporates data partitioning
via the Volumes feature, which allows us
to logically segregate individual data
sets while optimizing data storage for
optimum performance,”
- Alex Shaw, Quantium’s Head of Technology
Operations
Extending the Quantium approach to new
markets
“We’ve expanded the range of problems that we can
solve, enabling our clients to grow their business by
interacting with each of their customers as individuals
with specific wants and needs,”
“With the Cisco-MapR platform, Quantium has
positioned itself to stay well ahead of our competitors
for the foreseeable future.”
- Alex Shaw, Quantium’s Head of Technology Operations
WORLD’S LARGEST BIOMETRIC IDENTITY
SYSTEM: AADHAAR EXPERIENCE
World's Largest Biometric Identity System: Aadhaar Experience
• 1.2 billion residents
– 640,000 villages, ~60% under $2/day, ~75% literacy
– <3% pay income tax, <20% banking
– ~1 billion mobile connections
– ~300-400m migrant workers
• $50 billion direct subsidies every year!
– Residents have no standard verifiable identity
– Most programs plagued with ghost and multiple
identities causing leakage of 20-40%
Demographic Data
• Compulsory data:
– Name, Age/Date of Birth,
Gender and
– Address of the resident
• Optional data:
– Mobile number
– Email address
Biometric Data
Photograph
All 10
fingerprints
Both irises
12-digit Aadhaar Number
Unique, lifetime, biometric based identity
Concluding Spark?
Spark
Streaming
batches of X seconds
IoT live data stream
processed results
Hadoop
Data Science cycle: Understand, Explore, Model, Assess
Big Data Implementation Road Map
Phases: PLAN, BUILD, MANAGE
Data Science cycle: Understand, Explore, Model, Assess
Steps: Discovery Workshop; Proof of Concept; Validation; Plan, Design, Implement; Support / Managed Services
Please take some time to fill in the
feedback form and the Question Sheet.

CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 

Scaling up with Cisco Big Data: Data + Science = Data Science

  • 1. Data + Science = DataScience. Presented by: eRic Choo. Scaling up with Cisco Big Data
  • 2. Big Data Products-Solutions Stack: Infrastructure (Servers, Storage, Data Protection & Retention Solutions); Business Intelligence; Data Mining & Business Analytics; Big Data Virtualization & Systems Integration
  • 3. What you will be hearing
  • 5. Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMPLab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation • Fast Growing Community
  • 6. What is Spark? Diagram labels: Spark Streaming; IoT live data stream; batches of X seconds; processed results; Hadoop; Data Science cycle (Understand, Explore, Model, Assess)
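The "batches of X seconds" idea on this slide can be sketched in plain Python. This is not the Spark Streaming API, only an illustration of grouping a live stream into fixed time windows; the function name `micro_batches` and the sensor readings are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive windows of batch_seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Integer division assigns each event to the window it falls in.
        batches[int(timestamp // batch_seconds)].append(value)
    # Return the non-empty batches in time order.
    return [batches[key] for key in sorted(batches)]

# Hypothetical sensor readings: (seconds since start, temperature).
readings = [(0.5, 20.0), (1.2, 21.0), (2.1, 35.0), (3.9, 22.0), (4.0, 23.0)]

for batch in micro_batches(readings, batch_seconds=2):
    # Each batch is processed as a small, ordinary data set.
    print(len(batch), sum(batch) / len(batch))
```

Each batch can then be handled with the same code that would process a static data set, which is the reuse across batch and streaming that the deck describes.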
  • 7. What is Spark? An Execution Engine on Top of Hadoop • MapReduce is powerful, but hard • Spark aims to be both powerful and easy for processing • How does it do it? – A more generalized form of MapReduce – Elements transformed in parallel – In-memory caching – Supports Python & Scala, along with Java. Diagram labels: Input; Map; Reduce; Output
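As a rough illustration of the points on this slide (a map step applied to elements in parallel, a reduce step, and a result kept in memory for reuse), here is a minimal plain-Python sketch. It does not use Spark itself; the partitions and the helper names `count_words` and `merge_counts` are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(partition):
    """Map step: count the words in one partition of lines."""
    return Counter(word for line in partition for word in line.lower().split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts."""
    return left + right

# Hypothetical input, already split into two partitions.
partitions = [
    ["Spark is fast", "Spark is easy"],
    ["MapReduce is powerful"],
]

# Transform the partitions in parallel, then reduce the partial results.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_words, partitions))
word_counts = reduce(merge_counts, partial_counts, Counter())

# word_counts stays in memory, so later queries reuse it without re-reading the input.
print(word_counts["spark"], word_counts["is"])
```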
  • 8. Spark advantages for the end user. Faster Development & Data Pipelining: • Simple, easy-to-understand programming abstraction with an interactive shell • APIs for Java, Python and Scala • Enables reuse of code across batch, interactive and streaming applications, e.g. calling machine learning library routines in Spark SQL. In-Memory Performance: • General-purpose execution graphs • In-memory pipelining to achieve maximum performance without persisting intermediate results to disk. Popular use cases include ETL, Machine Learning and Real-time Analytics
  • 9. Easy to Develop Applications – Example: 2-5x less code
  • 10. Hadoop with Speed Advantages – Example: logistic regression in Hadoop MapReduce (Hadoop MR) versus Hadoop with Spark (Hadoop w/ Spark). Up to 10x faster on disk, 100x faster in memory
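Logistic regression is an iterative algorithm: every training pass scans the same data set again, which is why keeping the data in memory can pay off. The sketch below is plain Python, not the benchmark from the slide; the toy data, learning rate and iteration count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(points, iterations=200, learning_rate=0.5):
    """Batch gradient descent for labels in {0, 1} with one feature plus a bias."""
    weight, bias = 0.0, 0.0
    for _ in range(iterations):
        grad_w = grad_b = 0.0
        # Every iteration scans the same data set again.
        for x, y in points:
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / len(points)
        bias -= learning_rate * grad_b / len(points)
    return weight, bias

# Hypothetical, linearly separable toy data: (feature, label).
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
weight, bias = train_logistic_regression(data)
predictions = [1 if sigmoid(weight * x + bias) >= 0.5 else 0 for x, _ in data]
print(predictions)
```

With 200 iterations the inner loop reads `data` 200 times; on a disk-based engine each of those passes would reload the input.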
  • 11.
  • 13. MapR – Integration and Support of Apache Spark Stack. Diagram labels, in extraction order: APACHE HADOOP AND OSS ECOSYSTEM; Security; YARN; Spark Streaming; Storm; Streaming; NoSQL & Search; Juju; Provisioning & Coordination; Sahara; ML, Graph; Mahout; MLLib; GraphX; EXECUTION ENGINES; DATA GOVERNANCE AND OPERATIONS; Workflow & Data Governance; Pig; Cascading; Spark; Batch; MapReduce v1 & v2; Tez; HBase; Solr; Hive; Impala; Spark SQL; Drill; SQL; Sentry; Oozie; ZooKeeper; Sqoop; Flume; Data Integration & Access; HttpFS; Hue; Data Platform; MapR-FS; MapR-DB; Management
  • 14. Spark Stack Offers Variety of Functionality… Spark SQL (SQL); Spark Streaming (Streaming); MLlib (Machine learning); GraphX (Graph computation); Spark (General execution engine); Mesos; Hadoop YARN; Distributed File System (HDFS, MapR-FS, S3, …)
  • 15. Spark on MapR Advantages. High Performance: world-record performance on disk coupled with in-memory processing advantages. Enterprise-grade Applications: industry-leading enterprise-grade features for the Spark stack. 24/7 Best-in-class Global Support: strategic partnership with Databricks to ensure enterprise support for the entire stack. Operational DataStore + Spark: MapR-DB + Spark on one Hadoop cluster allows for real-time as-it-happens analytics
  • 16.
  • 18. Cisco: Security Intelligence Operations (Complex Data Pipelining without MapReduce) • Sensor data lands in MapR • Spark Streaming on MapR for first check on known threats • Data next processed on GraphX and Mahout • Additional SQL querying done via Spark SQL and Impala
  • 19. Industry Leading Ad-Targeting Platform: Real-time Decisions (Real-time Analytics over NoSQL) • High performance analytics over MapR-DB • Load from MapR-DB table into RDD to augment scoring • Results stored back in MapR-DB for other applications
  • 20. Addressing Health Care Regulations (Real-time Analytics over NoSQL) • Patient information in MapR-DB combined with clinical records to compute re-admittance probability • Process uses Spark with transactional data in MapR-DB • Deploy home health services to prevent re-admittance
  • 21. Streaming Use Cases (Real-time Analytics over Streaming) • Manufacturing & Internet of Things: Real-time, adaptive analysis of machine data (e.g., sensors, control parameters, alarms, notifications, maintenance logs, and imaging results) from industrial systems (e.g., equipment, plant, fleet) for visibility into asset health, proactive maintenance planning, and optimized operations. • Fraud Management: Real-time analysis of business communication and accounting transactions to detect unusual activities. • Marketing & Sales: Analysis of customer engagement and conversion, powering real-time recommendations while customers are still on the site or in the store. • Customer Service & Billing: Analysis of contact center interactions, enabling accurate remote troubleshooting before expensive field technicians are dispatched. • Information Technology: Log processing to detect unusual events occurring in stream(s) of data, so that IT can take remedial action before service quality degrades.
  • 22.
  • 24.
  • 25.
  • 27. Data Science • What is Data Science? – Extraction of knowledge from data using math, statistics and information theory (probability models, machine learning, etc.)
  • 28. Data Analytics/Science Development Cycle (Source: Wikipedia). Challenges: • Data Science knowledge required • Multiple models for testing • Multiple ways of tuning testing data • Multiple iterations of testing • Stabilizing results. Benefits of Automation: • Data Science knowledge built into platform • Automated testing of multiple models • Selection of most accurate models • Reduced iterative testing time • Effective use of Data Science resources • Higher productivity and lower cost
  • 29. Basic Data Science Categories • Supervised Learning • Unsupervised Learning
  • 30. Supervised Learning • Labelling of data according to a labelled training set • Example – I know that it will rain when • The sky is dark • There is more moisture in the air • It is near the rainy season – Question: • In the current weather, will it rain? • Types of algorithms – Naive Bayes – Linear Regression – Decision Trees
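The rain example on this slide can be sketched with Naive Bayes, one of the algorithms it lists. This is a minimal plain-Python version for binary features with Laplace smoothing; the training rows and the function names are hypothetical and chosen only to mirror the three conditions on the slide.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (features, label). Returns the counts needed for prediction."""
    label_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # (position, label) -> Counter of values
    for features, label in examples:
        for position, value in enumerate(features):
            feature_counts[(position, label)][value] += 1
    return label_counts, feature_counts

def predict(model, features):
    """Return the label with the highest prior times feature likelihoods."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, count in label_counts.items():
        score = count / total  # prior probability of the label
        for position, value in enumerate(features):
            # Laplace smoothing, assuming every feature is binary (2 possible values).
            score *= (feature_counts[(position, label)][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled training set: (dark_sky, humid, rainy_season) -> label.
training = [
    ((True, True, True), "rain"),
    ((True, True, False), "rain"),
    ((True, False, True), "rain"),
    ((False, False, False), "no rain"),
    ((False, True, False), "no rain"),
    ((False, False, True), "no rain"),
]
model = train_naive_bayes(training)
print(predict(model, (True, True, True)))
```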
  • 31. Unsupervised Learning • Example: – I have a set of data collected regarding weather – I have multiple other data sets that are not related to the weather, e.g. forest fire data from a nearby region – Is there any relation between the data sets? • Types of algorithms – K-means – Fuzzy Clustering
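K-means, listed on this slide, groups unlabelled points by repeatedly assigning each point to its nearest centroid and moving each centroid to the mean of its group. The sketch below is a small plain-Python version (it needs Python 3.8+ for `math.dist`); the observations and the starting centroids are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            (sum(x for x, _ in cluster) / len(cluster),
             sum(y for _, y in cluster) / len(cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical observations, e.g. (temperature, humidity) pairs on some scale.
observations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(observations, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

No labels are supplied; the two groups emerge from the distances alone, which is what distinguishes this from the supervised example.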
  • 32.
  • 36.
  • 38. Text Analytics. Diagram labels: Text Documents; Hadoop; MapReduce; MAHOUT (Data Science Tool); CLUSTER DOCUMENTS
  • 39. Text Analytics. Diagram labels: Text Documents; Hadoop; MapReduce; MAHOUT (Data Science Tool); CLUSTER 1; CLUSTER 2; CLUSTER 3; CLUSTER 4
  • 40. Categorizing into Topics/Stories: CLUSTER 1, CLUSTER 2, CLUSTER 3, CLUSTER 4; CONSTRUCT STORIES. TOP TERMS CL 1: Technology, 3D Printing, Steve Jobs, Sports Wear, … TOP TERMS CL 2: United Nations, Dogs, Camera, Internet of Things, … CATEGORY: INNOVATION; CATEGORY: SECURITY
  • 41. Term Document Matrix. Columns: Word1 to Word 11. Rows: FILE 1 to FILE 7. Cluster 1: FILE 1, FILE 6. Cluster 2: FILE 2, FILE 5. Cluster 3: FILE 3, FILE 4. Cluster 4: FILE 7
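A term-document matrix like the one on this slide has one row per file and one column per word, with each cell holding how often the word occurs in the file. The sketch below builds one in plain Python; it is not Mahout, and the file contents are invented because the slide only names the files.

```python
from collections import Counter

def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(text.lower().split()) for text in documents.values()]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix

# Hypothetical file contents; the slide only names FILE 1 to FILE 7.
files = {
    "FILE 1": "3d printing technology",
    "FILE 2": "internet of things camera",
    "FILE 6": "printing technology technology",
}
vocabulary, matrix = term_document_matrix(files)
print(vocabulary)
for name, row in zip(files, matrix):
    print(name, row)
```

Rows with similar counts (here FILE 1 and FILE 6) are the ones a clustering algorithm would tend to place together.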
  • 42.
  • 44. Sentiment Analysis. Source: TWITTER (DATA IN JSON FORMAT). Field value for Country: United States; by individual State. Analyze Tweets. Objective: to find out the level of happiness of each state in the USA
  • 46. Sentiment Score Computation. Cities shown: San Francisco, Los Angeles, New York, Chicago, Boston, San Diego. Tweets from the California cities each get a score at tweet level for CA; the tweet-level scores are then summed up for each state
  • 47. Results in Sentiment Analysis: Happy States; Unhappy States
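The computation on this slide (score each tweet, then sum the tweet-level scores per state) can be sketched as follows. The word scores, the city-to-state lookup and the tweets are all hypothetical; the slides do not say which sentiment lexicon was used.

```python
from collections import defaultdict

# Hypothetical word scores and city-to-state lookup; the slides do not give the lexicon.
WORD_SCORES = {"happy": 2, "love": 3, "great": 2, "sad": -2, "hate": -3}
CITY_TO_STATE = {
    "San Francisco": "CA", "Los Angeles": "CA", "San Diego": "CA",
    "New York": "NY", "Chicago": "IL", "Boston": "MA",
}

def tweet_score(text):
    """Score one tweet as the sum of the scores of its known words."""
    return sum(WORD_SCORES.get(word, 0) for word in text.lower().split())

def state_scores(tweets):
    """Sum the tweet-level scores for each state."""
    totals = defaultdict(int)
    for city, text in tweets:
        totals[CITY_TO_STATE[city]] += tweet_score(text)
    return dict(totals)

tweets = [
    ("San Francisco", "I love this great city"),
    ("Los Angeles", "so happy today"),
    ("San Diego", "I hate traffic"),
    ("New York", "sad and tired"),
]
print(state_scores(tweets))
```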
  • 48. Results in an example of Simple Visualization
  • 49. Results in an example of Complex Visualization
  • 50. Decision with Analytics Support. Diagram labels: Social Media Data; Hadoop; Text Analytics; Data Science; SQL (Structured Query Language); DECISION
  • 51.
  • 53. Data Science Automation: DataRobot is a platform that lets data scientists automate the entire model life cycle, which is otherwise highly serialized and time-consuming. This life cycle includes: 1. Pre-processing and feature engineering 2. Algorithm identification to build predictive model(s) 3. Training, testing, and validating of models 4. Building of deployment scripts for model deployment to provide business insight
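The core of steps 2 and 3 is a loop that scores several candidate models on held-out data and keeps the most accurate one. The sketch below shows that loop in plain Python under simple assumptions; it is not DataRobot's API, and the candidate models, names and validation data are invented for illustration.

```python
def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

def threshold_model(threshold):
    """Candidate model: predict 1 when the feature exceeds the threshold."""
    return lambda x: 1 if x > threshold else 0

# Hypothetical held-out validation data: (feature, label).
validation = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 0), (0.7, 1)]

# Hypothetical candidate models, identified by name.
candidates = {f"threshold={t}": threshold_model(t) for t in (0.2, 0.5, 0.8)}

# Score every candidate and keep the most accurate one.
scores = {name: accuracy(model, validation) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])
```

Automating this loop is what removes the repeated manual testing that the earlier slide on development-cycle challenges describes.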
  • 54.
  • 55. CISCO – MAPR DATA ANALYTICS USE CASE
  • 56. Quantium captures new niche in data analytics market. MapR Distribution for Apache Hadoop and Cisco UCS cut query time by 92 percent, improve accuracy of results. “With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our competitors for the foreseeable future.” - Alex Shaw, Head of Technology Operations, Quantium. https://marketplace.cisco.com/catalog/products/3344
  • 57. Hosted on Cisco infrastructure, MapR Distribution for Hadoop meets Quantium’s strict requirements. To meet its challenges, Quantium assembled a team of data scientists from across the business. The team created a set of requirements and evaluated the available software and hardware solutions on the market. “Decisions about the new platform would affect Quantium’s business for years to come, so we invested a significant amount of time and money in the selection process.” - Alex Shaw, Quantium’s Head of Technology Operations
  • 58. “The POC demonstrated that MapR performs better than the competition. The MapR file system gives us maximum control over how we store information within the data volumes and has good security features.” - Alex Shaw, Quantium’s Head of Technology Operations • Quantium realized that a big data solution was needed, not only because of the data volume but also the heavy analytical requirements. • While the team chose Hadoop as the big data software solution, they still needed to choose the best distribution from among the top-tier Hadoop vendors (see figure 1). • The first stage of the process, a thorough analysis of features and benefits, narrowed the field to MapR and one other competitor.
  • 59. • Performance of new platform exceeds targets • Unique business model outpaces competitors • Greater innovation, shorter time to market “Having access to external data sets to combine with our clients’ data distances us from everybody else in this space,” “We have a lot of smart people who have been hamstrung by technology and its ability to implement their ideas. Now they have improved ways of executing analytics which opens up the ability to create new and innovative solutions for our clients” - Alex Shaw, Quantium’s Head of Technology Operations
  • 60. • Scaling to accommodate business growth • Multi-tenancy model safeguards client information “MapR incorporates data partitioning via the Volumes feature, which allows us to logically segregate individual data sets while optimizing data storage for optimum performance,” - Alex Shaw, Quantium’s Head of Technology Operations
  • 61. Extending the Quantium approach to new markets “We’ve expanded the range of problems that we can solve, enabling our clients to grow their business by interacting with each of their customers as individuals with specific wants and needs,” “With the Cisco-MapR platform, Quantium has positioned itself to stay well ahead of our competitors for the foreseeable future.” - Alex Shaw, Quantium’s Head of Technology Operations
  • 62. WORLD’S LARGEST BIOMETRIC IDENTITY SYSTEM: AADHAAR EXPERIENCE
  • 63. World's Largest Biometric Identity System: Aadhaar Experience • 1.2 billion residents – 640,000 villages, ~60% under $2/day, ~75% literacy – <3% pay income tax, <20% banking – ~1 billion mobile connections – ~300-400m migrant workers • $50 billion in direct subsidies every year! – Residents have no standard verifiable identity – Most programs plagued with ghost and multiple identities, causing leakage of 20-40%
  • 64. World's Largest Biometric Identity System: Aadhaar Experience. Demographic Data • Compulsory data: – Name, Age/Date of Birth, Gender and Address of the resident • Optional data: – Mobile number – Email address. Biometric Data: Photograph; all 10 fingerprints; both irises. 12-digit Aadhaar Number: unique, lifetime, biometric-based identity
  • 65. World's Largest Biometric Identity System: Aadhaar Experience
  • 66. Concluding Spark? Diagram labels: Spark Streaming; IoT live data stream; batches of X seconds; processed results; Hadoop; Data Science cycle (Understand, Explore, Model, Assess)
  • 67.
  • 68.
  • 69. Big Data Implementation Road Map. Phases: PLAN; BUILD; MANAGE. Data Science cycle: Understand, Explore, Model, Assess. Services: Discovery Workshop; Proof of Concept Validation; Plan, Design, Implement; Support / Managed Services
  • 70. Please take some time to fill in the feedback form and the Question Sheet