Today’s reality Hadoop with Spark: How to select the best Data Science approach when using Big Data Platforms and Technologies?
2. CONFIDENTIAL | © 2015 Think Big, a Teradata Company
Think Big History
1st SI Solution Provider with 100% focus on open source and Big Data Hadoop ecosystem
• 100+ Successful Programs
• 70+ Clients
• Global Delivery Capabilities
• We are hiring
3. Think Big Clients
Trusted Analytics Services Provider to the Fortune 1000
• eCommerce: 2 of Global Top 5
• Internet Transaction Security: Global #1
• Retail: 2 of Global Top 5
• Brokerage & Mutual Funds: 2 of Global Top 5
• Social Networking: Global #1
• Asset Management: Global #1
• Credit Issuer: 2 of Global Top 5
• Semiconductor: 2 of Global Top 5
• Banking: 4 of Global Top 10
• Data Storage Devices: 3 of Global Top 5
• Financial Data Services: 2 of Global Top 5
• Disk Manufacturing: Global #1
• Financial Exchanges: Global #2
• Telecommunications: 2 of Global Top 5
• Media & Advertising: 2 of Global Top 5
4. Think Big VELOCITY Methodology
• Big Data Strategy
• Think Big Academy
• Big Data Program Management
• Business Analytics
• Managed Services
• Data Engineering
• Big Data Lab
Think Big engages with its clients’ business, technical, analyst and support teams in an agile-inspired VELOCITY Methodology to continuously develop Big Data solutions.
5. What is Apache Spark?
• Open source Apache project
− Parallel middleware for server clusters
− spark.apache.org (2014)
• Developed by UC Berkeley’s AMPLab
− Supported by Databricks
• Top use cases
− SQL-on-Hadoop
− Machine learning
− Streaming data mini-batches
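The "streaming data mini-batches" use case can be illustrated with a minimal sketch in plain Python. This is not the Spark Streaming API: Spark groups incoming records by time interval, whereas this sketch groups by record count to keep the example self-contained. The names `mini_batches` and `batch_totals` are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def mini_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of up to batch_size records from the stream."""
    iterator = iter(stream)
    while True:
        # Pull the next chunk of records; an empty chunk means the stream ended.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batch_totals(stream: Iterable[int], batch_size: int) -> List[int]:
    """Apply one small batch computation (a sum) to every mini-batch."""
    return [sum(batch) for batch in mini_batches(stream, batch_size)]


if __name__ == "__main__":
    events = range(1, 11)  # stand-in for an unbounded event stream
    print(batch_totals(events, batch_size=4))  # prints [10, 26, 19]
```

The point of the pattern is that each mini-batch is processed with ordinary batch logic, so the same code can serve both batch and near-real-time workloads.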
6. What is Apache Spark?
Apache Spark Core Engine
• Spark SQL
• Spark Streaming
• MLlib (Machine learning)
• GraphX (Graph)
Scala, R (SparkR), Python (PySpark)
7. Data Science Approaches
Single Workstation
- Small data sets
- No distributed analytics across multiple nodes
- Powerful tools are R or Python
- Data Scientist can focus on business problem
Mixed: Single Workstation / Cluster
- Small or large data sets
- Data wrangling and feature engineering are performed on cluster
- Predictive analysis and modeling can be performed on single workstation
- Powerful tools are Hadoop Streaming and Spark combined with R and Python
- Data Scientists now have to worry about parallelisation of some data mining tasks (usually the ones that are embarrassingly parallel)
Cluster
- Large data sets
- Both data wrangling and modeling are performed on cluster
- Spark is one of the few tools that support efficient parallel machine learning
- Parallelising machine learning algorithms is challenging
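The "embarrassingly parallel" case mentioned under the mixed approach can be sketched as follows: one independent model per data segment, with no communication between the fits. This is a minimal illustration under stated assumptions, not the deck's method. The segment names, the data, and the helpers `fit_line` and `fit_all_segments` are hypothetical. A thread pool from the Python standard library stands in for the cluster; on a real cluster the same map step would be handed to Spark or Hadoop Streaming workers.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept for one segment."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fit_all_segments(segments, max_workers=4):
    """Fit every segment independently; no fit needs data from another one."""
    names = list(segments)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        fits = list(pool.map(fit_line, (segments[name] for name in names)))
    return dict(zip(names, fits))


if __name__ == "__main__":
    # Hypothetical toy data: (x, y) observations per segment.
    segments = {
        "north": [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
        "south": [(0.0, 4.0), (1.0, 3.0), (2.0, 2.0)],
    }
    for name, (slope, intercept) in fit_all_segments(segments).items():
        print(name, slope, intercept)
```

Because each fit touches only its own segment, the work splits cleanly across workers. Algorithms that need to share state between partitions during training do not split this way, which is why parallelising machine learning in general is harder.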
8. Productionising Analytics
[Diagram labels: Data Sources (Operations, Sales, Marketing, etc.); Ingestion; Data; Real-time Data; Data Lake (HDFS); Core Data Science; Plug & play model deployment; Production; Real-time Optimization with Multi-armed Bandit]
• Integration of R and Python with Hadoop and Spark
• Leveraging computing power of Hadoop cluster for distributed analytics
• Plug & play model deployment tools for easy and robust productionising of analytics models
Production
• Dashboards
• R Shiny Apps
• Predictive model scoring
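The slide names real-time optimization with a multi-armed bandit but does not say which bandit strategy is used. As one common choice, an epsilon-greedy bandit can be sketched as below. Everything here is an assumption for illustration: the class `EpsilonGreedyBandit`, the `simulate` helper, the reward rates, and the fixed seeds are hypothetical and not taken from the deck.

```python
import random


class EpsilonGreedyBandit:
    """Explore a random arm with probability epsilon, else exploit the best estimate."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental mean: no need to store the full reward history.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Run the bandit against simulated Bernoulli rewards (hypothetical rates)."""
    bandit = EpsilonGreedyBandit(len(true_rates), epsilon, seed)
    env = random.Random(seed + 1)  # separate generator for the environment
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = 1.0 if env.random() < true_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    result = simulate([0.1, 0.5, 0.9])
    print("pulls per arm:", result.counts)
    print("estimated rates:", [round(v, 2) for v in result.values])
```

In a production setting the arms might be competing offers, prices, or model variants, and `update` would be called as real-time outcomes arrive. Unlike a fixed-split A/B test, the bandit shifts traffic toward the better-performing arm while the experiment is still running.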
9. Data Science and Analytics Overview
Data Science Project
• Project Kick-off
• Data Profiling and Exploratory Analysis
• Analytics Modeling
• Model Validation
• Model Publishing
• Reporting
10. We leverage our expertise across industries
• Dynamic Pricing
• Fraud Detection
• Customer Segmentation
• Recommendation Engine
• Predictive Asset Maintenance
• Proactive Customer Support
• Credit Default Prediction
• Churn Modeling
• Scenario Simulation
• A/B Testing
• Display Targeting Optimisation
• Demand Forecast
• Cluster Analysis & Segmentation
• Device Analytics
• Risk Analytics
• Customer Analytics