Essential Data Engineering
for Data Scientist
Me, myself, and I: Valentyn Kropov
• Sr. Big Data Solutions Architect.
• 14 years of work experience with Databases.
• 4 years in Big Data.
• Big Data Consulting Lead at SoftServe (20+ Engineers and Architects).
• Founder of Kyiv Big Data Community (600+ people).
webinar
Agenda
1. Level of Involvement
2. Choosing the Right Tools (Distribution of Hadoop)
3. RDBMS vs. NoSQL
4. NoSQL Data Modeling
5. Deployment
6. On-Premises vs. Cloud
7. Scalability and Performance
8. Storage
Level of Involvement
Who Should be Leading Data Science Projects?
Project Stages from Data Engineering Perspective
1. Statement of work
2. Requirements
3. Architecture
4. Infrastructure
5. Data modeling/ETL
6. Data Science modeling
Involvement: Checklist
1. You’re the boss!
2. You have a right to demand the infrastructure you need.
3. But you need a solid argument to back it up.
4. And I’ll show it to you right now.
Choosing the Right Tools
Big Data Landscape 2016
http://goo.gl/Rp9Axm
Big Data Analytics Reference Architecture
A modern-integrated approach for solving Big Data/Business Analytics needs
across multiple verticals and domains.
• Data sources (All Data): application data; media data (images, video, etc.); social data; enterprise content data; machine, sensor, and log data; docs and archives data.
• Data Acquisition and Storing.
• Real-time Data Processing: Complex Event Processing; Real-time Query and Search.
• Data Integration.
• Enterprise Data Warehousing.
• Data Lake (Landing, Exploration, and Archiving).
• Data Management (Governance, Security, Quality, MDM).
• Analytics: Reporting and Analysis; Predictive Modeling; Data Mining.
• UX and Visualization.
• Applications: Customer Analytics; Marketing Analytics; Web/Mobile/Social Analytics; IT Operational Analytics; Fraud and Risk Analytics.
Hortonworks vs. Cloudera vs. MapR
Feature                     Hortonworks     Cloudera          MapR
File system                 HDFS            HDFS              MapR FS
Non-Hadoop access           NFS             Fuse-DFS          Direct Access NFS
Data integration services   Talend          -                 -
Data analysis framework     -               DataFu            -
Software abstraction layer  -               -                 Apache Cascading
Web access                  WebHDFS         HttpFS            -
Parallel query execution    Tez (Stinger)   Impala            -
Installation                Ambari          Cloudera Manager  -
Security                    -               Sentry            -
Monitoring                  Ganglia/Nagios  -                 -
Non-MapReduce tasks         YARN            YARN              -
http://www.networkworld.com/article/2369327/software/comparing-the-top-hadoop-distributions.html
Or Even More: IBM, Oracle, Amazon, …
1. IBM: Big R (set of Data Science algorithms) and Big SQL (SQL-like interface to data).
2. Oracle: Big Data appliance/connectors.
3. Amazon: Elastic MapReduce.
Choosing the Right Tools: Example (Description)
Data Volume:
• 270-300 Web Servers (Apache HTTPD)
• 447 392 events per minute
• 644 245 094 events / day
• ~100-250 bytes per event
• 150GB of data per day
Log Types:
• Apache HTTPD access log
• Apache HTTPD error log
• Service log (CPU, RAM, I/O, Disk)
• Application server servlet log
Retention:
• Last 30 days: Raw data
• Last 24 hours: per minute aggregation
• Whole period: per hour aggregation
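The figures on this slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal gigabytes (10^9 bytes) and a hypothetical replication factor of 3, which the slides do not state. The per-day event count it derives from the per-minute rate lands within about 0.0001% of the slide's 644 245 094, and the slide's 150GB per day falls inside the range implied by 100-250 bytes per event.

```python
# Back-of-the-envelope sizing for the log-collection example.
EVENTS_PER_MINUTE = 447_392          # from the slide
BYTES_PER_EVENT_RANGE = (100, 250)   # from the slide
DAILY_VOLUME_GB = 150                # from the slide
RAW_RETENTION_DAYS = 30              # from the Retention list
REPLICATION_FACTOR = 3               # assumption: not stated in the slides


def events_per_day(events_per_minute: int) -> int:
    """Scale a per-minute event rate to a full day."""
    return events_per_minute * 60 * 24


def daily_volume_gb(events: int, bytes_per_event: int) -> float:
    """Daily volume in decimal gigabytes (1 GB = 10**9 bytes)."""
    return events * bytes_per_event / 1e9


def raw_storage_tb(daily_gb: float, days: int, replication: int) -> float:
    """Disk needed to keep `days` of raw data, including replicas, in TB."""
    return daily_gb * days * replication / 1000


if __name__ == "__main__":
    per_day = events_per_day(EVENTS_PER_MINUTE)
    low, high = (daily_volume_gb(per_day, b) for b in BYTES_PER_EVENT_RANGE)
    total_tb = raw_storage_tb(DAILY_VOLUME_GB, RAW_RETENTION_DAYS, REPLICATION_FACTOR)
    print(f"events/day: {per_day:,}")
    print(f"daily volume: {low:.0f}-{high:.0f} GB")
    print(f"raw storage for {RAW_RETENTION_DAYS} days: {total_tb:.1f} TB")
```

The aggregated per-minute and per-hour tables add to this total, but they are much smaller than the raw data.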
Choosing the Right Tools: Example (Marketecture)
Choosing the Right Tools: Example (Description - data)
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
Vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 305416 260688 29160 2356920 2 2 4 1 0 0 6 1 92 2 0
iostat
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu: %user %nice %system %iowait %steal %idle
5.68 0.00 0.52 2.03 0.00 91.76
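As an illustration of the per-minute aggregation named in the retention requirements, the access log sample above can be parsed and bucketed in plain Python. This is a sketch only: the function names are hypothetical, the regular expression covers the Common Log Format line shown on the slide and nothing else, and a production pipeline would run the same logic in a distributed engine.

```python
import re
from collections import Counter
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
ACCESS_LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line: str) -> dict:
    """Turn one access log line into a dict with typed fields."""
    match = ACCESS_LOG_RE.match(line)
    if match is None:
        raise ValueError(f"not a Common Log Format line: {line!r}")
    event = match.groupdict()
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    # A size of "-" means no body was sent.
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event


def requests_per_minute(events) -> Counter:
    """Count events per (minute, status) bucket."""
    return Counter(
        (e["time"].replace(second=0, microsecond=0), e["status"])
        for e in events
    )


if __name__ == "__main__":
    sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
              '"GET /apache_pb.gif HTTP/1.0" 200 2326')
    event = parse_access_line(sample)
    print(event["host"], event["status"], event["size"])
    print(requests_per_minute([event]))
```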
Choosing the Right Tools: Example (Proof-of-Concept)
4200 events / second
Choosing the Right Tools: Example (Compression & speed)
Compression Ratio
Access Speed
Choosing the Right Tools: Example (Accurate sizing)
Choosing the Right Tools: Checklist
1. Fastest random access to the data: Cloudera (Impala).
2. Universal (and fast!) access to data: MapR (MapR FS).
3. Data Integration: Hortonworks (built-in Talend).
4. Never trust papers, always double check: Proof-of-Concept.
5. Lastly, size the system correctly and check every element of the chain!
RDBMS vs. NoSQL
RDBMS vs. NoSQL
http://www.datastax.com/nosql-databases
It’s Not Necessarily Always Black and White!
• Traditional-relational
• Extended-relational
• Non-relational
• Lambda architecture (Hybrid)
• Data refinery (Hybrid)
SoftServe Lambda Architecture Accelerator
• Lambda architecture: a highly scalable and reliable data processing architecture based on Twitter’s successful experience in Big Data and analytics.
• Supports the majority of use cases: real-time analytics, data discovery, and business reports.
• SoftServe’s pre-built Lambda architecture stack shortens the customer’s time to market (TTM) by 15-20+ man-months.
RDBMS vs NoSQL: Checklist
1. RDBMS: Structured data, moderate velocity and volume (up to TB),
with complex transactions.
2. NoSQL: Unstructured data, high velocity or volume (up to PB+),
with simple transactions.
3. Hybrid, Lambda, Refinery: Something in-between.
NoSQL Data Modeling
NoSQL: How is it Different than RDBMS?
1. Write operations are cheap.
2. Fewer transactions and weaker consistency.
3. Read operations are blazingly fast!
NoSQL: Two Main Rules to Remember
1. Spread Data evenly around the cluster.
2. Minimize the number of partitions read.
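A small sketch shows why the choice of partition key drives the first rule. It assumes a toy cluster that places each row by hashing its partition key modulo the node count. Real NoSQL stores use their own partitioners, so the MD5-based `partition_for` here is a stand-in, not any product's algorithm.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_nodes: int) -> int:
    """Map a partition key to a node index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribution(keys, num_nodes: int) -> Counter:
    """Count how many rows land on each node."""
    return Counter(partition_for(k, num_nodes) for k in keys)


if __name__ == "__main__":
    # High-cardinality key: rows spread roughly evenly over 4 nodes.
    print(distribution((f"user-{i}" for i in range(10_000)), 4))
    # Low-cardinality, skewed key: at most two nodes do all the work.
    print(distribution(["CA"] * 9_000 + ["NY"] * 1_000, 4))
```

A key with few distinct values, or with one dominant value, creates hot partitions no matter how many nodes the cluster has.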
RDBMS: Queries Around Model
Q1: People who live in state X.
Q2: People who live in city Y.
Q3: People who live at address Z.
NoSQL: Model Around Queries!
Q1: People who live in state X. Q2: People who live in city Y. Q3: People who live at address Z.
People_by_States
state - Partition / Primary Key
country
first_name
last_name
city
street_name1
street_name2
street_number
People_by_City
city - Partition / Primary Key
country
first_name
last_name
state
street_name1
street_name2
street_number
People_by_FullAddress
country, city, state, street_name1 - Partition / Primary Key
first_name
last_name
street_name2
street_number
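The three tables above can be mimicked with in-memory dictionaries to show the trade-off: every insert is written three times, and every query reads exactly one partition. `PeopleStore` and the sample row are hypothetical, and the dictionaries only stand in for real NoSQL tables.

```python
from collections import defaultdict


class PeopleStore:
    """Toy stand-in for three denormalized tables, one per query."""

    def __init__(self):
        self.people_by_state = defaultdict(list)
        self.people_by_city = defaultdict(list)
        self.people_by_full_address = defaultdict(list)

    def add_person(self, person: dict) -> None:
        # One logical insert fans out into three cheap writes.
        self.people_by_state[person["state"]].append(person)
        self.people_by_city[person["city"]].append(person)
        address_key = (person["country"], person["city"],
                       person["state"], person["street_name1"])
        self.people_by_full_address[address_key].append(person)

    def in_state(self, state: str) -> list:
        """Q1: a single-partition read keyed by state."""
        return list(self.people_by_state.get(state, []))

    def in_city(self, city: str) -> list:
        """Q2: a single-partition read keyed by city."""
        return list(self.people_by_city.get(city, []))

    def at_address(self, country, city, state, street_name1) -> list:
        """Q3: a single-partition read keyed by the full address."""
        key = (country, city, state, street_name1)
        return list(self.people_by_full_address.get(key, []))


if __name__ == "__main__":
    store = PeopleStore()
    store.add_person({
        "first_name": "Ada", "last_name": "Example", "country": "US",
        "state": "TX", "city": "Austin", "street_name1": "Main St",
        "street_name2": "", "street_number": "1",
    })
    print(len(store.in_state("TX")), len(store.in_city("Austin")))
```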
Data Modeling: Checklist
1. In NoSQL, having a table for each query is totally OK. Don’t try to save disk space!
(Pay with cheap extra writes to get the fastest reads.)
2. There are (almost) no secondary indexes in NoSQL, only primary.
3. Pick the right primary (partitioning) key so that each request reads only one partition.
Deployment
Deployment Defined
In short, deployment is the litmus test of a project’s maturity.
And, the overall project success depends on it.
Deployment Stages
1. Bootstrapping: Create VMs and hosts.
2. Provisioning: Install software like Hadoop.
3. Configuration: Initial parameters and data.
4. Validation: Verify installation.
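The four stages form a strict sequence: a later stage is pointless if an earlier one failed. A minimal sketch of that control flow follows, with placeholder actions standing in for real tooling. `run_deployment` and the lambda stages are hypothetical and call no actual provisioning API.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_deployment(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that reports failure."""
    completed = []
    for name, action in stages:
        if not action():
            raise RuntimeError(f"stage failed: {name}")
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; a real pipeline would invoke its tooling here.
    stages = [
        ("Bootstrapping", lambda: True),   # create VMs and hosts
        ("Provisioning", lambda: True),    # install software
        ("Configuration", lambda: True),   # initial parameters and data
        ("Validation", lambda: True),      # verify the installation
    ]
    print(run_deployment(stages))
```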
Deployment: Manual vs. Automation
“Architectural Support for DevOps in a
Neo-Metropolis BDaaS Platform” © Valentyn Kropov,
Serge Haziyev, Rick Kazman, Hong-Mei Chen
Time Savings of: 89.75%!
Deployment: Automation
Provisioning, configuration, and verification (Ansible, Cloudera Director, Cloudera Manager, Ambari, Cloudbreak)
Bootstrapping (Terraform)
VM1 VM2 VM3 VM4 VM5 VM6
AWS / OpenStack / Google Cloud
Deployment: Automation (Hadoop Cluster)
1. Bootstrapping: HashiCorp Terraform.
2. Provisioning & Configuration: Cloudera Director.
3. Validation: Cloudera Manager API.
Service Layout & Memory Allocation
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
Automation: Checklist
1. Deployment should be fully automated (Terraform and Ansible).
2. Ensure service layout is correct (master nodes, worker nodes, and edge nodes).
3. Double check that the nodes have been given enough memory
(~64-128GB for master/edge nodes, ~256-512GB for data/worker nodes).
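The memory guideline is easy to turn into an automated check that runs as part of validation. The sketch below uses the lower bounds from the checklist as thresholds. The inventory format and the function name are assumptions for illustration.

```python
# Lower bounds from the checklist, in GB of RAM per node role.
MIN_MEMORY_GB = {"master": 64, "edge": 64, "worker": 256}


def undersized_nodes(nodes: list) -> list:
    """Return names of nodes whose RAM is below the minimum for their role."""
    return [
        node["name"]
        for node in nodes
        if node["memory_gb"] < MIN_MEMORY_GB[node["role"]]
    ]


if __name__ == "__main__":
    # Hypothetical inventory; real data would come from the cluster manager.
    inventory = [
        {"name": "master-1", "role": "master", "memory_gb": 128},
        {"name": "edge-1", "role": "edge", "memory_gb": 32},
        {"name": "worker-1", "role": "worker", "memory_gb": 256},
        {"name": "worker-2", "role": "worker", "memory_gb": 128},
    ]
    print(undersized_nodes(inventory))
```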
On-Premises vs. Cloud
On-Premises
(real hardware somewhere in your building or data center)
1. Highest data privacy (Regulations and sensitive data).
2. Quickest access to data (Latency).
3. Best velocity (Transfer rates).
4. Existing Hardware.
5. Control over resource usage.
Cloud (Amazon, Azure, etc.)
1. Efficient cost-reduction.
2. Universal access.
3. Flexibility.
4. Choice of applications.
5. Built-in maintenance and support.
6. Scalability!
Hybrid
1. Hybrid: a combination of on-premises and cloud.
2. On-premises: sensitive information and data for high-performance access.
3. Cloud: non-sensitive data.
On-Premises vs. Cloud
1. Oracle Exadata: ~$1,000,000.
2. For the same money, the biggest instance in Amazon EC2 (40 CPUs) runs for ~50 years!
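Reading the two figures above as "roughly $1,000,000 buys about 50 years of the largest EC2 instance", the implied hourly rate can be derived directly. The result comes from the slide's own numbers and is not a quoted AWS price. It also ignores storage, traffic, and price changes over time.

```python
HOURS_PER_YEAR = 365 * 24


def implied_hourly_rate(budget_usd: float, years: float) -> float:
    """Hourly price at which `budget_usd` lasts exactly `years` of uptime."""
    return budget_usd / (years * HOURS_PER_YEAR)


if __name__ == "__main__":
    rate = implied_hourly_rate(1_000_000, 50)
    print(f"implied rate: ${rate:.2f} per hour")
```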
On-Premises vs. Cloud: Checklist
1. On-premises: If the customer has existing unused hardware, has predictable data volume
growth, or has strong data security requirements.
2. Cloud: If the customer doesn’t have a large budget, is not sure about data & load
growth, and doesn’t have strong security requirements or a team of engineers to
support hardware.
3. Hybrid: Mixture of requirements above.
Scalability & Performance
Dedicated Clusters
Visualization Service
Data Ingestion Service
Analytics Service
VM1 VM2 VM3
VM1 VM2 VM2
VM4 VM5 VM6
VM7 VM8
• Configuration and
management of 3
separate clusters.
• Resources stay idle if
service is not active.
• Need to move data
between clusters for
each service.
Shared Clusters
Visualization Service
Data Ingestion Service
Analytics Service
Multiple clusters
Multiple clusters
...to maximize utilization
...to share data between services
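The utilization argument can be made concrete with a toy capacity calculation. Dedicated clusters must each be sized for their own peak, while a shared cluster only needs to cover the peak of the combined load. The demand figures below are invented for illustration and are not taken from the slides.

```python
def dedicated_capacity(demand_by_service: dict) -> int:
    """Each service gets its own cluster sized for its own peak."""
    return sum(max(series) for series in demand_by_service.values())


def shared_capacity(demand_by_service: dict) -> int:
    """One cluster sized for the peak of the combined demand."""
    return max(sum(slot) for slot in zip(*demand_by_service.values()))


if __name__ == "__main__":
    # Hypothetical VM demand per service over four time slots.
    demand = {
        "ingestion":     [6, 6, 2, 2],
        "analytics":     [1, 2, 6, 2],
        "visualization": [1, 1, 2, 4],
    }
    print("dedicated:", dedicated_capacity(demand), "VMs")
    print("shared:   ", shared_capacity(demand), "VMs")
```

The gap between the two numbers is the capacity that sits idle in the dedicated setup whenever the services do not peak at the same time.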
Shared Clusters: Mesos/Docker
OpenStack / AWS / Google Cloud / Azure
VM1 VM2 VM3 VM4 VM5
Shared Clusters: Mesos/Docker
Maximize utilization & performance:
Deliver more services with smaller footprint.
Shared clusters for all services:
Easier deployment and management with unified service platform.
Shared data between services:
Faster and more competitive services and solutions.
How Does this Work?
Zookeeper quorum
Mesos Master Mesos Master Mesos Master
Spark Service Scheduler Marathon Service Scheduler
Mesos Slave
Spark Task Executor Mesos Executor
Mesos Slave
Docker Executor Docker Executor
Task #1 Task #2 ./python XYZ java -jar XYZ.jar ./xyz
How Does this Work?
Mesos provides fine grained resource isolation
Mesos Slave Process
Spark Task Executor Mesos Executor
Task #2 ./python XYZ
Compute Node
Executor
Container
(cgroups)
Task #1
How Does this Work?
Mesos provides scalability
Mesos Slave Process
Spark Task Executor
Task #2
Compute Node
Container
(cgroups)
Task #1
The Python executor has finished, so more resources are available for more Spark tasks.
Task #3 Task #4
How Does this Work?
VM1 VM2 VM3 VM4 VM5
Mesos has no single point of failure.
Services keep running if a VM fails!
Mesos Master
Mesos Master Mesos Master
How Does this Work?
VM1 VM2 VM3 VM4 VM5
The master node can fail over.
Services keep running if a Mesos Master fails!
Mesos Master
Mesos Master Mesos Master
How Does this Work?
The slave process can fail over.
Tasks keep running if the Mesos Slave process fails!
Mesos Slave Process
Spark Task Executor
Task #2
Compute Node
Task #1 Task #3 Task #4
Scalability & Performance: Checklist
1. If you need real scalability then use shared clusters.
2. Shared clusters are a natural fit for the Cloud.
3. Scalability means performance (in most cases). Use it as a synonym.
Storage
Netflix Storage: Situation
1. ~25PB Data Warehouse on Amazon S3.
2. Read ~10% daily.
3. Write ~10% daily.
4. ~550 billion events daily.
5. ~350 active platform users (> 80% – Data Science engineers).
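To get a feel for what these numbers mean for the storage layer, they can be converted into average sustained rates. The sketch assumes decimal units (1 PB = 10^15 bytes) and a perfectly even load over 24 hours, which real traffic will not follow, so peaks are higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(total_per_day: float) -> float:
    """Average per-second rate for a daily total, assuming an even load."""
    return total_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    warehouse_bytes = 25e15                   # ~25PB, from the slide
    daily_read_bytes = warehouse_bytes * 0.10  # ~10% read daily
    events_per_day = 550e9                     # ~550 billion events daily

    read_gb_s = average_rate_per_second(daily_read_bytes) / 1e9
    events_s = average_rate_per_second(events_per_day)
    print(f"average read throughput: {read_gb_s:.1f} GB/s")
    print(f"average event rate: {events_s:,.0f} events/s")
```

The same read figure applies to writes, since the slide gives ~10% daily for both.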
Netflix Storage: Architecture (2013)
http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
Netflix Storage: Architecture (2014)
http://techblog.netflix.com/2014/11/genie-20-second-wish-granted.html
Netflix Storage: Architecture (2015)
http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-the-netflix-big-data-platform?qid=a9bda293-24df-4f6f-a06a-5b02eb751b35&v=&b=&from_search=1
Storage Comparison
1. Amazon S3: universal access, cheap, and data needs to be copied before processing.
2. HDFS: compatible with Hadoop ecosystem, relatively cheap, and data can be processed
where it is being stored.
3. Directly Attached Storage/Network Attached Storage: expensive, fastest access to data,
and it also can be processed where data is being stored.
Storage: Checklist
1. If you need unified access to data, use a universal Cloud FS
similar to Amazon S3.
2. For immediate access to data (OLTP system), you need Direct-Attached
Storage (DAS), Network-Attached Storage (NAS), Elastic Block
Store (Amazon EBS), or similar.
3. If you choose NoSQL, you’ll need much more space than actual data
(each query might require duplicate copy of data).
4. Pick storage carefully and use PoC/Prototyping, otherwise changing
storage later on will be hard, if not impossible.
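Item 3 of this checklist can be estimated up front. With one denormalized table per query, the footprint grows with the number of query tables and again with the replication factor. In the sketch below all inputs are hypothetical, and the replication factor in particular is an assumption the slides do not give.

```python
def nosql_storage_tb(raw_tb: float, query_tables: int, replication: int) -> float:
    """Worst-case footprint when every query table holds a full copy."""
    return raw_tb * query_tables * replication


if __name__ == "__main__":
    # Hypothetical inputs: 4.5 TB of source data, 3 query tables,
    # and a replication factor of 3.
    print(f"{nosql_storage_tb(4.5, 3, 3):.1f} TB")
```

Compression and tables that hold only a subset of columns lower the real figure, so treat the result as an upper bound to verify in the PoC.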
Final Checklist
Final Checklist
1. You’re the Boss!
2. You have a right to demand the infrastructure you need.
3. However, you need a solid argument to back it up.
4. Now you have it and know where to get details.
5. Good luck and see you in the field!
Contacts
vkrop@softserveinc.com
https://ua.linkedin.com/in/valentin-kropov-032a147
https://www.facebook.com/bigdatakyiv
USA HQ
Toll Free: 866-687-3588
Tel: +1-512-516-8880
Ukraine HQ
Tel: +380-32-240-9090
Bulgaria
Tel: +359-2-902-3760
Germany
Tel: +49-69-2602-5857
Netherlands
Tel: +31-20-262-33-23
Poland
Tel: +48-71-382-2800
UK
Tel: +44-207-544-8414
EMAIL
info@softserveinc.com
WEBSITE:
www.softserveinc.com
Thank you!

Mais conteúdo relacionado

Mais procurados

Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
Anna Shymchenko
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
Romeo Kienzler
 
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Codecamp Romania
 

Mais procurados (20)

Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
Big Data Day LA 2016/ Use Case Driven track - Shaping the Role of Data Scienc...
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
SplunkSummit 2015 - Real World Big Data Architecture
SplunkSummit 2015 -  Real World Big Data ArchitectureSplunkSummit 2015 -  Real World Big Data Architecture
SplunkSummit 2015 - Real World Big Data Architecture
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
 
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
Sören Eickhoff, Informatica GmbH, "Informatica Intelligent Data Lake – Self S...
 
Data engineering
Data engineeringData engineering
Data engineering
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...Finding the needle in the haystack: how Nestle is leveraging big data to defe...
Finding the needle in the haystack: how Nestle is leveraging big data to defe...
 
Disrupting Insurance with Advanced Analytics The Next Generation Carrier
Disrupting Insurance with Advanced Analytics The Next Generation CarrierDisrupting Insurance with Advanced Analytics The Next Generation Carrier
Disrupting Insurance with Advanced Analytics The Next Generation Carrier
 
Data Engineering for Data Scientists
Data Engineering for Data Scientists Data Engineering for Data Scientists
Data Engineering for Data Scientists
 
Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes
 
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
Shortening the Feedback Loop: How Spotify’s Big Data Ecosystem has evolved to...
 
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcaderoIasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
Iasi code camp 20 april 2013 testing big data-anca sfecla - embarcadero
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
 

Destaque

Business Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysisBusiness Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysis
Ahsan Khan Eco (Superior College)
 
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine LearningData Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Kai Wähner
 

Destaque (12)

Exploring Data Preparation and Visualization Tools for Urban Forestry
Exploring Data Preparation and Visualization Tools for Urban ForestryExploring Data Preparation and Visualization Tools for Urban Forestry
Exploring Data Preparation and Visualization Tools for Urban Forestry
 
Data Preparation for Data Science
Data Preparation for Data ScienceData Preparation for Data Science
Data Preparation for Data Science
 
Data Science in the cloud with Microsoft Azure
Data Science in the cloud with Microsoft Azure Data Science in the cloud with Microsoft Azure
Data Science in the cloud with Microsoft Azure
 
Grokking: Data Engineering Course
Grokking: Data Engineering CourseGrokking: Data Engineering Course
Grokking: Data Engineering Course
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Reinventing the Modern Information Pipeline: Paxata and MapR
Reinventing the Modern Information Pipeline: Paxata and MapRReinventing the Modern Information Pipeline: Paxata and MapR
Reinventing the Modern Information Pipeline: Paxata and MapR
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Business Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysisBusiness Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysis
 
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine LearningData Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
Data Preparation vs. Inline Data Wrangling in Data Science and Machine Learning
 
7 key recipes for data engineering
7 key recipes for data engineering7 key recipes for data engineering
7 key recipes for data engineering
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and Processing
 
Culture Code: Creating A Lovable Company
Culture Code: Creating A Lovable CompanyCulture Code: Creating A Lovable Company
Culture Code: Creating A Lovable Company
 

Semelhante a Essential Data Engineering for Data Scientist

Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
HostedbyConfluent
 

Semelhante a Essential Data Engineering for Data Scientist (20)

Introduction to Azure DocumentDB
Introduction to Azure DocumentDBIntroduction to Azure DocumentDB
Introduction to Azure DocumentDB
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Building a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloadsBuilding a scalable analytics environment to support diverse workloads
Building a scalable analytics environment to support diverse workloads
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
DoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics PlatformDoneDeal - AWS Data Analytics Platform
DoneDeal - AWS Data Analytics Platform
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and moreBig Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Thing you didn't know you could do in Spark
Thing you didn't know you could do in SparkThing you didn't know you could do in Spark
Thing you didn't know you could do in Spark
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Idera live 2021: Managing Databases in the Cloud - the First Step, a Succes...
Idera live 2021:   Managing Databases in the Cloud - the First Step, a Succes...Idera live 2021:   Managing Databases in the Cloud - the First Step, a Succes...
Idera live 2021: Managing Databases in the Cloud - the First Step, a Succes...
 

Mais de SoftServe

Mais de SoftServe (20)

Approaching Quality in Digital Era
Approaching Quality in Digital EraApproaching Quality in Digital Era
Approaching Quality in Digital Era
 
Digital Product Security
Digital Product SecurityDigital Product Security
Digital Product Security
 
Testing Tools and Tips
Testing Tools and TipsTesting Tools and Tips
Testing Tools and Tips
 
Android Mobile Application Testing: Human Interface Guideline, Tools
Android Mobile Application Testing: Human Interface Guideline, ToolsAndroid Mobile Application Testing: Human Interface Guideline, Tools
Android Mobile Application Testing: Human Interface Guideline, Tools
 
Android Mobile Application Testing: Specific Functional, Performance, Device ...
Android Mobile Application Testing: Specific Functional, Performance, Device ...Android Mobile Application Testing: Specific Functional, Performance, Device ...
Android Mobile Application Testing: Specific Functional, Performance, Device ...
 
How to Reduce Time to Market Using Microsoft DevOps Solutions
How to Reduce Time to Market Using Microsoft DevOps SolutionsHow to Reduce Time to Market Using Microsoft DevOps Solutions
How to Reduce Time to Market Using Microsoft DevOps Solutions
 
Containerization: The DevOps Revolution
Containerization: The DevOps Revolution Containerization: The DevOps Revolution
Containerization: The DevOps Revolution
 
Rapid Prototyping for Big Data with AWS
Rapid Prototyping for Big Data with AWS Rapid Prototyping for Big Data with AWS
Rapid Prototyping for Big Data with AWS
 
Implementing Test Automation: What a Manager Should Know
Implementing Test Automation: What a Manager Should KnowImplementing Test Automation: What a Manager Should Know
Implementing Test Automation: What a Manager Should Know
 
Using AWS Lambda for Infrastructure Automation and Beyond
Using AWS Lambda for Infrastructure Automation and BeyondUsing AWS Lambda for Infrastructure Automation and Beyond
Using AWS Lambda for Infrastructure Automation and Beyond
 
Advanced Analytics and Data Science Expertise
Advanced Analytics and Data Science ExpertiseAdvanced Analytics and Data Science Expertise
Advanced Analytics and Data Science Expertise
 
Big Data as a Service: A Neo-Metropolis Model Approach for Innovation
Big Data as a Service: A Neo-Metropolis Model Approach for InnovationBig Data as a Service: A Neo-Metropolis Model Approach for Innovation
Big Data as a Service: A Neo-Metropolis Model Approach for Innovation
 
Personalized Medicine in a Contemporary World by Eugene Borukhovich, SVP Heal...
Personalized Medicine in a Contemporary World by Eugene Borukhovich, SVP Heal...Personalized Medicine in a Contemporary World by Eugene Borukhovich, SVP Heal...
Personalized Medicine in a Contemporary World by Eugene Borukhovich, SVP Heal...
 
Health 2.0 WinterTech: Will Artificial Intelligence change healthcare? by Eug...
Health 2.0 WinterTech: Will Artificial Intelligence change healthcare? by Eug...Health 2.0 WinterTech: Will Artificial Intelligence change healthcare? by Eug...
Health 2.0 WinterTech: Will Artificial Intelligence change healthcare? by Eug...
 
Managing Requirements with Word and TFS by Max Markov
Managing Requirements with Word and TFS by Max MarkovManaging Requirements with Word and TFS by Max Markov
Managing Requirements with Word and TFS by Max Markov
 
How to Implement Hybrid Cloud Solutions Successfully
How to Implement Hybrid Cloud Solutions SuccessfullyHow to Implement Hybrid Cloud Solutions Successfully
How to Implement Hybrid Cloud Solutions Successfully
 
Designing Big Data Systems Like a Pro
Designing Big Data Systems Like a ProDesigning Big Data Systems Like a Pro
Designing Big Data Systems Like a Pro
 
Product Management in Outsourcing by Roman Kolodchak and Roman Pavlyuk
Product Management in Outsourcing by Roman Kolodchak and Roman PavlyukProduct Management in Outsourcing by Roman Kolodchak and Roman Pavlyuk
Product Management in Outsourcing by Roman Kolodchak and Roman Pavlyuk
 
From Sandbox to Production by Vadym Fedorov
From Sandbox to Production by Vadym FedorovFrom Sandbox to Production by Vadym Fedorov
From Sandbox to Production by Vadym Fedorov
 
Why Ukraine? by Brian Borack, COO
Why Ukraine? by Brian Borack, COOWhy Ukraine? by Brian Borack, COO
Why Ukraine? by Brian Borack, COO
 

Essential Data Engineering for Data Scientist

  • 2. Me, myself, and I: Valentyn Kropov • Sr. Big Data Solutions Architect. • 14 years of work experience with Databases. • 4 years in Big Data. • Big Data Consulting Lead at SoftServe (20+ Engineers and Architects). • Founder of Kyiv Big Data Community (600+ people). webinar
  • 3. Agenda 1. Level of Involvement 2. Choosing the Right Tools (Distribution of Hadoop) 3. RDBMS vs. NoSQL 4. NoSQL Data Modeling 5. Deployment 6. On-Premises vs. Cloud 7. Scalability and Performance 8. Storage
  • 5. Who Should be Leading Data Science Projects?
  • 6. Project Stages from Data Engineering Perspective 1. Statement of work 2. Requirements 3. Architecture 4. Infrastructure 5. Data modeling/ETL 6. Data Science modeling
  • 7. Involvement: Checklist 1. You’re the boss! 2. You have a right to demand the infrastructure you need. 3. But, you need to have perfect argumentation. 4. And I’ll show it to you right now.
  • 9. Big Data Landscape 2016 http://goo.gl/Rp9Axm
  • 10. Big Data Analytics Reference Architecture A modern-integrated approach for solving Big Data/Business Analytics needs across multiple verticals and domains. All Data Real-time Data Processing Data Acquisition and Storing DataIntegration Enterprise Data Warehousing Data Management (Governance, Security, Quality, MDM) Analytics Reporting and Analysis Predictive Modeling Data Mining Data Lake (Landing, Exploration and Archiving) UX and Visualization Applications Application data Media data: images, video, etc Social data Enterprise content data Machine, sensor, log data Docs and archives data Customer Analytics Marketing Analytics Web/Mobile/ Social Analytics IT Operational Analytics Fraud and Risk Analytics Complex Event Processing Real-time Query and Search
  • 11. Hortonworks vs. Cloudera vs. MapR
        Feature                    | Hortonworks    | Cloudera         | MapR
        File system                | HDFS           | HDFS             | MapR FS
        Non-Hadoop access          | NFS            | Fuse-DFS         | Direct Access NFS
        Data integration services  | Talend         | -                | -
        Data analysis framework    | -              | DataFu           | -
        Software abstraction layer | -              | -                | Apache Cascading
        Web access                 | WebHDFS        | HttpFS           | -
        Parallel query execution   | Tez (Stinger)  | Impala           | -
        Installation               | Ambari         | Cloudera Manager | -
        Security                   | -              | Sentry           | -
        Monitoring                 | Ganglia/Nagios | -                | -
        Non-MapReduce tasks        | YARN           | YARN             | -
        Source: http://www.networkworld.com/article/2369327/software/comparing-the-top-hadoop-distributions.html
  • 12. Or Even More: IBM, Oracle, Amazon, … 1. IBM: Big R (set of Data Science algorithms) and Big SQL (SQL-like interface to data). 2. Oracle: Big Data appliance/connectors. 3. Amazon: Elastic MapReduce.
  • 13. Choosing the Right Tools: Example (Description) Data Volume: • 270-300 Web Servers (Apache HTTPD) • 447 392 events per minute • 644 245 094 events / day • ~100-250 bytes per event • 150GB of data per day Log Types: • Apache HTTPD access log • Apache HTTPD error log • Service log (CPU, RAM, I/O, Disk) • Application server servlet log Retention: • Last 30 days: Raw data • Last 24 hours: per minute aggregation • Whole period: per hour aggregation
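The volume figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes decimal gigabytes and a constant event rate, and the variable names are illustrative. The computed events per day differ slightly from the slide's 644 245 094, most likely because the per-minute figure is rounded.

```python
# Back-of-the-envelope sizing from the figures quoted on the slide.
EVENTS_PER_MINUTE = 447_392          # from the slide
BYTES_PER_EVENT_RANGE = (100, 250)   # from the slide (~100-250 bytes per event)
DAILY_VOLUME_GB = 150                # from the slide
RAW_RETENTION_DAYS = 30              # from the slide (raw data kept for 30 days)
GB = 10**9                           # assumption: decimal gigabytes

# Assumption: the event rate is constant over the whole day.
events_per_day = EVENTS_PER_MINUTE * 60 * 24


def daily_volume_gb(bytes_per_event: int) -> float:
    """Daily raw volume in GB for a given average event size."""
    return events_per_day * bytes_per_event / GB


low_gb, high_gb = (daily_volume_gb(b) for b in BYTES_PER_EVENT_RANGE)

# Raw storage needed for the retention window, before replication or compression.
raw_retention_tb = DAILY_VOLUME_GB * RAW_RETENTION_DAYS / 1000

print(f"events/day: {events_per_day:,}")
print(f"daily volume: {low_gb:.1f}-{high_gb:.1f} GB")
print(f"30-day raw retention: {raw_retention_tb} TB")
```

The slide's 150 GB per day falls inside the computed range, which suggests an average event size near the upper end of 100-250 bytes.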
  • 14. Choosing the Right Tools: Example (Marketecture)
  • 15. Choosing the Right Tools: Example (Description - data)
        Access log:
          127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
        Error log:
          [Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
          [Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
        vmstat:
          procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
           r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
           0  0 305416 260688  29160 2356920    2    2     4     1    0    0  6  1 92  2  0
        iostat:
          Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
          avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                     5.68    0.00    0.52    2.03    0.00   91.76
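As an illustration of what ingesting the access log involves, here is a minimal sketch that parses the sample access-log line into typed fields. It assumes the Apache Common Log Format shown in the sample; the function name `parse_access_line` and the field names are illustrative, not part of the original deck, and a production pipeline would also need to handle malformed lines and the other log types.

```python
import re
from datetime import datetime

# Assumption: lines follow the Apache Common Log Format, as in the sample.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line: str) -> dict:
    """Parse one access-log line into a dict of typed fields (hypothetical helper)."""
    match = CLF_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a Common Log Format line: {line!r}")
    fields = match.groupdict()
    # Timestamp looks like 10/Oct/2000:13:55:36 -0700
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Apache logs "-" when no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = parse_access_line(sample)
print(event["host"], event["user"], event["path"], event["status"], event["size"])
```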
  • 16. Choosing the Right Tools: Example (Description - data)
  • 17. Choosing the Right Tools: Example (Proof-of-Concept) 4200 events / second
  • 18. Choosing the Right Tools: Example (Compression & speed) Compression Ratio Access Speed
  • 19. Choosing the Right Tools: Example (Accurate sizing)
  • 20. Choosing the Right Tools: Checklist 1. Fastest random access to the data: Cloudera (Impala). 2. Universal (and fast!) access to data: MapR (MapR FS). 3. Data Integration: Hortonworks (built-in Talend). 4. Never trust papers, always double check with a Proof-of-Concept. 5. Lastly, size the system correctly and check every element of the chain!
  • 23. It’s Not Necessarily Always Black and White! • Traditional-relational • Extended-relational • Non-relational • Lambda architecture (Hybrid) • Data refinery (Hybrid)
  • 24. SoftServe Lambda Architecture Accelerator • Lambda architecture is a highly scalable and reliable data processing architecture based on Twitter's successful experience in Big Data and Analytics. • Supports the majority of use cases: real-time analytics, data discovery, and business reports. • SoftServe’s pre-built Lambda architecture stack accelerates the customer’s Time to Market (TTM) by 15-20+ man-months.
  • 25. RDBMS vs NoSQL: Checklist 1. RDBMS: Structured data, moderate velocity and volume (up to TB), with complex transactions. 2. NoSQL: Unstructured data, high velocity or volume (up to PB+), with simple transactions. 3. Hybrid, Lambda, Refinery: Something in-between.
  • 27. NoSQL: How Is It Different from RDBMS? 1. Write operations are cheap. 2. Fewer transactions and weaker consistency. 3. Read operations are blazingly fast!
  • 28. NoSQL: Two Main Rules to Remember 1. Spread Data evenly around the cluster. 2. Minimize the number of partitions read.
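The two rules can be illustrated with a toy hash partitioner. This is a minimal sketch, not how any particular NoSQL store is implemented: the partition count, the use of MD5, and the key names are all assumptions chosen for illustration. It shows that a high-cardinality key spreads rows evenly, while a low-cardinality key puts everything into one hot partition, and that a lookup by partition key touches exactly one partition.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical cluster with 8 partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partition key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def distribution(keys) -> Counter:
    """Count how many rows land in each partition."""
    return Counter(partition_for(k) for k in keys)


# Rule 1: a high-cardinality key (one value per user) spreads rows evenly.
by_user = distribution(f"user-{i}" for i in range(10_000))

# A low-cardinality key (everyone in the same country) creates one hot partition.
by_country = distribution(["UA"] * 10_000)

# Rule 2: a read by partition key needs to visit exactly one partition.
target_partition = partition_for("user-42")

print("partitions used by user key:", len(by_user))
print("partitions used by country key:", len(by_country))
print("partition to read for user-42:", target_partition)
```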
  • 29. RDBMS: Queries Around Model Q1: People who live in state X. Q2: People who live in city Y. Q3: People who live at address Z.
  • 30. NoSQL: Model Around Queries! Q1: People who live in state X. Q2: People who live in city Y. Q3: People who live at address Z.
        People_by_States
          Partition / primary key: state
          Other columns: country, first_name, last_name, city, street_name1, street_name2, street_number
        People_by_City
          Partition / primary key: city
          Other columns: country, first_name, last_name, state, street_name1, street_name2, street_number
        People_by_FullAddress
          Partition / primary key: country, city, state, street_name1
          Other columns: first_name, last_name, street_name2, street_number
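The "model around queries" idea can be sketched with plain Python dictionaries standing in for the three tables. This is a toy illustration under stated assumptions, not a real NoSQL client: `insert_person` is a hypothetical helper and the sample people are invented. Each write goes to all three tables (cheap, duplicated writes), and each query reads a single partition.

```python
from collections import defaultdict

# One in-memory "table" per query, keyed by that query's partition key.
people_by_state = defaultdict(list)
people_by_city = defaultdict(list)
people_by_full_address = defaultdict(list)


def insert_person(person: dict) -> None:
    """Write the same row to every query table (hypothetical helper)."""
    people_by_state[person["state"]].append(person)
    people_by_city[person["city"]].append(person)
    address_key = (
        person["country"],
        person["city"],
        person["state"],
        person["street_name1"],
    )
    people_by_full_address[address_key].append(person)


# Invented sample rows, for illustration only.
insert_person({"first_name": "Ann", "last_name": "Lee", "country": "US",
               "state": "TX", "city": "Austin", "street_name1": "Main St",
               "street_name2": "", "street_number": "1"})
insert_person({"first_name": "Bob", "last_name": "Ray", "country": "US",
               "state": "TX", "city": "Dallas", "street_name1": "Elm St",
               "street_name2": "", "street_number": "7"})

# Q1, Q2, Q3: each query is a single-partition lookup.
q1 = [p["first_name"] for p in people_by_state["TX"]]
q2 = [p["first_name"] for p in people_by_city["Austin"]]
q3 = [p["first_name"] for p in people_by_full_address[("US", "Dallas", "TX", "Elm St")]]
print(q1, q2, q3)
```

The cost of this design is storage: every person is stored three times, once per query.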
  • 31. Data Modeling: Checklist 1. In NoSQL, it is perfectly fine to have a table for each query; don't try to save disk space (trade extra, cheap writes for the fastest reads). 2. There are (almost) no secondary indexes in NoSQL, only primary ones. 3. Pick the right primary (partition) key so that each request reads only one partition.
  • 33. Deployment Defined In short, deployment is the litmus test of a project's maturity, and overall project success depends on it.
  • 34. Deployment Stages 1. Bootstrapping: Create VMs and hosts. 2. Provisioning: Install software such as Hadoop. 3. Configuration: Set initial parameters and load initial data. 4. Validation: Verify the installation.
  • 35. Deployment: Manual vs. Automation “Architectural Support for DevOps in a Neo-Metropolis BDaaS Platform” © Valentyn Kropov, Serge Haziyev, Rick Kazman, Hong-Mei Chen Time Savings of: 89.75%!
  • 36. Deployment: Automation Bootstrapping (Terraform): VMs on AWS / OpenStack / Google Cloud. Provisioning, configuration, and verification (Ansible, Cloudera Director, Cloudera Manager, Ambari, Cloudbreak).
  • 37. Deployment: Automation (Hadoop Cluster) 1. Bootstrapping: HashiCorp Terraform. 2. Provisioning & Configuration: Cloudera Director. 3. Validation: Cloudera Manager API.
  • 38. Service Layout & Memory Allocation http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
  • 39. Automation: Checklist 1. Deployment should be fully automated (Terraform and Ansible). 2. Ensure the service layout is correct (master nodes, worker nodes, and edge nodes). 3. Double check that the nodes have enough memory (~64-128GB for master/edge nodes, ~256-512GB for data/worker nodes).
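The memory check in item 3 is easy to automate as part of the validation stage. The sketch below is a hypothetical example: the role names and the approximate ranges come from the checklist, but treating the lower bound as a hard minimum, the `undersized_nodes` helper, and the sample cluster inventory are assumptions made for illustration.

```python
# Approximate memory ranges per role, in GB, taken from the checklist.
MEMORY_RANGES_GB = {
    "master": (64, 128),
    "edge": (64, 128),
    "worker": (256, 512),
}


def undersized_nodes(nodes: list) -> list:
    """Return names of nodes whose RAM is below the lower bound for their role.

    Assumption: the lower end of each range is treated as a minimum.
    """
    return [
        node["name"]
        for node in nodes
        if node["memory_gb"] < MEMORY_RANGES_GB[node["role"]][0]
    ]


# Invented inventory, for illustration only.
cluster = [
    {"name": "master-1", "role": "master", "memory_gb": 128},
    {"name": "edge-1", "role": "edge", "memory_gb": 64},
    {"name": "worker-1", "role": "worker", "memory_gb": 256},
    {"name": "worker-2", "role": "worker", "memory_gb": 128},
]

print("undersized:", undersized_nodes(cluster))
```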
  • 41. On-Premises (real hardware somewhere in your building or data center) 1. Highest data privacy (Regulations and sensitive data). 2. Quickest access to data (Latency). 3. Best velocity (Transfer rates). 4. Existing Hardware. 5. Control over resource usage.
  • 42. Cloud (Amazon, Azure, etc.) 1. Efficient cost-reduction. 2. Universal access. 3. Flexibility. 4. Choice of applications. 5. Built-in maintenance and support. 6. Scalability!
  • 43. Hybrid 1. Hybrid: a combination of on-premises and cloud. 2. On-premises: sensitive information and data for high-performance access. 3. Cloud: non-sensitive data.
  • 44. On-Premises vs. Cloud 1. Oracle Exadata: ~$1,000,000. 2. The same budget rents the biggest Amazon EC2 instance (40 CPUs) for ~50 years!
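Reading the slide as "the price of the appliance would pay for roughly 50 years of the largest EC2 instance", the implied hourly rate can be worked out directly. This is a rough sketch under that reading: it ignores storage, network, licensing, support, and staff costs, and the slide's figures are approximate. The function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: leap years ignored


def implied_hourly_rate(budget_usd: float, years: float) -> float:
    """Hourly price at which a budget lasts the given number of years."""
    return budget_usd / (years * HOURS_PER_YEAR)


def years_of_cloud(budget_usd: float, hourly_rate_usd: float) -> float:
    """How long a budget pays for one instance running around the clock."""
    return budget_usd / (hourly_rate_usd * HOURS_PER_YEAR)


# Figures from the slide: ~$1,000,000 versus ~50 years of the biggest instance.
rate = implied_hourly_rate(1_000_000, 50)
print(f"implied on-demand rate: ${rate:.2f} per hour")
print(f"years covered at that rate: {years_of_cloud(1_000_000, rate):.0f}")
```

Plugging in a current hourly price for a candidate instance type gives a quick first estimate of the break-even point before a more careful cost comparison.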
  • 45. On-Premises vs. Cloud: Checklist 1. On-premises: If the customer has existing unused hardware, predictable data volume growth, or strong data security requirements. 2. Cloud: If the customer doesn’t have a large budget, is not sure about data and load growth, and doesn’t have strong security requirements or a team of engineers to support hardware. 3. Hybrid: A mixture of the requirements above.
  • 47. Dedicated Clusters The Visualization Service, Data Ingestion Service, and Analytics Service each run on their own set of VMs. • Configuration and management of 3 separate clusters. • Resources stay idle if a service is not active. • Need to move data between clusters for each service.
  • 48. Shared Clusters Visualization Service Data Ingestion Service Analytics Service Multiple clusters Multiple clusters ...to maximize utilization ...to share data between services
  • 49. Shared Clusters: Mesos/Docker OpenStack / AWS / Google Cloud / Azure VM1 VM2 VM3 VM4 VM5
  • 50. Shared Clusters: Mesos/Docker Maximize utilization & performance: Deliver more services with smaller footprint. Shared clusters for all services: Easier deployment and management with unified service platform. Shared data between services: Faster and more competitive services and solutions.
  • 51. How Does this Work? Zookeeper quorum Mesos Master Mesos Master Mesos Master Spark Service Scheduler Marathon Service Scheduler Mesos Slave Spark Task Executor Mesos Executor Mesos Slave Docker Executor Docker Executor Task #1 Task #2 ./python XYZ java -jar XYZ.jar ./xyz
  • 52. How Does this Work? Mesos provides fine-grained resource isolation. Mesos Slave Process Spark Task Executor Mesos Executor Task #2 ./python XYZ Compute Node Executor Container (cgroups) Task #1
  • 53. How Does this Work? Mesos provides scalability. Mesos Slave Process Spark Task Executor Compute Node Container (cgroups) Task #1 Task #2 Task #3 Task #4 The Python executor has finished, so more resources are available and more Spark tasks can run.
  • 54. How Does this Work? Mesos has no single point of failure: services keep running if a VM fails! VM1 VM2 VM3 VM4 VM5 Mesos Master Mesos Master Mesos Master
  • 55. How Does this Work? The master node can fail over: services keep running if a Mesos Master fails! VM1 VM2 VM3 VM4 VM5 Mesos Master Mesos Master Mesos Master
  • 56. How Does this Work? The slave process can fail over: tasks keep running if the Mesos Slave Process fails! Mesos Slave Process Spark Task Executor Compute Node Task #1 Task #2 Task #3 Task #4
  • 57. Scalability & Performance: Checklist 1. If you need real scalability, use shared clusters. 2. Shared clusters are a natural fit for the cloud. 3. In most cases, scalability translates into performance, so you can treat the two as synonyms.
  • 59. Netflix Storage: Situation 1. ~25PB Data Warehouse on Amazon S3. 2. Read ~10% daily. 3. Write ~10% daily. 4. ~550 billion events daily. 5. ~350 active platform users (> 80% – Data Science engineers).
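To get a feel for these numbers, the daily figures can be converted into sustained per-second rates. The sketch below is simple arithmetic on the slide's approximate values; it assumes decimal petabytes and a perfectly even load over the day, which real traffic will not have.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PB = 10**15  # assumption: decimal petabytes
GB = 10**9

# Approximate figures from the slide.
warehouse_bytes = 25 * PB
daily_read_bytes = warehouse_bytes // 10   # ~10% of the warehouse read daily
events_per_day = 550 * 10**9               # ~550 billion events daily

# Assumption: load is spread evenly across the day.
read_gb_per_second = daily_read_bytes / SECONDS_PER_DAY / GB
events_per_second = events_per_day / SECONDS_PER_DAY

print(f"daily read volume: {daily_read_bytes / PB} PB")
print(f"sustained read rate: {read_gb_per_second:.1f} GB/s")
print(f"sustained event rate: {events_per_second:,.0f} events/s")
```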
  • 60. Netflix Storage: Architecture (2013) http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
  • 61. Netflix Storage: Architecture (2014) http://techblog.netflix.com/2014/11/genie-20-second-wish-granted.html
  • 62. Netflix Storage: Architecture (2015) http://www.slideshare.net/AmazonWebServices/bdt303-running-spark-and-presto-on-the-netflix-big-data-platform?qid=a9bda293-24df-4f6f-a06a-5b02eb751b35&v=&b=&from_search=1
  • 63. Storage Comparison 1. Amazon S3: universal access, cheap, and data needs to be copied before processing. 2. HDFS: compatible with Hadoop ecosystem, relatively cheap, and data can be processed where it is being stored. 3. Directly Attached Storage/Network Attached Storage: expensive, fastest access to data, and it also can be processed where data is being stored.
  • 64. Storage: Checklist 1. If you need unified access to data, use a universal cloud file system such as Amazon S3. 2. For immediate access to data (OLTP systems), you need Directly Attached Storage (DAS), Network Attached Storage (NAS), Amazon Elastic Block Store (EBS), or similar. 3. If you choose NoSQL, you’ll need much more space than the actual data takes (each query might require its own duplicate copy of the data). 4. Pick storage carefully and validate it with a PoC or prototype; changing storage later is hard to almost impossible.
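Item 3 can be turned into a quick estimate. The sketch below multiplies the raw data size by the number of query tables and by the replication factor; it is a deliberately crude upper bound that ignores compression and per-row overhead. The function name and the example inputs (4.5 TB of raw data, 3 query tables, replication factor 3) are hypothetical values chosen for illustration.

```python
def required_storage_tb(raw_tb: float, tables_per_dataset: int,
                        replication_factor: int) -> float:
    """Crude upper bound on storage when every query table holds a full copy.

    Assumptions: each query table duplicates the whole dataset, every copy
    is replicated, and compression is ignored.
    """
    return raw_tb * tables_per_dataset * replication_factor


# Hypothetical inputs: 4.5 TB raw, one table per query for 3 queries,
# and 3 replicas of every table.
estimate_tb = required_storage_tb(raw_tb=4.5, tables_per_dataset=3,
                                  replication_factor=3)
print(f"estimated storage: {estimate_tb} TB")
```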
  • 66. Final Checklist 1. You’re the Boss! 2. You have a right to demand the infrastructure you need. 3. However, you need to have perfect argumentation. 4. Now you have it and know where to get details. 5. Good luck and see you in the field!
  • 68. USA HQ Toll Free: 866-687-3588 Tel: +1-512-516-8880 Ukraine HQ Tel: +380-32-240-9090 Bulgaria Tel: +359-2-902-3760 Germany Tel: +49-69-2602-5857 Netherlands Tel: +31-20-262-33-23 Poland Tel: +48-71-382-2800 UK Tel: +44-207-544-8414 EMAIL info@softserveinc.com WEBSITE: www.softserveinc.com Thank you!