2. Me, myself, and I: Valentyn Kropov
• Sr. Big Data Solutions Architect.
• 14 years of work experience with Databases.
• 4 years in Big Data.
• Big Data Consulting Lead at SoftServe (20+ Engineers and Architects).
• Founder of Kyiv Big Data Community (600+ people).
webinar
3. Agenda
1. Level of Involvement
2. Choosing the Right Tools (Distribution of Hadoop)
3. RDBMS vs. NoSQL
4. NoSQL Data Modeling
5. Deployment
6. On-Premises vs. Cloud
7. Scalability and Performance
8. Storage
6. Project Stages from Data Engineering Perspective
1. Statement of work
2. Requirements
3. Architecture
4. Infrastructure
5. Data modeling/ETL
6. Data Science modeling
7. Involvement: Checklist
1. You’re the boss!
2. You have a right to demand the infrastructure you need.
3. But, you need to have perfect argumentation.
4. And I’ll show it to you right now.
10. Big Data Analytics Reference Architecture
A modern, integrated approach for solving Big Data/Business Analytics needs across multiple verticals and domains.
[Diagram: reference architecture]
• All data (sources): application data; media data (images, video, etc.); social data; enterprise content data; machine, sensor, and log data; docs and archives data.
• Components: Data Acquisition and Storing; Data Integration; Real-time Data Processing; Complex Event Processing; Real-time Query and Search; Data Lake (Landing, Exploration and Archiving); Enterprise Data Warehousing; Data Management (Governance, Security, Quality, MDM); Analytics (Reporting and Analysis, Predictive Modeling, Data Mining); UX and Visualization.
• Applications: Customer Analytics; Marketing Analytics; Web/Mobile/Social Analytics; IT Operational Analytics; Fraud and Risk Analytics.
11. Hortonworks vs. Cloudera vs. MapR
Feature                    | Hortonworks    | Cloudera         | MapR
File system                | HDFS           | HDFS             | MapR FS
Non-Hadoop access          | NFS            | Fuse-DFS         | Direct Access NFS
Data integration services  | Talend         | -                | -
Data analysis framework    | -              | DataFu           | -
Software abstraction layer | -              | -                | Apache Cascading
Web access                 | WebHDFS        | HttpFS           | -
Parallel query execution   | Tez (Stinger)  | Impala           | -
Installation               | Ambari         | Cloudera Manager | -
Security                   | -              | Sentry           | -
Monitoring                 | Ganglia/Nagios | -                | -
Non-MapReduce tasks        | YARN           | YARN             | -
http://www.networkworld.com/article/2369327/software/comparing-the-top-hadoop-distributions.html
12. Or Even More: IBM, Oracle, Amazon, …
1. IBM: Big R (set of Data Science algorithms) and Big SQL (SQL-like interface to data).
2. Oracle: Big Data appliance/connectors.
3. Amazon: Elastic MapReduce.
13. Choosing the Right Tools: Example (Description)
Data Volume:
• 270-300 Web Servers (Apache HTTPD)
• 447 392 events per minute
• 644 245 094 events / day
• ~100-250 bytes per event
• 150GB of data per day
Log Types:
• Apache HTTPD access log
• Apache HTTPD error log
• Service log (CPU, RAM, I/O, Disk)
• Application server servlet log
Retention:
• Last 30 days: Raw data
• Last 24 hours: per minute aggregation
• Whole period: per hour aggregation
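The volume figures above can be cross-checked with a little arithmetic. This is a minimal sketch that uses only the numbers from this slide; the function name and the decimal-gigabyte convention (1 GB = 10**9 bytes) are assumptions made for illustration.

```python
EVENTS_PER_MINUTE = 447_392       # from the slide
EVENT_BYTES_RANGE = (100, 250)    # ~100-250 bytes per event
RAW_RETENTION_DAYS = 30           # raw data is kept for the last 30 days
DAILY_VOLUME_GB = 150             # daily volume stated on the slide


def daily_volume_gb(events_per_minute: int, bytes_per_event: int) -> float:
    """Daily volume in decimal gigabytes (1 GB = 10**9 bytes, an assumption)."""
    events_per_day = events_per_minute * 60 * 24
    return events_per_day * bytes_per_event / 1e9


# Lower and upper bound of the daily volume for the stated event sizes.
low_gb, high_gb = (daily_volume_gb(EVENTS_PER_MINUTE, b) for b in EVENT_BYTES_RANGE)

# Raw storage needed for the retention window, before replication or aggregates.
raw_retention_tb = DAILY_VOLUME_GB * RAW_RETENTION_DAYS / 1000

if __name__ == "__main__":
    print(f"events/day: {EVENTS_PER_MINUTE * 60 * 24:,}")
    print(f"daily volume: {low_gb:.0f}-{high_gb:.0f} GB")
    print(f"raw data for {RAW_RETENTION_DAYS} days: {raw_retention_tb:.1f} TB")
```

The per-day event count comes out within rounding of the slide's figure, the stated 150GB per day sits near the upper end of the computed range, and 30 days of raw data is about 4.5 TB before any replication.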
15. Choosing the Right Tools: Example (Description - data)
Access log:
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Error log:
[Sun Mar 7 20:58:27 2004] [info] [client 64.242.88.10] (104)Connection reset by peer: client stopped connection before send body completed
[Sun Mar 7 21:16:17 2004] [error] [client 24.70.56.49] File does not exist: /home/httpd/twiki/view/Main/WebHome
Vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 305416 260688 29160 2356920 2 2 4 1 0 0 6 1 92 2 0
iostat
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
avg-cpu: %user %nice %system %iowait %steal %idle
5.68 0.00 0.52 2.03 0.00 91.76
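Before any of these logs can be aggregated, each line has to be turned into a structured event. The sketch below parses the access-log sample shown above, assuming it follows the Apache Common Log Format; the function names and the dictionary layout are hypothetical, not part of the original project.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
ACCESS_LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line: str) -> dict:
    """Parse one access-log line into a dict; raise ValueError if it does not match."""
    match = ACCESS_LOG_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized access log line: {line!r}")
    event = match.groupdict()
    event["time"] = datetime.strptime(event["time"], "%d/%b/%Y:%H:%M:%S %z")
    event["status"] = int(event["status"])
    # A size of "-" means no body was sent.
    event["size"] = 0 if event["size"] == "-" else int(event["size"])
    return event


def minute_bucket(event: dict) -> str:
    """Key for per-minute aggregation: the timestamp truncated to the minute."""
    return event["time"].strftime("%Y-%m-%d %H:%M")


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
event = parse_access_line(sample)

if __name__ == "__main__":
    print(event["user"], event["status"], event["size"], minute_bucket(event))
```

Grouping events by `minute_bucket` is one simple way to build the per-minute aggregates required by the retention rules.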
20. Choosing the Right Tools: Checklist
1. Fastest random access to the data: Cloudera (Impala).
2. Universal (and fast!) access to data: MapR (MapR FS).
3. Data Integration: Hortonworks (built-in Talend).
4. Never trust papers, always double-check: Proof-of-Concept.
5. Lastly, right-size the cluster and check every element of the chain!
23. It’s Not Necessarily Always Black and White!
• Traditional-relational
• Extended-relational
• Non-relational
• Lambda architecture (Hybrid)
• Data refinery (Hybrid)
24. SoftServe Lambda Architecture Accelerator
• Lambda architecture: a highly scalable and reliable data processing architecture based on Twitter's successful experience in Big Data and Analytics.
• Supports majority of use cases: Real-time analytics, data discovery, and business reports.
• SoftServe’s pre-built Lambda architecture stack shortens the customer’s Time to Market (TTM) by 15-20+ man-months.
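The slide does not spell out the layers. In the usual description of Lambda architecture, a batch layer recomputes views from the full dataset, a speed layer covers events that arrived since the last batch run, and queries merge the two. The sketch below illustrates that idea only; the class and function names and the page-hit data are hypothetical, and it is not SoftServe's accelerator.

```python
from __future__ import annotations

from collections import Counter


def batch_view(master_dataset: list[tuple[str, int]]) -> Counter:
    """Batch layer: recompute totals per page from the full, immutable dataset."""
    view: Counter = Counter()
    for page, hits in master_dataset:
        view[page] += hits
    return view


class SpeedView:
    """Speed layer: incrementally counts events that arrived after the last batch run."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, page: str, hits: int = 1) -> None:
        self.counts[page] += hits


def query(page: str, batch: Counter, speed: SpeedView) -> int:
    """Serving step: merge the batch view with the real-time view."""
    return batch[page] + speed.counts[page]


# Hypothetical data: historical hits processed in batch, plus two fresh events.
batch = batch_view([("/home", 100), ("/about", 10), ("/home", 50)])
speed = SpeedView()
speed.update("/home")
speed.update("/home")

if __name__ == "__main__":
    print(query("/home", batch, speed))
```

The batch view can be slow to rebuild because the speed view fills the gap, which is how the same stack can serve both real-time analytics and business reports.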
25. RDBMS vs NoSQL: Checklist
1. RDBMS: Structured data, moderate velocity and volume (up to TB),
with complex transactions.
2. NoSQL: Unstructured data, high velocity or volume (up to PB+),
with simple transactions.
3. Hybrid, Lambda, Refinery: Something in-between.
27. NoSQL: How Is It Different from RDBMS?
1. Write operations are cheap.
2. Fewer transactions and weaker consistency.
3. Read operations are blazingly fast!
28. NoSQL: Two Main Rules to Remember
1. Spread Data evenly around the cluster.
2. Minimize the number of partitions read.
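The first rule depends on the choice of partition key. The sketch below hashes keys onto a small cluster to show how a high-cardinality key spreads rows evenly while a low-cardinality key piles them onto one node. The MD5-modulo placement, the four-node cluster, and the sample keys are assumptions for illustration; real NoSQL stores use their own partitioners.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_nodes: int) -> int:
    """Map a partition key to a node with a stable hash (MD5 chosen only for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribution(keys, num_nodes: int) -> Counter:
    """Count how many rows land on each node."""
    return Counter(partition_for(k, num_nodes) for k in keys)


# High-cardinality key (hypothetical user ids): rows spread across all nodes.
by_user = distribution((f"user-{i}" for i in range(10_000)), num_nodes=4)

# Low-cardinality key (one hypothetical country code): every row lands on one node.
by_country = distribution(("UA" for _ in range(10_000)), num_nodes=4)

if __name__ == "__main__":
    print("by user:", dict(by_user))
    print("by country:", dict(by_country))
```

A key that spreads data well can still force a query to touch many partitions, so the second rule has to be checked against the actual queries.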
29. RDBMS: Queries Around Model
Q1: People who live in state X.
Q2: People who live in city Y.
Q3: People who live at address Z.
30. NoSQL: Model Around Queries!
Q1: People who live in state X.
Q2: People who live in city Y.
Q3: People who live at address Z.
People_by_States (for Q1)
• Partition / Primary Key: state
• Other columns: country, first_name, last_name, city, street_name1, street_name2, street_number
People_by_City (for Q2)
• Partition / Primary Key: city
• Other columns: country, first_name, last_name, state, street_name1, street_name2, street_number
People_by_FullAddress (for Q3)
• Partition / Primary Key: country, city, state, street_name1
• Other columns: first_name, last_name, street_name2, street_number
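The table-per-query layout above can be imitated with plain dictionaries: each "table" is keyed by its partition key, every write goes to all three, and every read touches exactly one partition. This is a sketch only; the in-memory dictionaries stand in for a real NoSQL store, and the sample people are made up.

```python
from collections import defaultdict

# One "table" per query, each keyed by its own partition key.
people_by_state = defaultdict(list)
people_by_city = defaultdict(list)
people_by_full_address = defaultdict(list)


def add_person(person: dict) -> None:
    """Write the same record to every query table (cheap writes buy fast reads)."""
    people_by_state[person["state"]].append(person)
    people_by_city[person["city"]].append(person)
    address_key = (person["country"], person["city"], person["state"], person["street_name1"])
    people_by_full_address[address_key].append(person)


# Hypothetical sample rows.
add_person({"first_name": "Ann", "last_name": "Lee", "country": "US", "state": "TX",
            "city": "Austin", "street_name1": "Main St", "street_name2": "", "street_number": "10"})
add_person({"first_name": "Bob", "last_name": "Ray", "country": "US", "state": "TX",
            "city": "Dallas", "street_name1": "Elm St", "street_name2": "", "street_number": "7"})

# Each query reads a single partition of a single table.
q1 = people_by_state["TX"]
q2 = people_by_city["Austin"]
q3 = people_by_full_address[("US", "Dallas", "TX", "Elm St")]

if __name__ == "__main__":
    print(len(q1), len(q2), len(q3))
```

The price is three copies of every record, which is the trade-off the checklist on the next slide accepts.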
31. Data Modeling: Checklist
1. In NoSQL, you can have a table for each query, and that is totally OK. Don’t try to save disk space!
(Trade cheap extra writes for the fastest reads.)
2. There are (almost) no secondary indexes in NoSQL, only primary.
3. Pick the right primary (partition) key so that each request reads only one partition.
33. Deployment Defined
In short, deployment is the litmus test for a project: it reveals the project's level of maturity.
And overall project success depends on it.
34. Deployment Stages
1. Bootstrapping: Create VMs and hosts.
2. Provisioning: Install software like Hadoop.
3. Configuration: Initial parameters and data.
4. Validation: Verify installation.
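The four stages run strictly in order, and a later stage makes no sense if an earlier one failed. A minimal sketch of that control flow follows; the stage actions are placeholders (each simply reports success or failure), and `run_deployment` is a hypothetical helper, not part of any deployment tool.

```python
from typing import Callable

# A stage is a name plus an action that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_deployment(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure; return the completed stage names."""
    completed = []
    for name, action in stages:
        if not action():
            break
        completed.append(name)
    return completed


# Placeholder actions; a real pipeline would call provisioning tools here.
stages: list[Stage] = [
    ("bootstrapping", lambda: True),   # create VMs and hosts
    ("provisioning", lambda: True),    # install software like Hadoop
    ("configuration", lambda: True),   # initial parameters and data
    ("validation", lambda: True),      # verify installation
]

if __name__ == "__main__":
    print(run_deployment(stages))
```

Treating validation as a stage in the same pipeline means a broken installation is detected by the deployment itself rather than by the first user.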
38. Service Layout & Memory Allocation
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-clusters-like-a-boss/
39. Automation: Checklist
1. Deployment should be fully automated (Terraform and Ansible).
2. Ensure service layout is correct (master nodes, worker nodes, and edge nodes).
3. Double-check that every node has enough memory
(~64-128GB for master/edge nodes, ~256-512GB for data/worker nodes).
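The memory check in item 3 is easy to automate as part of validation. The sketch below uses the lower bounds of the ranges from this slide as minimums; the cluster layout, node names, and the `undersized_nodes` helper are hypothetical.

```python
# Lower bounds of the ranges on the slide, in GB of RAM per node.
MIN_MEMORY_GB = {"master": 64, "edge": 64, "worker": 256}


def undersized_nodes(nodes: list[dict]) -> list[str]:
    """Return the names of nodes whose memory is below the minimum for their role."""
    return [n["name"] for n in nodes if n["memory_gb"] < MIN_MEMORY_GB[n["role"]]]


# Hypothetical cluster layout for illustration.
cluster = [
    {"name": "master-1", "role": "master", "memory_gb": 128},
    {"name": "edge-1", "role": "edge", "memory_gb": 32},
    {"name": "worker-1", "role": "worker", "memory_gb": 256},
    {"name": "worker-2", "role": "worker", "memory_gb": 128},
]

if __name__ == "__main__":
    print(undersized_nodes(cluster))
```

With this layout the check flags `edge-1` and `worker-2`.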
41. On-Premises
(real hardware somewhere in your building or data center)
1. Highest data privacy (Regulations and sensitive data).
2. Quickest access to data (Latency).
3. Best velocity (Transfer rates).
4. Existing Hardware.
5. Control over resource usage.
43. Hybrid
1. Hybrid: a combination of on-premises and cloud.
2. On-premises: sensitive information and data for high-performance access.
3. Cloud: non-sensitive data.
44. On-Premises vs. Cloud
1. Oracle Exadata: ~$1,000,000.
2. For the same money, the biggest instance in Amazon EC2 (40 CPUs) can run for ~50 years!
45. On-Premises vs. Cloud: Checklist
1. On-premises: If the customer has existing unused hardware, has predictable data volume
growth, or has strong data security requirements.
2. Cloud: If the customer doesn’t have a large budget, is not sure about data & load
growth, and doesn’t have strong security requirements or a team of engineers to
support hardware.
3. Hybrid: Mixture of requirements above.
47. Dedicated Clusters
[Diagram: three separate clusters, one each for the Visualization Service, Data Ingestion Service, and Analytics Service, each on its own group of VMs (VM1-VM8).]
• Configuration and management of 3 separate clusters.
• Resources stay idle if service is not active.
• Need to move data between clusters for each service.
48. Shared Clusters
[Diagram: the Visualization Service, Data Ingestion Service, and Analytics Service running on shared clusters.]
• Multiple clusters ... to maximize utilization.
• Multiple clusters ... to share data between services.
50. Shared Clusters: Mesos/Docker
Maximize utilization & performance:
Deliver more services with smaller footprint.
Shared clusters for all services:
Easier deployment and management with unified service platform.
Shared data between services:
Faster and more competitive services and solutions.
51. How Does this Work?
[Diagram: Mesos cluster layout]
• ZooKeeper quorum coordinating three Mesos Masters.
• Framework schedulers: Spark Service Scheduler and Marathon Service Scheduler.
• Mesos Slave running a Spark Task Executor (Task #1, Task #2) and a Mesos Executor (./python XYZ).
• Mesos Slave running two Docker Executors (java -jar XYZ.jar and ./xyz).
52. How Does this Work?
Mesos provides fine-grained resource isolation.
[Diagram: on a compute node, the Mesos Slave Process runs each executor in its own container (cgroups): the Spark Task Executor with Task #1 and Task #2, and the Mesos Executor with ./python XYZ.]
53. How Does this Work?
Mesos provides scalability.
[Diagram: on the compute node, the Mesos Slave Process runs the Spark Task Executor in a container (cgroups) with Task #1 and Task #2.]
Python executor finished: more resources are available, so Spark runs more tasks (Task #3 and Task #4).
54. How Does this Work?
Mesos has no single point of failure: services keep running if a VM fails!
[Diagram: VM1-VM5 with three Mesos Masters.]
55. How Does this Work?
The master node can fail over: services keep running if a Mesos Master fails!
[Diagram: VM1-VM5 with three Mesos Masters.]
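Master failover through a ZooKeeper quorum works as long as a majority of the nodes in the quorum stay reachable. The arithmetic below shows why the diagrams use three masters; the function names are hypothetical, and the majority rule is the only assumption.

```python
def quorum_size(num_masters: int) -> int:
    """Smallest majority of the masters."""
    return num_masters // 2 + 1


def tolerated_failures(num_masters: int) -> int:
    """How many masters can fail while a majority is still reachable."""
    return num_masters - quorum_size(num_masters)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} masters: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Three masters survive the loss of one. A fourth master adds no extra tolerance, which is why odd counts are the usual choice.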
56. How Does this Work?
The slave process can fail over: tasks keep running if the Mesos Slave Process fails!
[Diagram: compute node with the Mesos Slave Process and a Spark Task Executor running Task #1 through Task #4.]
57. Scalability & Performance: Checklist
1. If you need real scalability, use shared clusters.
2. Shared clusters are a natural fit for the cloud.
3. In most cases, scalability means performance; treat the two as synonyms.
63. Storage Comparison
1. Amazon S3: universal access and cheap, but data needs to be copied before processing.
2. HDFS: compatible with the Hadoop ecosystem, relatively cheap, and data can be processed
where it is stored.
3. Direct-Attached Storage/Network Attached Storage: expensive, fastest access to data,
and data can also be processed where it is stored.
64. Storage: Checklist
1. If you need unified access to data, use a universal cloud file system,
something similar to Amazon S3.
2. For immediate access to data (OLTP system), you need Direct-Attached
Storage (DAS), Network Attached Storage (NAS), Elastic Block
Store (Amazon EBS), and so on.
3. If you choose NoSQL, you’ll need much more space than the actual data takes
(each query might require a duplicate copy of the data).
4. Pick storage carefully and use PoC/Prototyping; otherwise, changing
storage later on will be hard or almost impossible.
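Item 3 can be turned into a rough sizing rule: if every query gets its own full copy of the data and the store replicates each copy, the space needed is the raw size times both factors. The sketch below is a back-of-the-envelope estimate only; the replication factor of 3 and the sample sizes are assumptions, and compression and indexes are ignored.

```python
def nosql_storage_gb(raw_gb: float, query_tables: int, replication_factor: int = 3) -> float:
    """Rough estimate: each query table holds a full copy, and each copy is replicated."""
    return raw_gb * query_tables * replication_factor


# Hypothetical example: 100 GB of raw data modeled as 3 query tables.
estimate_gb = nosql_storage_gb(raw_gb=100, query_tables=3)

if __name__ == "__main__":
    print(f"estimated storage: {estimate_gb:.0f} GB")
```

Under these assumptions 100 GB of data needs about 900 GB of disk, which is why the storage decision has to be made with the data model in hand.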
66. Final Checklist
1. You’re the Boss!
2. You have a right to demand the infrastructure you need.
3. However, you need to have perfect argumentation.
4. Now you have it and know where to get details.
5. Good luck and see you in the field!