Big Data, Big Projects, Big Mistakes: How to Jumpstart and Deliver with Success
1. © ALTOROS Systems | CONFIDENTIAL
Andrei Yurkevich
Chief Technology Officer
andrei.yurkevich@altoros.com
2. © ALTOROS Systems | CONFIDENTIAL 2
• Hadoop/NoSQL performance engineering
• Cluster Automation & Server Templates on Joyent, AWS, SoftLayer, Rackspace, CloudStack and OpenStack using Chef/Puppet, RightScale and SCALR
• 300+ employees globally (UK, USA, Denmark, Switzerland, Norway, Belarus, Argentina)
8. © ALTOROS Systems | CONFIDENTIAL 8
No clear business goals
Big amounts of data from many sources
Architecture design
The variety of tools
Compatibility of technologies/platforms
Lack of professionals
All features in one release
Budget
10. © ALTOROS Systems | CONFIDENTIAL 10
Functional requirements | Value
The amount of data added daily | 2.5 TB
Data type | raw data, processed data
Data storage time, raw data | min. a week
Data storage time, processed data | min. a year
Response time for building reports based on a pre-set template | < 30 sec
Response time for building reports for a custom period of time | < 6 hours
Uptime | 99%
Fault-tolerance | required
Deployment cost per day | < $1,000
Non-functional requirements
• Infrastructure-independent architecture
• Scalability
• Open-source tools
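The volume and retention figures above translate into a capacity and ingest estimate. The following is a back-of-the-envelope sketch only: it assumes decimal units (1 TB = 1,000,000 MB), an evenly spread daily load, and no replication overhead, none of which the slides state. The function names are illustrative.

```python
# Sizing sketch for the requirements above (decimal units: 1 TB = 1,000,000 MB).
DAILY_RAW_TB = 2.5          # from the requirements: data added daily
RAW_RETENTION_DAYS = 7      # raw data is kept for at least a week
SECONDS_PER_DAY = 24 * 60 * 60


def raw_storage_tb(daily_tb: float, retention_days: int) -> float:
    """Raw capacity needed to hold `retention_days` of data, before replication."""
    return daily_tb * retention_days


def sustained_ingest_mb_per_s(daily_tb: float) -> float:
    """Average write rate needed to absorb the daily volume evenly over 24 hours."""
    return daily_tb * 1_000_000 / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"raw storage: {raw_storage_tb(DAILY_RAW_TB, RAW_RETENTION_DAYS):.1f} TB")
    print(f"average ingest: {sustained_ingest_mb_per_s(DAILY_RAW_TB):.1f} MB/s")
```

Real traffic is rarely uniform, so peak ingest would likely be higher than this average.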
11. © ALTOROS Systems | CONFIDENTIAL 11
Feature | Amazon AWS | Joyent | Rackspace
Types of contract | On Demand, Reserved, Spot | On Demand, Reserved | On Demand
Types of instances (classified by compute units) | General Purpose; Compute optimized; Memory optimized; Storage optimized | Standard; High Memory; High CPU; High Storage; High I/O | General Purpose
Storage options | EBS; S3 | Low-cost storage; Network storage based on ZFS | Cloud Block Storage; Cloud Files
Operating systems | Linux, Windows | SmartOS, Linux, Windows | Linux, Windows
Management console | AWS Console | Joyent SmartDataCenter | Cloud Control Panel
Cloud API | Command line interface; Java, .NET, Ruby SDK and API | Command line interface (CLI); Node.js SDK; REST API | REST API
Regions | America, Europe, Asia, Australia | North America, Europe | America, Europe, Asia, Australia
Estimated cost per month | $18,300 | $17,500 | $21,350
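The monthly estimates can be compared against the daily budget from the requirements (deployment cost per day < $1,000). This is a minimal sketch that assumes a 30-day month, which the slides do not specify; the names `daily_cost` and `within_budget` are illustrative.

```python
# Estimated monthly costs from the comparison table (USD).
MONTHLY_COST_USD = {"Amazon AWS": 18_300, "Joyent": 17_500, "Rackspace": 21_350}
DAILY_BUDGET_USD = 1_000   # requirement: deployment cost per day < $1,000
DAYS_PER_MONTH = 30        # assumption: the slides do not state the billing period


def daily_cost(monthly_usd: float, days: int = DAYS_PER_MONTH) -> float:
    """Convert a monthly estimate into an average cost per day."""
    return monthly_usd / days


def within_budget(monthly_usd: float, budget: float = DAILY_BUDGET_USD) -> bool:
    """True if the average daily cost stays strictly under the budget."""
    return daily_cost(monthly_usd) < budget


if __name__ == "__main__":
    for provider, monthly in MONTHLY_COST_USD.items():
        print(f"{provider}: ${daily_cost(monthly):,.2f}/day, "
              f"within budget: {within_budget(monthly)}")
```

Under this assumption all three providers fit the daily budget, so cost alone does not separate them.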
12. © ALTOROS Systems | CONFIDENTIAL 12
a good fit a normal fit a bad fit
Option 2 Option 1
Feature | Amazon AWS | Joyent | Rackspace
Types of contract | On Demand, Reserved, Spot | On Demand, Reserved | On Demand
Types of instances (classified by compute units) | General Purpose; Compute optimized; Memory optimized; Storage optimized | Standard; High Memory; High CPU; High Storage; High I/O | General Purpose
Storage options | EBS; S3 | Low-cost storage; Network storage based on ZFS | Cloud Block Storage; Cloud Files
Operating systems | Linux, Windows | SmartOS, Linux, Windows | Linux, Windows
Management console | AWS Console | Joyent SmartDataCenter | Cloud Control Panel
Cloud API | Command line interface; Java, .NET, Ruby SDK and API | Command line interface (CLI); Node.js SDK; REST API | REST API
Regions | America, Europe, Asia, Australia | North America, Europe | America, Europe, Asia, Australia
Estimated cost per month | $18,300 | $17,500 | $21,350
Score 1.5 3.5
13. © ALTOROS Systems | CONFIDENTIAL 13
Features | HBase | Cassandra | MongoDB | MySQL Cluster
License | Apache | Apache | AGPL | GPL
Protocol | HTTP/REST (also Thrift) | Thrift and custom binary CQL3 | Custom, binary (BSON) | JDBC, ODBC
Data model | Column family | Column family | JSON documents | Tables
Queries / Query Language | JRuby-based (JIRB) shell | Cassandra Query Language | JavaScript expressions | SQL
Partitioning Strategy | Ordered Partitioning | Random Partitioning | Sharding by key | Partition by key
Replication between nodes | yes | yes | yes | yes
Replication between data centers | no | yes | no | yes
Capability to store 2.5 TB daily | yes | yes | yes | yes
Implementation Experience | 1+ | 1+ | 2+ | 5+
Score | 2 | 3 | 2 | 5
Legend: a good fit, a normal fit, a bad fit
14. © ALTOROS Systems | CONFIDENTIAL 14
Features | HBase | Cassandra | MongoDB | MySQL Cluster
License | Apache | Apache | AGPL | GPL
Protocol | HTTP/REST (also Thrift) | Thrift and custom binary CQL3 | Custom, binary (BSON) | JDBC, ODBC
Data model | Column family | Column family | JSON documents | Tables
Queries / Query Language | JRuby-based (JIRB) shell | Cassandra Query Language | JavaScript expressions | SQL
Partitioning Strategy | Ordered Partitioning | Random Partitioning | Sharding by key | Partition by key
Replication between data centers | no | yes | no | yes
Capability to store 2.5 TB daily | yes | yes | yes | yes
Implementation Experience | 1+ | 1+ | 2+ | 5+
Deployment cost per day | $450 | $400 | $500 | $1,500
Score | 2.5 | 4 | 2.5 | 0
Legend: a good fit, a normal fit, a bad fit
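Once deployment cost is added, the budget requirement (< $1,000 per day) acts as a hard filter on the candidates. The sketch below applies that filter and then orders the survivors by the score row of the same table. It does not reproduce how the scores were derived, which the slides do not explain; `Candidate` and `shortlist` are hypothetical names.

```python
from dataclasses import dataclass

DAILY_BUDGET_USD = 1_000  # requirement: deployment cost per day < $1,000


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_day_usd: int  # "Deployment cost per day" row
    score: float           # "Score" row of the same table


CANDIDATES = [
    Candidate("HBase", 450, 2.5),
    Candidate("Cassandra", 400, 4),
    Candidate("MongoDB", 500, 2.5),
    Candidate("MySQL Cluster", 1_500, 0),
]


def shortlist(candidates, budget=DAILY_BUDGET_USD):
    """Keep candidates under the daily budget, best score first."""
    affordable = [c for c in candidates if c.cost_per_day_usd < budget]
    # sorted() is stable, so candidates with equal scores keep their table order.
    return sorted(affordable, key=lambda c: c.score, reverse=True)


if __name__ == "__main__":
    for c in shortlist(CANDIDATES):
        print(f"{c.name}: ${c.cost_per_day_usd}/day, score {c.score}")
```

MySQL Cluster drops out on cost alone, which matches its score of 0 in this round.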
16. © ALTOROS Systems | CONFIDENTIAL 16
Feature | HBase | Cassandra | MongoDB
Replication between data centers | Asynchronous, needs testing | Replicas can span data centers with synchronous replication | Not supported
A cluster admin node | NameNode | Any node | mongos process
Implementation Experience | 1+ | 1+ | 2+
Time spent on inserting 30 MB of data | 7 sec | 9 sec | 20 sec
Deployment cost per day | $450 | $400 | $500
Score | 2 | 2.5 | 0
Legend: a good fit, a normal fit, a bad fit
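The insert timings can be turned into rough throughput figures and compared with the daily volume of 2.5 TB. This is a sketch under strong assumptions that the slides do not confirm: the 30 MB test reflects a single writer, throughput scales linearly with the number of writers, and units are decimal. The helper names are illustrative.

```python
import math

BENCHMARK_MB = 30  # size of the test insert in the table above
INSERT_SECONDS = {"HBase": 7, "Cassandra": 9, "MongoDB": 20}
DAILY_VOLUME_MB = 2.5 * 1_000_000  # 2.5 TB per day, decimal units
SECONDS_PER_DAY = 86_400


def throughput_mb_per_s(seconds: float, size_mb: float = BENCHMARK_MB) -> float:
    """Observed insert rate for the benchmark payload."""
    return size_mb / seconds


def writers_needed(seconds: float) -> int:
    """Parallel writers needed to load one day's data within 24 hours,
    assuming throughput scales linearly with the number of writers."""
    load_seconds = DAILY_VOLUME_MB / throughput_mb_per_s(seconds)
    return math.ceil(load_seconds / SECONDS_PER_DAY)


if __name__ == "__main__":
    for db, secs in INSERT_SECONDS.items():
        print(f"{db}: {throughput_mb_per_s(secs):.2f} MB/s, "
              f"about {writers_needed(secs)} parallel writers")
```

The absolute numbers matter less than the ratio: at these rates no candidate keeps up with a single writer, so parallel ingestion would be part of any design.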
19. © ALTOROS Systems | CONFIDENTIAL 19
A requirement | The prototype features
Storing of 2.5 TB of daily raw data for a week | Capable
Storing of 1.5 TB of processed data for a year | Capable
Response time for building reports based on a pre-set template | ~25 sec
Response time of less than 6 hours for building a custom report | ~7 hours
Scalability | Good
Infrastructure Independence | Yes
Using open-source tools | For all components
Fault-tolerance | Yes
Deployment cost per day < $1,000 | ~$600
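The measurable rows of the prototype results can be checked against their limits mechanically. The sketch below uses the approximate values from the table as if they were exact, which is an assumption; `CHECKS` and `failed_checks` are illustrative names.

```python
# (requirement, limit, measured, unit) taken from the table above.
# Measured values are approximate on the slide ("~") and treated as exact here.
CHECKS = [
    ("Pre-set template report", 30, 25, "sec"),
    ("Custom report", 6, 7, "hours"),
    ("Deployment cost per day", 1_000, 600, "USD"),
]


def failed_checks(checks):
    """Return the names of requirements whose measured value is not below the limit."""
    return [name for name, limit, measured, _unit in checks if not measured < limit]


if __name__ == "__main__":
    failures = failed_checks(CHECKS)
    print("failed:", failures if failures else "none")
```

With these figures, only the custom report misses its target (about 7 hours against a limit of 6).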
20. © ALTOROS Systems | CONFIDENTIAL
Properly visualize and test the functionality
Detect bottlenecks and change a technology/tool/database before it is implemented in the real system
Get a real vision of the final solution
Make sure you stick to the budget
21. © ALTOROS Systems | CONFIDENTIAL 21
Andrei Yurkevich
President/CTO
andrei.yurkevich@altoros.com
Editor's Notes
Volume
Velocity
Variety
Where to start?
Everything seemed to go smoothly. However, there was one detail about MySQL Cluster: its architecture requires keeping all data in RAM, so we would have needed a cluster with 2.5 TB of RAM. The actual deployment cost was about $500 over the budget, so we had to start from scratch again.
HBase was 2 seconds faster than Cassandra, but what about fault tolerance? HBase has an additional node that serves as a coordinator for the entire system. If it fails, the system fails. We could add a secondary management node, but then we might exceed the budget.
Cassandra has a decentralized architecture: all nodes of its cluster have equal roles, and every node can serve as a coordinator. This makes the database extremely fault-tolerant.
Raw data is all data that comes from sensors. Processed data is the data aggregated over each 10-minute interval; it is used for building reports.
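The step from raw to processed data, aggregation over 10-minute intervals, can be sketched as follows. The notes do not say which aggregate is computed or what a reading looks like, so the mean and the `(sensor_id, timestamp, value)` tuple layout are assumptions, as are the function names.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

BUCKET_MINUTES = 10  # processed data is aggregated per 10-minute interval


def bucket_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 10-minute window."""
    return ts - timedelta(
        minutes=ts.minute % BUCKET_MINUTES,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )


def aggregate(readings):
    """Group (sensor_id, timestamp, value) tuples into per-window averages.

    The mean is an assumed aggregate; the source does not name one.
    """
    buckets = defaultdict(list)
    for sensor_id, ts, value in readings:
        buckets[(sensor_id, bucket_start(ts))].append(value)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    sample = [
        ("s1", datetime(2014, 1, 1, 12, 3, 0), 10.0),
        ("s1", datetime(2014, 1, 1, 12, 9, 59), 20.0),
        ("s1", datetime(2014, 1, 1, 12, 10, 0), 30.0),
    ]
    for (sensor, window), avg in sorted(aggregate(sample).items()):
        print(sensor, window.isoformat(), avg)
```

Keying each result by sensor and window start would let a report for a pre-set period read a small number of pre-aggregated rows instead of scanning raw readings.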