Why Migrate from MySQL to Cassandra?



           WHITE PAPER




           By DataStax Corporation
           July 2012
Contents
Introduction
Why Stay with MySQL?
Why Migrate from MySQL?
    Architectural Limitations
    Data Model Limitations
    Scalability and Performance Limitations
Why Migrate to Cassandra and DataStax Enterprise?
    A Technical Overview of Cassandra
    Cassandra vs. Other NoSQL Solutions
    Who’s Using Cassandra?
    A Quick Look at DataStax
    DataStax Enterprise – The Choice for Production Big Data Deployments
    What About Cost?
How to Migrate from MySQL to Cassandra
    Using Sqoop to Migrate from MySQL
    Using Pentaho Kettle to Migrate from MySQL
Examples of Customers Who Have Switched from MySQL
    Mahalo
    Pantheon Systems
    Formspring
Conclusion
About DataStax
Appendix A – FAQ on Switching from MySQL to DataStax Enterprise/Cassandra
Appendix B – Comparing MySQL and DataStax Enterprise/Cassandra
    General Comparisons
    Datatype Comparisons




© 2012 DataStax. All rights reserved.                                                                                                                2
Introduction
First released in 1995, MySQL has long been a de facto infrastructure component of applications that
target the Web. The database component of the Internet LAMP stack, MySQL has enjoyed wide
success in adoption. This is for good reason: MySQL provides solid relational database
management system (RDBMS) capabilities in an open source package that enables companies
to build systems that perform well from a database perspective in many general purpose use
cases.

MySQL was acquired by Sun Microsystems in 2008, and was formally acquired by Oracle in
January 2010 (via the company’s acquisition of Sun Microsystems). Now part of Oracle’s stable
of database products, MySQL continues to be promoted and sold through Oracle.

In addition to Oracle, a number of other vendors have either forked MySQL to create another
database offering or are using MySQL as part of a specialized service offering. Examples include
Percona, Monty Program AB, Infobright, Calpont, and Amazon RDS.

While Oracle’s MySQL remains a good RDBMS that performs well for the use cases it was
designed for, even its strongest supporters admit that it is not architected to tackle the new wave
of big data applications being developed today. In fact, in the same way that the needs of late
20th century web companies helped give birth to MySQL and drive its success, modern
businesses today that need to manage big data use cases are helping to forge a different set of
technologies that are replacing MySQL in many situations.

This paper examines the whys and hows of migrating from Oracle’s MySQL to these new big
data technologies, such as Apache Cassandra™ and Apache Hadoop™. It also looks at the
benefits of moving from MySQL to a fully integrated data stack that combines those technologies
and others together into a single big data platform like that found in DataStax Enterprise.




Why Stay with MySQL?
Before continuing with a discussion on why it might be necessary to either develop new
applications or migrate existing MySQL systems to another database platform, it first makes
sense to understand why staying with MySQL as a database service may be the right choice.
Because database migrations in particular can be resource-intensive, IT professionals should
ensure they are making the right decision before making such a move.

Oracle’s MySQL and MySQL Cluster are good tools that satisfy a wide range of use cases. In
general, IT managers should consider staying with MySQL if either their existing system or
planned new application exhibits the following characteristics and needs:

•   ACID-compliant transactions, with savepoints, commits/rollbacks, and full referential
    integrity required.
•   A very normalized data model that is well served by the Codd-Date relational design, and one
    where join operations cannot be avoided.
•   Data is primarily structured with little to no unstructured or semi-structured data being
    present.
•   Low to moderate data volumes that can be handled easily by the MySQL optimizer.
•   Telco applications that require use of main memory solutions and whose data is primarily
    accessed via primary keys.
•   Scale out architectures that are primarily read in nature, with no need to write to multiple
    masters or servers that exist in many different cloud zones or geographies.
•   No requirement for a single database/cluster to span many different data centers.
•   High availability requirements can be accomplished via a synchronous replication
    architecture that is primarily maintained at a single data center.

If an application presents these and similar requirements, then Oracle’s MySQL may indeed be a
good fit as a database platform.


Why Migrate from MySQL?
One primary reason why various IT organizations are either already migrating away from Oracle’s
MySQL or planning to do so is the very visible rise of big data applications – and specifically, big
data online transaction processing (OLTP) applications. These companies either have existing
systems morphing into big data systems, or they are planning new applications that are big data
in nature and need something “more” than MySQL for their database platform.

Although the top industry IT analyst groups may disagree on various technical and marketplace
trends, they are, remarkably, in agreement on the following three things: (1) the upsurge of big
data applications; (2) the definition of what constitutes big data, and (3) the need for new
technology to deal with big data.




The momentum and growth of big data applications are unmistakable. Underscoring this is a recent
survey of 600 IT professionals that revealed nearly 70 percent of organizations are now
considering, planning, or running big data projects.¹

As for a definition of big data, it is nearly universally agreed that big data involves one or more of
the following:

1. Velocity – data coming in at extreme rates of speed.
2. Variety – the types of data needing to be captured (structured, semi-structured, and
   unstructured).
3. Volume – sizes that potentially involve terabytes to petabytes of data.
4. Complexity – involves everything from moving operational data into big data platforms and
   maintaining various data “silos” to the difficulty of managing data across multiple sites and
   geographies.

Although they may phrase it differently, top analyst groups and market observers agree that to
tackle big data applications that have the above characteristics, something more than standard
RDBMSs, like Oracle’s MySQL, is needed.

For example, David Kellogg says big data is “too big to be reasonably handled by
current/traditional technologies.”² Consulting and research firm McKinsey & Company agrees with
Kellogg’s concept of big data and defines it as “datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, and analyze.”³

IDC says: “Big data technologies describe a new generation of technologies and architectures,
designed to economically extract value from very large volumes of a wide variety of data, by
enabling high-velocity capture, discovery, and/or analysis.”⁴

And finally, O’Reilly defines big data in this way: “Big data is data that exceeds the processing
capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain value from this data, you must choose an
alternative way to process it.”⁵

¹ “Survey: 70 Percent of Organizations Have Big Plans for Big Data,” by Pedro Hernandez, EnterpriseAppsToday.com, May 14, 2012: http://www.enterpriseappstoday.com/data-management/survey-70-percent-enterprises-big-plans-for-big-data.html.
² “‘Big data’ has jumped the shark,” DBMS2, September 11, 2011: http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/.
³ Big Data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, May 2011, p. 11: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
⁴ Extracting Value from Chaos, by John Gantz and David Reinsel, IDC, June 2011: http://idcdocserv.com/1142.
⁵ “What Is Big Data? An Introduction to the Big Data Landscape,” by Edd Dumbill, O’Reilly Radar, January 11, 2012: http://radar.oreilly.com/2012/01/what-is-big-data.html.

In particular, big data applications are nudging MySQL aside for other options. While MySQL
initially gained its popularity through use of the MyISAM storage engine, the InnoDB transactional
engine is arguably the most used today. But InnoDB isn’t designed to handle the types of big data
requirements discussed above.

What types of limitations, bottlenecks, and issues are MySQL users experiencing? Although the
exact situations vary, a few of the most prevalent reasons that cause a move from MySQL are as
follows:


Architectural Limitations
One reason modern businesses are switching from Oracle’s MySQL to big data platforms is
that the underlying architecture does not support key big data use cases. This is true
regardless of which MySQL products are being considered – MySQL Community/Enterprise,
MySQL Cluster, or database services such as Amazon RDS.

Some of the architectural issues that arise when MySQL is thrown into big data situations include:

•   The traditional master-slave architecture of MySQL (one write master with 1-n slaves)
    prohibits “location independent” or “read/write anywhere” use cases that are very common in
    big data environments where a database cluster is spread out throughout many different
    geographies and data centers (and the cloud), with each node needing to support both reads
    and writes.
•   The necessity to manually shard (i.e., partition) general MySQL systems to overcome various
    performance shortcomings becomes a very time-consuming, error-prone, and expensive
    proposition to support. It also places a heavy burden on development staff to support
    sharding logic in the application.
•   Failover and failback situations tend to require manual intervention with generally replicated
    MySQL systems. Failback can be especially challenging.
•   Although it provides automatic sharding and supports simple geographic replication, MySQL
    Cluster’s dependence on synchronous replication can cause latency and transactional
    response time issues. Further, its geographic replication does not support multiple (i.e., >2)
    data centers in a way that either performs well or is easy to manage.
•   Database services such as Amazon’s RDS suffer from the same shortfalls above, as Amazon
    only supports either a simple standby server that is maintained in a different availability zone
    in Amazon’s cloud, or a series of read replicas that are provisioned and used to help service
    increased query (not write) traffic.
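
To illustrate the sharding burden described in the list above, here is a minimal Python sketch of the routing logic an application would have to carry itself. The shard host names, the MD5-modulo placement rule, and the helper names are assumptions made for illustration; the paper does not prescribe any particular sharding scheme.

```python
import hashlib

# Hypothetical shard list: these host names are illustrative only.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def moved_keys(keys, old_shards, new_shards):
    """Return the keys whose shard changes when the shard list changes."""
    return [
        key
        for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    ]
```

Under this assumed modulo scheme, growing the list from four to five shards changes the placement of most keys, which hints at why manual resharding tends to be time-consuming and error-prone.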


Data Model Limitations
A major reason many businesses are moving to NoSQL-based solutions is that the legacy
RDBMS data model is not flexible enough to handle big data use cases that contain a mixture of
structured, semi-structured, and unstructured data. While MySQL has good datatype support for
traditional RDBMS situations that deal with structured data, it lacks the dynamic data model
necessary to tackle high-velocity data coming in from machine-generated systems or time series
applications, as well as cases needing to manage semi-structured and unstructured data.


Recently, Oracle announced it had introduced a NoSQL-type interface into its MySQL Cluster
product that is key/value in design. While certainly helpful in some situations, such a design still
falls short in key big data use cases like time series applications that require inserting data into
structures that support tens of thousands of columns.
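
The time series case mentioned above can be pictured with a small in-memory model. The sketch below is a conceptual illustration only, not Cassandra code: it assumes a common layout in which one row per sensor and day holds one column per reading, with column names kept in sorted order so a time range can be sliced out. The class and function names (`WideRow`, `record`) and the sensor identifiers are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort


class WideRow:
    """One row holding many timestamp-named columns, sorted by column name."""

    def __init__(self):
        self._names = []   # sorted column names (timestamps)
        self._values = {}  # column name -> value

    def insert(self, timestamp, value):
        """Add or overwrite the column named by this timestamp."""
        if timestamp not in self._values:
            insort(self._names, timestamp)
        self._values[timestamp] = value

    def slice(self, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


# Row key is (sensor id, day); each row can grow to many thousands of columns.
rows = {}


def record(sensor_id, day, timestamp, value):
    """Append one reading to the wide row for this sensor and day."""
    rows.setdefault((sensor_id, day), WideRow()).insert(timestamp, value)
```

For example, after recording readings at timestamps 30, 10, and 20 for the same sensor and day, `rows[(sensor_id, day)].slice(10, 20)` returns the two earliest readings in time order, regardless of the order in which they arrived.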


Scalability and Performance Limitations
Oracle’s MySQL has long been touted as a scale-out database. However, those who know and
use MySQL admit it has limitations that negate its use in big data situations where scalability is
required. For example:

•   More servers can be added to a general MySQL Community/Enterprise cluster to help
    service more reads, but writes are still bottlenecked via the main write master server.
    Moreover, if many read slave servers are required, latency issues can arise in the process of
    simply getting the data from the master server to all the slaves.
•   Consumption of high-velocity data can be challenging, especially if the InnoDB storage
    engine is used, as the index-organized structure often does not handle high insert rates well.
    Third-party storage engine vendors, which are columnar in nature, typically cannot help in
    this case, as they rely on their proprietary high-speed loaders to load data quickly into a
    database.
•   Data volumes over half a terabyte become a real challenge for the MySQL optimizer. To
    overcome this, a third-party storage engine vendor such as Calpont or Infobright must be
    used – but these vendors have limitations either in their SQL support, MPP capabilities, or
    both.


Why Migrate to Cassandra and DataStax Enterprise?
While a move from Oracle’s MySQL may be necessary because of its inability to handle key big
data use cases, why should that move involve a switch to Apache Cassandra and DataStax
Enterprise?

The sections that follow describe why a move to Cassandra and DataStax Enterprise makes both
technical and business sense for MySQL users seeking alternatives.


A Technical Overview of Cassandra
Apache Cassandra, an Apache Software Foundation project, is an open
source NoSQL distributed database management system. Cassandra is
designed to handle big data workloads across multiple data centers with no
single point of failure, providing enterprises with continuous availability
without compromising performance.




In selecting an alternative to Oracle’s MySQL, IT professionals will find Apache Cassandra is a
standout among other NoSQL offerings for the following technical reasons:

•   Massively scalable architecture – Cassandra’s masterless, peer-to-peer architecture
    overcomes the limitations of master-slave designs and allows for both high availability and
    massive scalability. Cassandra is the acknowledged NoSQL leader when it comes to
    comfortably scaling to terabytes or petabytes of data, while maintaining industry-leading write
    and read performance.
•   Linear scale performance – Nodes added to a Cassandra cluster (all done online) increase
    the throughput of a database in a predictable, linear fashion for both read and write
    operations – even in the cloud, where such predictability can be difficult to ensure.
•   Continuous availability – Data is replicated to multiple nodes in a Cassandra database
    cluster to protect from loss during node failure and to provide continuous availability with no
    downtime.
•   Transparent fault detection and recovery – Cassandra clusters can grow into the
    hundreds or thousands of nodes. Because Cassandra was designed for commodity servers,
    machine failure is expected. Cassandra utilizes gossip protocols to detect machine failure
    and recover when a machine is brought back into the cluster – all without the application
    noticing.
•   Flexible, dynamic schema data modeling – Cassandra offers the organization of a
    traditional RDBMS table layout combined with the flexibility and power of no stringent
    structure requirements. This allows data to be stored dynamically as needed without
    performance penalty for changes that occur. In addition, Cassandra can store structured,
    semi-structured, and unstructured data.
•   Guaranteed data safety – Cassandra delivers high write performance through its
    append-only commit log while preserving durability, so users no longer need to trade off
    durability to keep up with immense write streams. Combined with replication to multiple
    nodes, this guards against data loss when individual nodes fail.
•   Distributed, location independence design – Cassandra’s architecture avoids the hot
    spots and read/write issues found in master-slave designs. Users can have a highly
    distributed database (e.g., multi-geography, multi-data center) and read or write to any node
    in a cluster without concern over which node is being accessed.
•   Tunable data consistency – Cassandra offers flexible data consistency on a cluster, data
    center, or individual I/O operation basis. Very strong or eventual data consistency among all
    participating nodes can be set globally and also controlled on a per-operation basis (e.g., per
    INSERT, per UPDATE).
•   Multi-data center replication – Whether it’s keeping data in multiple locations for disaster
    recovery scenarios or locating data physically near its end users for fast performance,
    Cassandra offers support for multiple data centers. Administrators simply configure how
    many copies of the data they want in each data center, and Cassandra handles the rest –
    replicating the data automatically. Cassandra is also rack-aware and can keep replicas of
    data stored on different physical racks, which helps ensure uptime in the case of single rack
    failures.
•   Cloud-enabled – Cassandra’s architecture maximizes the benefits of running in the cloud.
    Also, Cassandra allows for hybrid data distribution where some data can be kept on-premise
    and some in the cloud.
•   Data compression – Cassandra supplies built-in data compression, with up to an 80 percent
    reduction in raw data footprint. More importantly, Cassandra’s compression results in no
    performance penalty, with some use cases showing actual read/write operations speeding up
    due to less physical I/O being required.
•   CQL (Cassandra Query Language) – Cassandra provides a SQL-like language called CQL
    that mirrors SQL’s DDL, DML, and SELECT syntax. CQL greatly decreases the learning
    curve for those coming from RDBMS systems because they can use familiar syntax for all
    object creation and data access operations.
•   No caching layer required – Cassandra offers caching on each of its nodes. Coupled with
    Cassandra’s scalability characteristics, nodes can be incrementally added to the cluster to
    keep as much data in memory as needed. The result is that there is no need for a separate
    caching layer.
•   No special hardware needed – Cassandra runs on commodity machines and requires no
    expensive or special hardware.
•   Incremental and elastic expansion – The Cassandra ring allows online node additions.
    Because of Cassandra’s fully distributed architecture, every node type is the same, which
    means clusters can grow as needed without any complex architecture decisions.
•   Simple install and setup – Cassandra can be downloaded and installed in minutes, even for
    multi-cluster installs.
•   Ready for developers – Cassandra has drivers and client libraries for all the popular
    development languages (e.g., Java, Python).
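
The tunable consistency item in the list above can be made concrete with a little arithmetic. The sketch below assumes the commonly cited overlap rule (a read is guaranteed to reach at least one replica holding the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor) and the usual quorum size of floor(RF / 2) + 1. The function names are hypothetical, and this is plain arithmetic rather than Cassandra client code.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


# Assumed example: three copies of each row.
RF = 3
QUORUM = quorum(RF)  # 2 of 3 replicas

# Quorum reads combined with quorum writes always overlap on some replica.
strong = reads_see_latest_write(QUORUM, QUORUM, RF)

# Reading one replica after writing one replica may miss the newest value,
# which corresponds to eventual consistency.
eventual = not reads_see_latest_write(1, 1, RF)
```

The trade-off is that smaller read and write sets respond faster and tolerate more unavailable replicas, while larger ones give stronger guarantees; choosing per operation lets an application mix both.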

Given these technical features and benefits, the following are typical big data use cases handled
well by Cassandra in the enterprise:

•   Big data situations
•   Time series data management
•   High-velocity device data ingestion and analysis
•   Healthcare system input and analysis
•   Media streaming management (e.g., music, movies)
•   Social media (i.e., unstructured data) input and analysis
•   Online web retail (e.g., shopping carts, user transactions)
•   Real-time data analytics
•   Online gaming (e.g., real-time messaging)
•   Software as a Service (SaaS) applications that utilize web services
•   Write-intensive systems



Cassandra vs. Other NoSQL Solutions
What does the performance of Cassandra look like compared to other NoSQL options? While each
use case is different, external benchmarks such as the YCSB test⁶ show Cassandra to outperform
its rivals in a number of situations.

This particular benchmark shows Cassandra delivering nearly 4x the write performance, 2x the
read performance, and better than 12x overall performance in a mixed workload use case over
another leading NoSQL provider.


Who’s Using Cassandra?
One benefit MySQL users have enjoyed is a large community of users who have deployed the
database in many production environments. Cassandra, likewise, is used in many industries for
modern applications that need scale, fast performance, data flexibility, and easy data distribution.
Below is a snapshot of some of the companies and organizations that use Cassandra in
production. (Note: There are many other household name companies with production
implementations that cannot be published due to NDA restrictions.)




⁶ “NoSQL Benchmarking,” CUBRID: http://blog.cubrid.org/dev-platform/nosql-benchmarking/?utm_source=NoSQL+Weekly+List&utm_campaign=143fae86b2-NoSQL_Weekly_Issue_41_September_8_2011&utm_medium=email.


A Quick Look at DataStax
A viable company behind an open source offering is vital for enterprises that want the
benefits of open source software but also need the professional-grade capabilities
supplied by proprietary software offerings.




DataStax is the leading provider of modern enterprise database software products and services
based on Apache Cassandra. It employs the Apache chair of Cassandra as well as most of the
project’s committers. At the time of this writing, DataStax has nearly 50 employees and over 170
customers, though both of these statistics are growing rapidly.

DataStax provides free and open source NoSQL products as well as commercial solutions aimed
at production big data environments, with its flagship solution being DataStax Enterprise.


DataStax Enterprise – The Choice for Production Big Data Deployments
Apache Cassandra can be likened to MySQL’s
community server in that both are free and open source.
By contrast, DataStax Enterprise is more akin to MySQL
Enterprise or MySQL Cluster’s Carrier Grade editions in
that it is designed for production deployments that power
key business systems.

DataStax Enterprise is tailor-made to manage big data effectively. The solution inherits
Cassandra’s entire, powerful feature set for servicing modern big data operational applications,
and smartly integrates a fault-tolerant analytics platform that provides Hadoop™ MapReduce,
Hive, Pig, Mahout, and Sqoop support for business intelligence systems. It also includes
enterprise search capabilities via Apache Solr™, which is the most popular open source search
software in use today.

A key differentiator of DataStax Enterprise over other big data providers is that real-time, analytic,
and search workloads are intelligently isolated across a distributed DataStax Enterprise database
cluster, so that no competition for underlying compute resources or data occurs.

DataStax Enterprise comprises three components:

1. The DataStax Enterprise Server – built on Apache Cassandra, the server manages real-
    time data with Cassandra, analytic data with Hadoop, and enterprise search data with
    Apache Solr.
2. OpsCenter Enterprise – a visual, browser-based solution for managing and monitoring
    Cassandra and the DataStax Enterprise server.
3. Production Support – full 24x7x365 support from the big data experts at DataStax.


With Hadoop and Solr, the types of use cases that can be tackled with DataStax Enterprise
expand well beyond those previously covered with Cassandra alone and include:

•   Social media input and analysis
•   Web clickstream analysis
•   Buyer event and behavior analytics
•   Fraud detection and analysis
•   Risk analysis and management
•   Supply chain analytics
•   Web product searches
•   Internal document search (e.g., law firms)
•   Real estate/property searches
•   Social media matchups
•   Web and application log management/analysis




What About Cost?
There are many technical benefits in moving from Oracle’s MySQL to Cassandra and DataStax
Enterprise, but what about cost? How does DataStax compare with Oracle in that regard?

As of May 2012, Oracle’s MySQL products were priced by subscription, per server/socket, with
the following list prices:⁷

Product                                                                   List Price
MySQL Standard Edition Subscription (1-4 socket server)                       $2,000
MySQL Standard Edition Subscription (5+ socket server)                        $4,000
MySQL Enterprise Edition Subscription (1-4 socket server)                     $5,000
MySQL Enterprise Edition Subscription (5+ socket server)                     $10,000
MySQL Cluster Carrier Grade Edition Subscription (1-4 socket server)         $10,000
MySQL Cluster Carrier Grade Edition Subscription (5+ socket server)          $20,000



For feature differences in the MySQL products listed above, see http://www.mysql.com/products/.

Like the MySQL Community Edition, DataStax provides a free edition of Apache Cassandra (the
DataStax Community Edition), which comes with the latest version of Cassandra, a free version
of DataStax’s OpsCenter management and monitoring tool, and quick-start developer aids.

From a production, enterprise perspective, a subscription to DataStax Enterprise Edition may be
compared to the MySQL Enterprise and MySQL Cluster Carrier Grade Edition subscriptions.
DataStax does not currently price by socket – only by server, with pricing either being on par with
MySQL Enterprise for small boxes or, with beefier machines, substantially less (e.g., 50 to 70
percent less).


How to Migrate from MySQL to Cassandra
The first step in migrating from MySQL to Cassandra is to understand that data modeling is
handled differently in NoSQL solutions vs. RDBMSs. In traditional databases such as MySQL,
data is modeled in standard “third normal form” design without the need to know what questions
will be asked of the data. By contrast, in NoSQL, the questions asked of the data are what drive
the data model design and the data is highly denormalized.
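
The contrast between the two modeling styles can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not Cassandra or MySQL code: the `users` and `posts` sample data, the question being asked ("which posts did this user write?"), and all function names are hypothetical.

```python
from collections import defaultdict

# Normalized, relational-style data: two "tables" linked by user_id.
users = {1: {"name": "alice"}, 2: {"name": "bob"}}
posts = [
    {"post_id": 10, "user_id": 1, "title": "Hello"},
    {"post_id": 11, "user_id": 2, "title": "Hi"},
    {"post_id": 12, "user_id": 1, "title": "Again"},
]


def titles_by_user_join(name):
    """Relational style: answer the question at read time with a join."""
    user_ids = {uid for uid, user in users.items() if user["name"] == name}
    return [post["title"] for post in posts if post["user_id"] in user_ids]


def build_posts_by_user(users, posts):
    """Query-first style: write one denormalized row per user name,
    shaped for exactly the question the application will ask."""
    rows = defaultdict(dict)
    for post in posts:
        name = users[post["user_id"]]["name"]
        rows[name][post["post_id"]] = post["title"]  # column name -> value
    return dict(rows)


# A read is now a single lookup by row key, with no join.
posts_by_user = build_posts_by_user(users, posts)
```

The design choice is a trade: the denormalized layout duplicates the user name into the post data and must be maintained at write time, in exchange for reads that touch a single row.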




⁷ “MySQL Global Price List, Software Investment Guide,” Oracle, November 1, 2010: http://www.oracle.com/us/corporate/pricing/price-lists/mysql-pricelist-183985.pdf.


If a developer is simply interested in porting MySQL schema and data over into a Cassandra
keyspace (analogous to a database in MySQL) just for testing or preliminary development
purposes, there are two primary options:

1. Use the Sqoop interface to move MySQL tables and data.
2. Use ETL (extract-transform-load) tools such as Pentaho’s Kettle to move schema and data.


Using Sqoop to Migrate from MySQL
DataStax Enterprise supports Sqoop, which is a utility designed to transfer data between an
RDBMS and Hadoop. Given that DataStax Enterprise combines Cassandra, Hadoop, and Solr
together into one big data platform, a developer can move data not only to a Hadoop system with
Sqoop, but also to Cassandra.

The DataStax Enterprise installation package includes a sample/demo of how to move MySQL
schema and data into Cassandra. Because Sqoop works via Java Database Connectivity
(JDBC), the only prerequisite is that the JDBC driver for MySQL must be downloaded from the
MySQL website and placed in a directory where Sqoop has access to it (the /sqoop subdirectory
of the main DataStax Enterprise installation is recommended).

Each MySQL table is mapped to a Cassandra column family. Column families are a structure
derived from Google Bigtable, with rows and columns like MySQL but much more dynamic and
flexible. The migration is done via a command line utility that accepts a number of different
parameters.

For example, the following command migrates a table from the dev database on a MySQL server at
IP address 127.0.0.1. It connects to MySQL as root with no password, migrates the table npa_nxx
into a Cassandra keyspace named dev as a column family named npa_nxx_cf, identifies the MySQL
column npa_nxx_key as the primary key, names the Cassandra server’s host IP, and lastly, asks for
the schema to be created before the data is imported:

./dse sqoop import --connect jdbc:mysql://127.0.0.1/dev \
        --username root \
        --table npa_nxx \
        --cassandra-keyspace dev \
        --cassandra-column-family npa_nxx_cf \
        --cassandra-row-key npa_nxx_key \
        --cassandra-thrift-host 127.0.0.1 \
        --cassandra-create-schema




Figure 1 – Using Sqoop to move data from MySQL to DataStax Enterprise
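To make the table-to-column-family mapping concrete, here is a conceptual Python sketch. It is not how Sqoop is implemented; it only shows the shape of the result. The function name and the `state` and `city` columns are invented for illustration, and only `npa_nxx_key` comes from the example above.

```python
def to_column_family_row(mysql_row, row_key_column):
    """Split one relational row into (row key, {column name: value}).

    Columns whose value is None are skipped: a column family row stores
    only the columns it actually has.
    """
    row_key = mysql_row[row_key_column]
    columns = {
        name: value
        for name, value in mysql_row.items()
        if name != row_key_column and value is not None
    }
    return row_key, columns


# Hypothetical npa_nxx rows; only npa_nxx_key is named in the text above.
mysql_rows = [
    {"npa_nxx_key": 201200, "state": "NJ", "city": "Jersey City"},
    {"npa_nxx_key": 212555, "state": "NY", "city": None},
]

npa_nxx_cf = dict(to_column_family_row(r, "npa_nxx_key") for r in mysql_rows)
```

The second row ends up with fewer columns than the first, which is one sense in which a column family is more flexible than a fixed-width relational row.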


Using Pentaho Kettle to Migrate from MySQL
Another way to migrate MySQL tables and data to Cassandra is to use one of the ETL tools
on the market, such as Pentaho’s Data Integration product, also known as Kettle. Pentaho makes
two editions of its ETL tool available: a free community edition and a paid enterprise edition.
For core ETL tasks such as moving MySQL schema and data to Cassandra, the community
edition should provide everything that is needed.




Figure 2 – Kettle’s visual interface for performing ETL operations




Pentaho’s Kettle product provides an easy-to-use graphical user interface (GUI) that allows
developers to visually design their MySQL migration tasks. Unlike the Sqoop utility, which just
does extract-load, Pentaho’s product allows developers to create simple to sophisticated
transformation routines to customize how a MySQL schema and data are moved to Cassandra.
In addition, the data movement engine of Kettle is quite efficient, so medium to semi-large data
volumes can be moved in a high-performance manner.
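The difference between extract-load and extract-transform-load can be sketched in a few lines of Python. This is not Kettle code (Kettle jobs are designed visually). The row layout, the quality rule, and the `user:` key prefix are all assumptions made up for the example.

```python
def extract(rows):
    """Stand-in for reading rows from a MySQL table."""
    yield from rows


def transform(rows):
    """The step that extract-load alone lacks: clean and reshape each row."""
    for row in rows:
        if row["email"] is None:
            continue  # drop rows that fail a made-up quality rule
        yield {
            "user_key": f"user:{row['id']}",        # derive a row key
            "email": row["email"].strip().lower(),  # normalize a value
        }


def load(rows, target):
    """Stand-in for writing rows to a Cassandra column family."""
    for row in rows:
        target[row["user_key"]] = {"email": row["email"]}
    return target


source_rows = [
    {"id": 1, "email": " Ann@Example.com "},
    {"id": 2, "email": None},
]
users_cf = load(transform(extract(source_rows)), {})
```

Because each stage is a generator, rows stream through one at a time instead of being held in memory as a whole.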

More information about Kettle and free downloads can be found at: http://kettle.pentaho.com/.


Examples of Customers Who Have Switched from MySQL
Whether it’s by using Sqoop, ETL tools, or other in-house-developed options, a move from
Oracle’s MySQL to big data platforms like Cassandra and DataStax Enterprise is not difficult.
DataStax has many customers who have made just such a switch but do not discuss it outside
their organizations. However, the following customers were kind enough to let us share their
use cases publicly.




Mahalo
Mahalo is a social media learning business that has a top 200 web ranking and experiences 12
million visits per month. Mahalo needed a database provider to manage the activity log for every
single customer interaction as well as store information for their Q&A topic database.

Mahalo started using Oracle’s MySQL. However, performance and availability issues
necessitated a move to a database that could scale and handle their heavy write workload.
Further, the company needed a more dynamic data model to store the variety of data that was
coming in.

Mahalo chose Cassandra and DataStax as their MySQL replacement. Mahalo’s chief technology
officer (CTO), Jason Burch, says: “With the Cassandra conversion completed and running
smoothly, we’re now free to focus on our primary mission, knowing that we’ll be able to deliver the
excellent responsiveness and capabilities that our user community has come to expect.”




Pantheon Systems
Pantheon Systems provides a cloud-based web development platform for websites made with
Drupal. Pantheon needed a primary database platform that would hold all metadata information
that supports their main application platform and also all of their media storage.

Pantheon began with MySQL, but found it was unable to handle its requirements of managing
both structured and unstructured data. Moreover, MySQL could not scale or support Pantheon’s
need for continuous availability across multiple data centers.

Pantheon switched to Cassandra and DataStax, which met all of their specific requirements.
David Strauss, Pantheon’s CTO, describes the company’s use of Cassandra this way: “All the
actual platform data in Pantheon is persisted primarily to Cassandra. We could wipe out pretty
much everything on Pantheon, but as long as the Cassandra store is there, we have our data.”




Formspring
Formspring is a social media provider that serves as a group knowledge interaction site.
Formspring has over 26 million registered members worldwide who provide more than 10 million
responses daily, and the company receives over 30 million unique visitors per month.

Formspring’s initial database platform of MySQL and Amazon’s SimpleDB could not scale, so the
company turned to DataStax and Cassandra instead. Cassandra proved to be the only solution
Formspring evaluated that could produce the response times and continuous availability that the
company’s system needed.

Kyle Ambroff, senior engineer at Formspring, says: “I can safely say that if we had tried to
implement some of Formspring’s new features without using Cassandra, we would’ve had to
double the size of our operations staff just because we’d be adding more single points of failure.”




Conclusion
There is no argument that Oracle’s MySQL is a good RDBMS – and one that well serves the use
cases for which it was originally designed. But for IT professionals who are either planning new
big data applications or have existing MySQL systems that have begun to break down under big
data workloads, a move to DataStax Enterprise and Cassandra makes both business and
technical sense.

Switching to a modern, big data platform like DataStax Enterprise will future-proof any
application, and provides confidence that the system will scale and perform well both now and
into a demanding future.

For more information on DataStax Enterprise and Cassandra, visit www.datastax.com.

For downloads of DataStax Enterprise – which may be freely used for development purposes –
visit http://www.datastax.com/download/enterprise.


About DataStax
DataStax offers products and services, based on the popular open source database Apache
Cassandra™, that solve today’s most challenging big data problems. DataStax Enterprise
combines the performance of Cassandra with analytics powered by Apache Hadoop and
enterprise search with Apache Solr, creating a smartly integrated, big data platform. With
DataStax Enterprise, real-time, analytic, and search workloads never conflict, giving you
maximum performance with the added benefit of only managing a single database.

The company has over 170 customers, including leaders such as Netflix, Disney, Cisco,
Rackspace, and Constant Contact, and spans verticals including web, financial services,
telecommunications, logistics and government. DataStax is backed by industry-leading investors,
including Lightspeed Venture Partners and Crosslink Capital, and is based in San Mateo, CA.

For more information, visit www.datastax.com.




Appendix A – FAQ on Switching from MySQL to DataStax Enterprise/Cassandra
This appendix supplies answers to frequently asked questions about migrating from Oracle’s
MySQL to DataStax Enterprise/Cassandra.

Do I lose transaction support when moving from MySQL to Cassandra?

MySQL’s InnoDB supplies ACID transaction support, whereas Cassandra provides AID
transaction support. The “C” or consistency part of transaction support does not apply to
Cassandra, as there is no concept of referential integrity or foreign keys in a NoSQL database.
There is also no concept of commit/rollback in Cassandra.

Batch operations are supported in Cassandra via the BATCH option in CQL.

What type of data consistency does DataStax Enterprise/Cassandra support?

DataStax Enterprise and Cassandra support “tunable data consistency.” This type of consistency
is the kind represented by the “C” in the CAP theorem, which concerns distributed systems.

Cassandra extends the concept of “eventual consistency” in NoSQL databases by offering
tunable consistency. For any given read or write operation, the client application decides how
consistent the requested data should be.

Consistency levels in Cassandra can be set on any read or write query. This allows application
developers to tune consistency on a per-query/operation basis depending on their requirements
for response time versus data accuracy. Cassandra offers a number of consistency levels for
both reads and writes.
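The arithmetic behind tunable consistency can be sketched as follows. The sketch covers only three of Cassandra's levels (ONE, QUORUM, and ALL) and uses the standard quorum rule that a read is guaranteed to overlap the latest write when W + R > N, where N is the replication factor. The function names are invented for the example, and real clusters add behavior this does not model.

```python
def replicas_required(level, replication_factor):
    """Replica acknowledgements needed for a consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when every read replica set must overlap every write replica set,
    i.e. when W + R > N."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return w + r > replication_factor
```

With a replication factor of 3, QUORUM writes plus QUORUM reads overlap (2 + 2 > 3), while ONE plus ONE does not (1 + 1 is not greater than 3). The latter is the faster but eventually consistent choice.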

What parts of my MySQL database cannot be migrated to DataStax Enterprise/Cassandra?

Schema, data, and general indexes may be migrated, but objects that currently cannot be
migrated include:
    •   Stored procedures
    •   Views
    •   Triggers
    •   Functions
    •   Security privileges
    •   Referential integrity constraints
    •   Rules
    •   Partitioned table definitions

Do I need to use a MySQL caching layer (like memcached) with Cassandra?
No. Cassandra negates the need for extra software caching layers like memcached through its
distributed architecture, fast write throughput capabilities, and internal memory caching
structures. When you want more memory cache available to your cluster, you simply add more
nodes and Cassandra handles the rest.




Is data absolutely safe in Cassandra?

Yes. First, data durability is fully supported in Cassandra: any data written to a database
cluster is first written to a commit log, much as nearly every popular RDBMS does.

Second, Cassandra offers tunable data consistency. This means a developer or administrator can
choose how strong they wish consistency across nodes to be. The strongest form of consistency
is to mandate that any data modification be made to all replica nodes, with any unsuccessful
attempt on a node resulting in a failed data operation. At that setting, Cassandra provides
consistency in the CAP sense, in that all readers will see the same values.

How is data written and stored in Cassandra?

Cassandra has been architected for consuming large amounts of data as fast as possible. To
accomplish this, Cassandra first writes new data to a commit log to ensure it is safe. After that,
the data is written to an in-memory structure called a memtable. Cassandra deems the write
successful once it is stored on both the commit log and a memtable, which provides the durability
required for mission-critical systems.

Once a memtable’s memory limit is reached, its contents are flushed to disk in the form of an
SSTable (sorted strings table). An SSTable is immutable, meaning it is never written to again.
If data contained in an SSTable is later modified, the new data is written to Cassandra in an
upsert fashion, and the previous data is removed automatically when SSTables are later compacted.

Because SSTables are immutable and only written once the corresponding memtable is full,
Cassandra avoids random seeks and instead only performs sequential I/O in large batches,
resulting in high write throughput.

A related factor is that Cassandra doesn’t have to do a read as part of a write (i.e., check index to
see where current data is). This means that insert performance remains high as data size grows,
while with b-tree based engines (e.g., MongoDB) it deteriorates.
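The write path described in this answer can be mimicked with a toy in-memory model. This is a teaching sketch, not Cassandra's implementation. The class name, the two-key memtable limit, and the simplified `compact` step are assumptions chosen to keep the example small.

```python
class TinyLsmStore:
    """Toy write path: commit log -> memtable -> immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # max keys held in memory
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # flushed tables, oldest first

    def write(self, key, value):
        # Append-only and blind: no lookup of any existing value is needed.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # One sequential dump of the memtable, sorted by key, never edited again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: memtable first, then tables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all tables, keeping only the newest value for each key.
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []


store = TinyLsmStore(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # memtable reaches its limit and is flushed
store.write("a", 10)  # upsert: the new value lands in a fresh memtable
store.write("c", 3)   # second flush
```

Note that `write` never reads existing data, which is why insert cost does not grow with data size in this model. The stale value for `"a"` stays in the first flushed table until `compact` merges the tables.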

What kind of query language is provided in Cassandra? Is it like SQL in MySQL?

Cassandra supplies the Cassandra Query Language (CQL), which is very SQL-like. Queries are
done via the standard SELECT command, while DML operations are accomplished via the familiar
INSERT, UPDATE, DELETE, and TRUNCATE commands. DDL commands such as CREATE are
used to create new keyspaces and column families.

Although CQL has many similarities to SQL, it does not change the underlying Cassandra data
model. There is no support for JOINs, for example.




Appendix B – Comparing MySQL and DataStax Enterprise/Cassandra
This technical appendix provides brief comparisons between Oracle’s MySQL and DataStax
Enterprise/Cassandra.


General Comparisons
 Feature/Function                 MySQL                             DataStax Enterprise/Cassandra

 Platform support                 Linux, Windows, Solaris, Unix,    Linux, Windows, Mac
                                  Mac

 Data model                       Relational/tabular                Google Bigtable

 Primary data object              Table                             Column family

 Data variety support             Primarily structured              Structured, semi-structured,
                                                                    unstructured

 Data partitioning/sharding       Manual in general MySQL;          Automatic
 model                            automatic in MySQL Cluster

 Logical database container       Database                          Keyspace

 Indexes                          Primary, secondary, clustered,    Primary, secondary
                                  full-text

 Distribution architecture        Master/slave or synchronous       Peer-to-peer with replication
                                  replication with MySQL Cluster

 CAP consistency model            Synchronous with MySQL            Tunable consistency per
                                  Cluster                           operation

 Multi-data center support        Basic                             Multi-data center and cloud, with
                                                                    rack awareness

 Transaction support              ACID                              AID (no “C” as there are no
                                                                    foreign keys)

 Memory usage model               General caches, query cache,      Distributed object/row caches
                                  main memory option with           across all nodes in a cluster
                                  MySQL Cluster

 Core language                    SQL                               CQL

 Primary query utilities          mysql command line client         CQL shell; CLI

 Development language support     Many (e.g., Java, Python)         Many (e.g., Java, Python)

 Large data volume support        Low TBs done with third-party     Native, TB-PB support
 (TBs+)                           storage engines

 Data compression                 Built into some storage engines   Built in

 Analytic support                 Some analytic functions           Done via Hadoop (MapReduce,
                                                                    Hive, Pig, Mahout)

 Search support                   Full-text indexes                 Done via Solr integration

 Geospatial support               Spatial extensions                Done via Solr integration

 Logging (e.g., web, application) Nothing built in                  Handled via log4j
 data support

 Mixed workload support           Must separate/ETL data            All handled in one cluster with
                                  between OLTP, analytic, search    built-in workload isolation

 Backup/recovery                  Online, point-in-time restore     Online, point-in-time restore

 Enterprise                       MySQL Enterprise Monitor          DataStax OpsCenter
 management/monitoring




Datatype Comparisons
 CQL datatype          MySQL datatype   Description

 blob                  blob             Arbitrary hexadecimal bytes (no validation)

 ascii                 char             US-ASCII character string

 text, varchar         text, varchar    UTF-8 encoded string

 varint                numeric          Arbitrary-precision integer

 int                   int              4-byte signed integer

 bigint                bigint           8-byte signed integer

 uuid                  None             Type 1 or type 4 UUID

 timestamp             timestamp        Date plus time, encoded as 8 bytes since epoch

 boolean               bit              True or false

 float                 float            4-byte floating point

 double                double           8-byte floating point

 decimal               decimal          Variable-precision decimal

 counter               None             Distributed counter value (8-bytes long)
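When porting a schema by hand, the pairings in this table can be captured as a simple lookup. The Python sketch below takes the table's pairings as given and is not an official conversion tool. The function name is invented, and some pairings deserve a second look for real data (for example, a MySQL char column holding non-ASCII text would not fit the ascii type).

```python
# MySQL -> CQL type names, read off the comparison table above.
MYSQL_TO_CQL = {
    "blob": "blob",
    "char": "ascii",
    "text": "text",
    "varchar": "varchar",
    "numeric": "varint",
    "int": "int",
    "bigint": "bigint",
    "timestamp": "timestamp",
    "bit": "boolean",
    "float": "float",
    "double": "double",
    "decimal": "decimal",
}


def cql_type_for(mysql_type):
    """Return the CQL type for a MySQL column type such as 'VARCHAR(64)'."""
    # Strip any length/precision suffix and normalize case before the lookup.
    base = mysql_type.split("(", 1)[0].strip().lower()
    try:
        return MYSQL_TO_CQL[base]
    except KeyError:
        raise ValueError(f"no CQL equivalent listed for {mysql_type!r}") from None
```

Types the table does not list raise an error rather than guessing, so gaps surface early in a migration.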




© 2012 DataStax. All rights reserved.                                                 23

Mais conteúdo relacionado

Mais procurados

Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?SnapLogic
 
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...Igor De Souza
 
The SnapLogic Integration Cloud for ServiceNow
The SnapLogic Integration Cloud for ServiceNowThe SnapLogic Integration Cloud for ServiceNow
The SnapLogic Integration Cloud for ServiceNowSnapLogic
 
Altis AWS Snowflake Practice
Altis AWS Snowflake PracticeAltis AWS Snowflake Practice
Altis AWS Snowflake PracticeSamanthaSwain7
 
The Impact of SMACT on the Data Management Stack
The Impact of SMACT on the Data Management StackThe Impact of SMACT on the Data Management Stack
The Impact of SMACT on the Data Management StackSnapLogic
 
Data Warehousing in the Cloud: Practical Migration Strategies
Data Warehousing in the Cloud: Practical Migration Strategies Data Warehousing in the Cloud: Practical Migration Strategies
Data Warehousing in the Cloud: Practical Migration Strategies SnapLogic
 
Webinar: SnapLogic Fall 2014 Release Brings iPaaS to the Enterprise
Webinar: SnapLogic Fall 2014 Release Brings iPaaS to the EnterpriseWebinar: SnapLogic Fall 2014 Release Brings iPaaS to the Enterprise
Webinar: SnapLogic Fall 2014 Release Brings iPaaS to the EnterpriseSnapLogic
 
PgConf 2018 - Postgres in a World of DevOps
PgConf 2018 - Postgres in a World of DevOpsPgConf 2018 - Postgres in a World of DevOps
PgConf 2018 - Postgres in a World of DevOpsEDB
 
Business Intelligence in the Cloud I
Business Intelligence in the Cloud IBusiness Intelligence in the Cloud I
Business Intelligence in the Cloud IRightScale
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
 
Postgres Vision 2018: The Changing Role of the DBA in the Cloud
Postgres Vision 2018: The Changing Role of the DBA in the CloudPostgres Vision 2018: The Changing Role of the DBA in the Cloud
Postgres Vision 2018: The Changing Role of the DBA in the CloudEDB
 
Webinar: The 5 Most Critical Things to Understand About Modern Data Integration
Webinar: The 5 Most Critical Things to Understand About Modern Data IntegrationWebinar: The 5 Most Critical Things to Understand About Modern Data Integration
Webinar: The 5 Most Critical Things to Understand About Modern Data IntegrationSnapLogic
 
What is BI on Cloud
What is BI on CloudWhat is BI on Cloud
What is BI on Cloudtdwiindia
 
Connect Faster with SnapLogic at Workday Rising
Connect Faster with SnapLogic at Workday RisingConnect Faster with SnapLogic at Workday Rising
Connect Faster with SnapLogic at Workday RisingSnapLogic
 
Data Services and the Modern Data Ecosystem (ASEAN)
Data Services and the Modern Data Ecosystem (ASEAN)Data Services and the Modern Data Ecosystem (ASEAN)
Data Services and the Modern Data Ecosystem (ASEAN)Denodo
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Mariano Gonzalez
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftSnapLogic
 
Microsof azure class 1- intro
Microsof azure   class 1- introMicrosof azure   class 1- intro
Microsof azure class 1- introMHMuhammadAli1
 

Mais procurados (20)

Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?Beyond Batch: Is ETL still relevant in the API economy?
Beyond Batch: Is ETL still relevant in the API economy?
 
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
Data Engineer, Patterns & Architecture The future: Deep-dive into Microservic...
 
The SnapLogic Integration Cloud for ServiceNow
The SnapLogic Integration Cloud for ServiceNowThe SnapLogic Integration Cloud for ServiceNow
The SnapLogic Integration Cloud for ServiceNow
 
Altis AWS Snowflake Practice
Altis AWS Snowflake PracticeAltis AWS Snowflake Practice
Altis AWS Snowflake Practice
 
The Impact of SMACT on the Data Management Stack
The Impact of SMACT on the Data Management StackThe Impact of SMACT on the Data Management Stack
The Impact of SMACT on the Data Management Stack
 
Data Warehousing in the Cloud: Practical Migration Strategies
Data Warehousing in the Cloud: Practical Migration Strategies Data Warehousing in the Cloud: Practical Migration Strategies
Data Warehousing in the Cloud: Practical Migration Strategies
 
Webinar: SnapLogic Fall 2014 Release Brings iPaaS to the Enterprise
Webinar: SnapLogic Fall 2014 Release Brings iPaaS to the EnterpriseWebinar: SnapLogic Fall 2014 Release Brings iPaaS to the Enterprise
Webinar: SnapLogic Fall 2014 Release Brings iPaaS to the Enterprise
 
PgConf 2018 - Postgres in a World of DevOps
PgConf 2018 - Postgres in a World of DevOpsPgConf 2018 - Postgres in a World of DevOps
PgConf 2018 - Postgres in a World of DevOps
 
Business Intelligence in the Cloud I
Business Intelligence in the Cloud IBusiness Intelligence in the Cloud I
Business Intelligence in the Cloud I
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Postgres Vision 2018: The Changing Role of the DBA in the Cloud
Postgres Vision 2018: The Changing Role of the DBA in the CloudPostgres Vision 2018: The Changing Role of the DBA in the Cloud
Postgres Vision 2018: The Changing Role of the DBA in the Cloud
 
Webinar: The 5 Most Critical Things to Understand About Modern Data Integration
Webinar: The 5 Most Critical Things to Understand About Modern Data IntegrationWebinar: The 5 Most Critical Things to Understand About Modern Data Integration
Webinar: The 5 Most Critical Things to Understand About Modern Data Integration
 
What is BI on Cloud
What is BI on CloudWhat is BI on Cloud
What is BI on Cloud
 
Connect Faster with SnapLogic at Workday Rising
Connect Faster with SnapLogic at Workday RisingConnect Faster with SnapLogic at Workday Rising
Connect Faster with SnapLogic at Workday Rising
 
Data Services and the Modern Data Ecosystem (ASEAN)
Data Services and the Modern Data Ecosystem (ASEAN)Data Services and the Modern Data Ecosystem (ASEAN)
Data Services and the Modern Data Ecosystem (ASEAN)
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
 
Business Intelligence In The Cloud
Business Intelligence In The CloudBusiness Intelligence In The Cloud
Business Intelligence In The Cloud
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
Microsof azure class 1- intro
Microsof azure   class 1- introMicrosof azure   class 1- intro
Microsof azure class 1- intro
 
Informatica Cloud Overview
Informatica Cloud OverviewInformatica Cloud Overview
Informatica Cloud Overview
 

Semelhante a Why Migrate from MySQL to Cassandra

Guide to NoSQL with MySQL
Guide to NoSQL with MySQLGuide to NoSQL with MySQL
Guide to NoSQL with MySQLSamuel Rohaut
 
Making Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsMaking Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsRackspace
 
oracle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdforacle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdfssuserf8f9b2
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paperjuly12jana
 
GigaOm-sector-roadmap-cloud-analytic-databases-2017
GigaOm-sector-roadmap-cloud-analytic-databases-2017GigaOm-sector-roadmap-cloud-analytic-databases-2017
GigaOm-sector-roadmap-cloud-analytic-databases-2017Jeremy Maranitch
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data LakeMetroStar
 
Data warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaData warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaJyrki Määttä
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An IntroductionDenodo
 
Big Data using NoSQL Technologies
Big Data using NoSQL TechnologiesBig Data using NoSQL Technologies
Big Data using NoSQL TechnologiesAmit Singh
 
Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...
Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...
Data Fabric - Why Should Organizations Implement a Logical and Not a Physical...Denodo
 
Building a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperBuilding a Big Data Analytics Platform- Impetus White Paper
Building a Big Data Analytics Platform- Impetus White PaperImpetus Technologies
 
When and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureWhen and How Data Lakes Fit into a Modern Data Architecture
When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lakesambiswal
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
¿Cómo modernizar una arquitectura de TI con la virtualización de datos?
¿Cómo modernizar una arquitectura de TI con la virtualización de datos?¿Cómo modernizar una arquitectura de TI con la virtualización de datos?
¿Cómo modernizar una arquitectura de TI con la virtualización de datos?Denodo
 
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdffinal-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdf
final-the-data-teams-guide-to-the-db-lakehouse-platform-rd-6-14-22.pdfXIAOZEJIN1
 
Big data - Cassandra
Big data - CassandraBig data - Cassandra
Big data - CassandraJen Wei Lee
 
Microservices Patterns with GoldenGate
Microservices Patterns with GoldenGateMicroservices Patterns with GoldenGate
Microservices Patterns with GoldenGateJeffrey T. Pollock
 
Guide to MySQL Embedded
Guide to MySQL EmbeddedGuide to MySQL Embedded
Guide to MySQL EmbeddedVlad Alexandru
 

Semelhante a Why Migrate from MySQL to Cassandra (20)

Guide to NoSQL with MySQL
Guide to NoSQL with MySQLGuide to NoSQL with MySQL
Guide to NoSQL with MySQL
 
Making Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High ExpectationsMaking Sense of NoSQL and Big Data Amidst High Expectations
Making Sense of NoSQL and Big Data Amidst High Expectations
 
oracle-adw-melts snowflake-report.pdf
oracle-adw-melts snowflake-report.pdforacle-adw-melts snowflake-report.pdf





Why Migrate from MySQL to Cassandra

  • 1. Why Migrate from MySQL to Cassandra? WHITE PAPER By DataStax Corporation July 2012
  • 2. Contents
Introduction
Why Stay with MySQL?
Why Migrate from MySQL?
    Architectural Limitations
    Data Model Limitations
    Scalability and Performance Limitations
Why Migrate to Cassandra and DataStax Enterprise?
    A Technical Overview of Cassandra
    Cassandra vs. Other NoSQL Solutions
    Who’s Using Cassandra?
    A Quick Look at DataStax
    DataStax Enterprise – The Choice for Production Big Data Deployments
    What About Cost?
How to Migrate from MySQL to Cassandra
    Using Sqoop to Migrate from MySQL
    Using Pentaho Kettle to Migrate from MySQL
Examples of Customers Who Have Switched from MySQL
    Mahalo
    Pantheon Systems
    Formspring
Conclusion
About DataStax
Appendix A – FAQ on Switching from MySQL to DataStax Enterprise/Cassandra
Appendix B – Comparing MySQL and DataStax Enterprise/Cassandra
    General Comparisons
    Datatype Comparisons
© 2012 DataStax. All rights reserved.
  • 3. Introduction

Founded in 1995, MySQL has been one of the de facto infrastructure pieces in applications that target the Web. The database component of the Internet LAMP stack, MySQL has enjoyed wide success in adoption. This is for good reason: MySQL provides solid relational database management system (RDBMS) capabilities in an open source package that enables companies to build systems that perform well from a database perspective in many general purpose use cases.

MySQL was acquired by Sun Microsystems in 2008, and was formally acquired by Oracle in January 2010 (via the company’s acquisition of Sun Microsystems). Now part of Oracle’s stable of database products, MySQL continues to be promoted and sold through Oracle. In addition to Oracle, a number of other vendors have either forked MySQL to create another database offering or are using MySQL as part of a specialized service offering. Examples include Percona, Monty Program AB, Infobright, Calpont, and Amazon RDS.

While Oracle’s MySQL remains a good RDBMS that performs well for the use cases it was designed for, even its strongest supporters admit that it is not architected to tackle the new wave of big data applications being developed today. In fact, in the same way that the needs of late 20th century web companies helped give birth to MySQL and drive its success, modern businesses today that need to manage big data use cases are helping to forge a different set of technologies that are replacing MySQL in many situations.

This paper examines the why’s and how’s of migrating from Oracle’s MySQL to these new big data technologies, such as Apache Cassandra™ and Apache Hadoop™. It also looks at the benefits of moving from MySQL to a fully integrated data stack that combines those technologies and others together into a single big data platform like that found in DataStax Enterprise.
  • 4. Why Stay with MySQL?

Before discussing why it might be necessary to develop new applications on, or migrate existing MySQL systems to, another database platform, it makes sense to understand why staying with MySQL as a database service may be the right choice. Because database migrations in particular can be resource-intensive, IT professionals should be sure they are making the right decision before making such a move.

Oracle’s MySQL and MySQL Cluster are good tools that satisfy a wide range of use cases. In general, IT managers should consider staying with MySQL if their existing system or planned new application exhibits the following characteristics and needs:

    • ACID-compliant transactions, with nested transactions, commits/rollbacks, and full referential integrity required.
    • A very normalized data model that is well served by the Codd-Date relational design, and one where join operations cannot be avoided.
    • Data is primarily structured, with little to no unstructured or semi-structured data present.
    • Low to moderate data volumes that can be handled easily by the MySQL optimizer.
    • Telco applications that require main memory solutions and whose data is primarily accessed via primary keys.
    • Scale-out architectures that are primarily read in nature, with no need to write to multiple masters or servers that exist in many different cloud zones or geographies.
    • No requirement for a single database/cluster to span many different data centers.
    • High availability requirements that can be met by a synchronous replication architecture maintained primarily at a single data center.

If an application presents these and similar requirements, then Oracle’s MySQL may indeed be a good fit as a database platform.

Why Migrate from MySQL?

One primary reason why various IT organizations are either already migrating away from Oracle’s MySQL or planning to do so is the very visible rise of big data applications, and specifically big data online transaction processing (OLTP) applications. These companies either have existing systems morphing into big data systems, or they are planning new applications that are big data in nature and need something “more” than MySQL for their database platform.

Although the top industry IT analyst groups may disagree on various technical and marketplace trends, they are, remarkably, in agreement on the following three things: (1) the upsurge of big data applications; (2) the definition of what constitutes big data; and (3) the need for new technology to deal with big data.
  • 5. The momentum and growth of big data applications is unmistakable. Underscoring this is a recent survey of 600 IT professionals that revealed nearly 70 percent of organizations are now considering, planning, or running big data projects. [1]

As for a definition of big data, it is nearly universally agreed that big data involves one or more of the following:

    1. Velocity – data coming in at extreme rates of speed.
    2. Variety – the types of data needing to be captured (structured, semi-structured, and unstructured).
    3. Volume – sizes that potentially involve terabytes to petabytes of data.
    4. Complexity – involves everything from moving operational data into big data platforms, maintaining various data “silos,” and the difficulty in managing data across multiple sites and geographies.

Although they may phrase it differently, top analyst groups and market observers agree that to tackle big data applications that have the above characteristics, something more than standard RDBMSs, like Oracle’s MySQL, is needed.

For example, David Kellogg says big data is “too big to be reasonably handled by current/traditional technologies.” [2] Consulting and research firm McKinsey & Company agrees with Kellogg’s concept of big data and defines it as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” [3]

IDC says: “Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.” [4]

And finally, O’Reilly defines big data in this way: “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” [5]

In particular, big data applications are nudging MySQL aside for other options. While MySQL initially gained its popularity through use of the MyISAM storage engine, the InnoDB transactional engine is arguably the most used today.

[1] “Survey: 70 Percent of Organizations Have Big Plans for Big Data,” by Pedro Hernandez, EnterpriseAppsToday.com, May 14, 2012: http://www.enterpriseappstoday.com/data-management/survey-70-percent-enterprises-big-plans-for-big-data.html.
[2] “‘Big data’ has jumped the shark,” DBMS2, September 11, 2011: http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/.
[3] Big Data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, May 2011, p. 11: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
[4] Extracting Value from Chaos, by John Gantz and David Reinsel, IDC, June 2011: http://idcdocserv.com/1142.
[5] “What Is Big Data? An Introduction to the Big Data Landscape,” by Edd Dumbill, O’Reilly Radar, January 11, 2012: http://radar.oreilly.com/2012/01/what-is-big-data.html.

  • 6. But InnoDB isn’t designed to handle the types of big data requirements discussed above. What types of limitations, bottlenecks, and issues are MySQL users experiencing? Although the exact situations vary, a few of the most prevalent reasons for a move from MySQL are as follows:

Architectural Limitations

One reason modern businesses are switching from Oracle’s MySQL to big data platforms is that the underlying architecture does not support key big data use cases. This is true regardless of which MySQL products are being considered: MySQL Community/Enterprise, MySQL Cluster, or database services such as Amazon RDS. Some of the architectural issues that arise when MySQL is thrown into big data situations include:

    • The traditional master-slave architecture of MySQL (one write master with 1-n slaves) prohibits “location independent” or “read/write anywhere” use cases. These are very common in big data environments, where a database cluster is spread across many different geographies and data centers (and the cloud), with each node needing to support both reads and writes.
    • The need to manually shard (i.e., partition) general MySQL systems to overcome various performance shortcomings is time-consuming, error-prone, and expensive to support. It also places a heavy burden on development staff, who must support sharding logic in the application.
    • Failover and failback situations tend to require manual intervention with generally replicated MySQL systems. Failback can be especially challenging.
    • Although it provides automatic sharding and supports simple geographic replication, MySQL Cluster’s dependence on synchronous replication can cause latency and transactional response time issues. Further, its geographic replication does not support multiple (i.e., >2) data centers in a way that either performs well or is easy to manage.
    • Database services such as Amazon’s RDS suffer from the same shortfalls, as Amazon only supports either a simple standby server maintained in a different availability zone in Amazon’s cloud, or a series of read replicas that are provisioned to help service increased query (not write) traffic.

Data Model Limitations

A big reason why many businesses are moving to NoSQL-based solutions is that the legacy RDBMS data model is not flexible enough to handle big data use cases that contain a mixture of structured, semi-structured, and unstructured data. While MySQL has good datatype support for traditional RDBMS situations that deal with structured data, it lacks the dynamic data model necessary to tackle high-velocity data coming in from machine-generated systems or time series applications, as well as cases that need to manage semi-structured and unstructured data.
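As a rough illustration of what a dynamic data model means for time series data, the Python sketch below stores each data source as one wide row whose columns are keyed by timestamp, so new columns (and columns of differing shape) can be added without any schema change. It is a toy in-memory structure under stated assumptions, not Cassandra's API: the class name `WideRowStore`, its methods, and the sample sensor readings are all hypothetical.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class WideRowStore:
    """Toy wide-row layout: one row per source, one column per timestamp."""

    def __init__(self):
        self._rows = defaultdict(dict)   # row key -> {timestamp: value}
        self._order = defaultdict(list)  # row key -> sorted timestamps

    def insert(self, row_key, timestamp, value):
        # A new column needs no schema change; it is just another
        # timestamp kept in sorted order within the row.
        row = self._rows[row_key]
        if timestamp not in row:
            insort(self._order[row_key], timestamp)
        row[timestamp] = value

    def slice(self, row_key, start, end):
        # Return the columns whose timestamps fall in [start, end].
        order = self._order.get(row_key, [])
        lo, hi = bisect_left(order, start), bisect_right(order, end)
        return [(ts, self._rows[row_key][ts]) for ts in order[lo:hi]]


store = WideRowStore()
# Hypothetical readings; note the last one carries an extra field.
store.insert("sensor-42", 1341100860, {"temp_c": 21.7})
store.insert("sensor-42", 1341100800, {"temp_c": 21.5})
store.insert("sensor-42", 1341100920, {"temp_c": 21.9, "humidity": 0.41})

recent = store.slice("sensor-42", 1341100850, 1341100950)
```

Keeping the columns sorted by timestamp is what makes range reads over a time window cheap in this sketch; a fixed relational schema would instead need one row per reading or a schema change for each new field.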
Recently, Oracle announced it had introduced a NoSQL-type interface into its MySQL Cluster product that is key/value in design. While certainly helpful in some situations, such a design still falls short in key big data use cases like time series applications that require inserting data into structures that support tens of thousands of columns.

Scalability and Performance Limitations

Oracle’s MySQL has long been touted as a scale-out database. However, those who know and use MySQL admit it has limitations that negate its use in big data situations where scalability is required. For example:

• More servers can be added to a general MySQL Community/Enterprise cluster to help service more reads, but writes are still bottlenecked at the main write master server. Moreover, if many read slave servers are required, latency issues can arise in the process of simply getting the data from the master server to all the slaves.
• Consumption of high-velocity data can be challenging, especially if the InnoDB storage engine is used, as the index-organized structure often does not handle high insert rates well. Third-party storage engines, which are columnar in nature, typically cannot help in this case, as they rely on their proprietary high-speed loaders to load data quickly into a database.
• Data volumes over half a terabyte become a real challenge for the MySQL optimizer. To overcome this, a third-party storage engine from a vendor such as Calpont or Infobright must be used, but these vendors have limitations in their SQL support, their MPP capabilities, or both.

Why Migrate to Cassandra and DataStax Enterprise?

While a move from Oracle’s MySQL may be necessary because of its inability to handle key big data use cases, why should that move involve a switch to Apache Cassandra and DataStax Enterprise? The sections that follow describe why a move to Cassandra and DataStax Enterprise makes both technical and business sense for MySQL users seeking alternatives.
A Technical Overview of Cassandra

Apache Cassandra, an Apache Software Foundation project, is an open source NoSQL distributed database management system. Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance.
In selecting an alternative to Oracle’s MySQL, IT professionals will find Apache Cassandra is a standout among other NoSQL offerings for the following technical reasons:

• Massively scalable architecture – Cassandra’s masterless, peer-to-peer architecture overcomes the limitations of master-slave designs and allows for both high availability and massive scalability. Cassandra is the acknowledged NoSQL leader when it comes to comfortably scaling to terabytes or petabytes of data, while maintaining industry-leading write and read performance.
• Linear scale performance – Nodes added to a Cassandra cluster (all done online) increase the throughput of a database in a predictable, linear fashion for both read and write operations, even in the cloud, where such predictability can be difficult to ensure.
• Continuous availability – Data is replicated to multiple nodes in a Cassandra database cluster to protect from loss during node failure and to provide continuous availability with no downtime.
• Transparent fault detection and recovery – Cassandra clusters can grow into the hundreds or thousands of nodes. Because Cassandra was designed for commodity servers, machine failure is expected. Cassandra utilizes gossip protocols to detect machine failure and recover when a machine is brought back into the cluster, all without the application noticing.
• Flexible, dynamic schema data modeling – Cassandra offers the organization of a traditional RDBMS table layout combined with the flexibility and power of no stringent structure requirements. This allows data to be stored dynamically as needed without performance penalty for changes that occur. In addition, Cassandra can store structured, semi-structured, and unstructured data.
• Guaranteed data safety – Cassandra far exceeds other systems on write performance due to its append-only commit log while always ensuring durability. Users no longer need to trade off durability to keep up with immense write streams.
Data written to Cassandra is recorded in the commit log and replicated to multiple nodes, so it is protected against the loss of individual nodes.
• Distributed, location independence design – Cassandra’s architecture avoids the hot spots and read/write issues found in master-slave designs. Users can have a highly distributed database (e.g., multi-geography, multi-data center) and read or write to any node in a cluster without concern over which node is being accessed.
• Tunable data consistency – Cassandra offers flexible data consistency on a cluster, data center, or individual I/O operation basis. Very strong or eventual data consistency among all participating nodes can be set globally and also controlled on a per-operation basis (e.g., per INSERT, per UPDATE).
• Multi-data center replication – Whether it’s keeping data in multiple locations for disaster recovery scenarios or locating data physically near its end users for fast performance, Cassandra offers support for multiple data centers. Administrators simply configure how many copies of the data they want in each data center, and Cassandra handles the rest, replicating the data automatically. Cassandra is also rack-aware and can keep replicas of data stored on different physical racks, which helps ensure uptime in the case of single rack failures.
• Cloud-enabled – Cassandra’s architecture maximizes the benefits of running in the cloud. Also, Cassandra allows for hybrid data distribution where some data can be kept on-premises and some in the cloud.
• Data compression – Cassandra supplies built-in data compression, with up to an 80 percent reduction in raw data footprint. More importantly, Cassandra’s compression results in no performance penalty, with some use cases showing actual read/write operations speeding up due to less physical I/O being required.
• CQL (Cassandra Query Language) – Cassandra provides a SQL-like language called CQL that mirrors SQL’s DDL, DML, and SELECT syntax. CQL greatly decreases the learning curve for those coming from RDBMS systems because they can use familiar syntax for all object creation and data access operations.
• No caching layer required – Cassandra offers caching on each of its nodes. Coupled with Cassandra’s scalability characteristics, nodes can be incrementally added to the cluster to keep as much data in memory as needed. The result is that there is no need for a separate caching layer.
• No special hardware needed – Cassandra runs on commodity machines and requires no expensive or special hardware.
• Incremental and elastic expansion – The Cassandra ring allows online node additions. Because of Cassandra’s fully distributed architecture, every node type is the same, which means clusters can grow as needed without any complex architecture decisions.
• Simple install and setup – Cassandra can be downloaded and installed in minutes, even for multi-cluster installs.
• Ready for developers – Cassandra has drivers and client libraries for all the popular development languages (e.g., Java, Python).
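The tunable data consistency bullet above can be made concrete with a little arithmetic. The following Python sketch is a toy model, not Cassandra code: it assumes a replication factor N and applies the commonly cited rule that a read is guaranteed to overlap the most recent write when the number of replicas read (R) plus the number of replicas written (W) exceeds N. The level names ONE, QUORUM, and ALL mirror Cassandra consistency levels; the function names are hypothetical and chosen only for illustration.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must respond for an operation at the given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read set must overlap the write set (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the example
    for write_level, read_level in [("ONE", "ONE"),
                                    ("QUORUM", "QUORUM"),
                                    ("ALL", "ONE")]:
        strong = read_sees_latest_write(write_level, read_level, rf)
        print(write_level, read_level, "strong" if strong else "eventual")
```

Under these assumptions, writing and reading at ONE trades consistency for latency, while QUORUM on both sides or ALL on one side gives overlapping replica sets. This is the per-operation trade-off the bullet describes.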
Given these technical features and benefits, the following are typical big data use cases handled well by Cassandra in the enterprise:

• Big data situations
• Time series data management
• High-velocity device data ingestion and analysis
• Healthcare system input and analysis
• Media streaming management (e.g., music, movies)
• Social media (i.e., unstructured data) input and analysis
• Online web retail (e.g., shopping carts, user transactions)
• Real-time data analytics
• Online gaming (e.g., real-time messaging)
• Software as a Service (SaaS) applications that utilize web services
• Write-intensive systems
Cassandra vs. Other NoSQL Solutions

What does the performance of Cassandra look like compared to other NoSQL options? While each use case is different, external benchmarks such as the YCSB test show Cassandra outperforming its rivals in a number of situations (see footnote 6). This particular benchmark shows Cassandra delivering nearly 4x the write performance, 2x the read performance, and better than 12x overall performance in a mixed workload use case over another leading NoSQL provider.

Who’s Using Cassandra?

One benefit MySQL users have enjoyed is a large community of users who have deployed the database in many production environments. Cassandra, likewise, is used in many industries for modern applications that need scale, fast performance, data flexibility, and easy data distribution. Below is a snapshot of some of the companies and organizations that use Cassandra in production. (Note: There are many other household name companies with production implementations that cannot be published due to NDA restrictions.)

6 “NoSQL Benchmarking,” CUBRID: http://blog.cubrid.org/dev-platform/nosql-benchmarking/?utm_source=NoSQL+Weekly+List&utm_campaign=143fae86b2-NoSQL_Weekly_Issue_41_September_8_2011&utm_medium=email.
A Quick Look at DataStax

A viable company behind an open source offering is vital for enterprises that want the benefits of open source software but also need the professional-grade features and support typically supplied with proprietary software offerings.

DataStax is the leading provider of modern enterprise database software products and services based on Apache Cassandra. It employs the Apache chair of Cassandra as well as most of the project’s committers. At the time of this writing, DataStax has nearly 50 employees and over 170 customers, though both of these statistics are growing rapidly.

DataStax provides free and open source NoSQL products as well as commercial solutions aimed at production big data environments, with its flagship solution being DataStax Enterprise.

DataStax Enterprise – The Choice for Production Big Data Deployments

Apache Cassandra can be likened to MySQL’s community server in that both are free and open source. By contrast, DataStax Enterprise is more akin to the MySQL Enterprise or MySQL Cluster Carrier Grade editions in that it is designed for production deployments that power key business systems.

DataStax Enterprise is tailor-made to manage big data effectively. The solution inherits Cassandra’s entire, powerful feature set for servicing modern big data operational applications, and smartly integrates a fault-tolerant analytics platform that provides Hadoop™ MapReduce, Hive, Pig, Mahout, and Sqoop support for business intelligence systems. It also includes enterprise search capabilities via Apache Solr™, which is the most popular open source search software in use today.

A key differentiator of DataStax Enterprise over other big data providers is that real-time, analytic, and search workloads are intelligently isolated across a distributed DataStax Enterprise database cluster, so that no competition for underlying compute resources or data occurs.

DataStax Enterprise consists of three components:

1. The DataStax Enterprise Server – built on Apache Cassandra, the server manages real-time data with Cassandra, analytic data with Hadoop, and enterprise search data with Apache Solr.
2. OpsCenter Enterprise – a visual, browser-based solution for managing and monitoring Cassandra and the DataStax Enterprise server.
3. Production Support – full 24x7x365 support from the big data experts at DataStax.

With Hadoop and Solr, the types of use cases that can be tackled with DataStax Enterprise grow exponentially beyond those previously covered with Cassandra alone and include:

• Social media input and analysis
• Web clickstream analysis
• Buyer event and behavior analytics
• Fraud detection and analysis
• Risk analysis and management
• Supply chain analytics
• Web product searches
• Internal document search (e.g., law firms)
• Real estate/property searches
• Social media matchups
• Web and application log management/analysis
What About Cost?

There are many technical benefits in moving from Oracle’s MySQL to Cassandra and DataStax Enterprise, but what about cost? How does DataStax compare with Oracle in that regard?

As of May 2012, Oracle’s MySQL products were priced by subscription, per server/socket, with the following list prices (see footnote 7):

Product | List Price
MySQL Standard Edition Subscription (1-4 socket server) | $2,000
MySQL Standard Edition Subscription (5+ socket server) | $4,000
MySQL Enterprise Edition Subscription (1-4 socket server) | $5,000
MySQL Enterprise Edition Subscription (5+ socket server) | $10,000
MySQL Cluster Carrier Grade Edition Subscription (1-4 socket server) | $10,000
MySQL Cluster Carrier Grade Edition Subscription (5+ socket server) | $20,000

For feature differences in the MySQL products listed above, see http://www.mysql.com/products/.

Like the MySQL Community Edition, DataStax provides a free edition of Apache Cassandra (the DataStax Community Edition), which comes with the latest version of Cassandra, a free version of DataStax’s OpsCenter management and monitoring tool, and quick-start developer aids.

From a production, enterprise perspective, a subscription to DataStax Enterprise Edition may be compared to the MySQL Enterprise and MySQL Cluster Carrier Grade Edition subscriptions. DataStax does not currently price by socket, only by server. Pricing is either on par with MySQL Enterprise for small boxes or, with beefier machines, substantially less (e.g., 50 to 70 percent).

How to Migrate from MySQL to Cassandra

The first step in migrating from MySQL to Cassandra is to understand that data modeling is handled differently in NoSQL solutions than in RDBMSs. In traditional databases such as MySQL, data is modeled in standard “third normal form” design without the need to know what questions will be asked of the data. By contrast, in NoSQL, the questions asked of the data are what drive the data model design, and the data is highly denormalized.
7 “MySQL Global Price List, Software Investment Guide,” Oracle, November 1, 2010: http://www.oracle.com/us/corporate/pricing/price-lists/mysql-pricelist-183985.pdf.
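The modeling contrast described above (third normal form versus a query-driven, denormalized design) can be illustrated with plain Python data structures. This is a minimal sketch under stated assumptions: the `users` and `orders` data, the `orders_by_user` layout, and the function name are all hypothetical, and dictionaries stand in for MySQL tables and a Cassandra column family.

```python
from collections import defaultdict

# Normalized layout, as it might exist in MySQL: users and orders are
# separate "tables" linked by user_id (hypothetical sample data).
users = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
orders = [
    {"order_id": 100, "user_id": 1, "item": "keyboard"},
    {"order_id": 101, "user_id": 2, "item": "monitor"},
    {"order_id": 102, "user_id": 1, "item": "mouse"},
]


def build_orders_by_user(users, orders):
    """Denormalize for the question "what has this user ordered?".

    Each user's orders are grouped under one key, and the user name is
    copied into every entry so that no join is needed at read time.
    """
    table = defaultdict(list)
    for order in orders:
        table[order["user_id"]].append({
            "order_id": order["order_id"],
            "user_name": users[order["user_id"]]["name"],
            "item": order["item"],
        })
    return dict(table)


orders_by_user = build_orders_by_user(users, orders)

# The query is now a single key lookup rather than a join.
print(orders_by_user[1])
```

The cost of this design is duplicated data (the user name is stored once per order), which is the trade the query-driven approach accepts in exchange for simple, fast reads.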
If a developer is simply interested in porting MySQL schema and data over into a Cassandra keyspace (analogous to a database in MySQL) just for testing or preliminary development purposes, there are two primary options:

1. Use the Sqoop interface to move MySQL tables and data.
2. Use ETL (extract-transform-load) tools such as Pentaho’s Kettle to move schema and data.

Using Sqoop to Migrate from MySQL

DataStax Enterprise supports Sqoop, which is a utility designed to transfer data between an RDBMS and Hadoop. Given that DataStax Enterprise combines Cassandra, Hadoop, and Solr together into one big data platform, a developer can use Sqoop to move data not only to a Hadoop system, but also to Cassandra.

The DataStax Enterprise installation package includes a sample/demo of how to move MySQL schema and data into Cassandra. Because Sqoop works via Java Database Connectivity (JDBC), the only prerequisite is that the JDBC driver for MySQL must be downloaded from the MySQL website and placed in a directory where Sqoop has access to it (the /sqoop subdirectory of the main DataStax Enterprise installation is recommended).

Each MySQL table is mapped to a Cassandra column family. Column families follow the Google Bigtable data model, with rows and columns like MySQL but much more dynamic and flexible.

The migration is done via a command line utility that accepts a number of different parameters. For example, the following code migrates a MySQL table that is in the dev database on a server with IP address 127.0.0.1. It connects to MySQL with the root ID and uses no password.
It migrates a table called npa_nxx from MySQL into a Cassandra keyspace named dev, to a column family named npa_nxx_cf, identifies the MySQL column npa_nxx_key as the primary key, names the Cassandra server’s host IP, and lastly, asks for the schema to be created before the data is imported:

    ./dse sqoop import --connect jdbc:mysql://127.0.0.1/dev \
        --username root \
        --table npa_nxx \
        --cassandra-keyspace dev \
        --cassandra-column-family npa_nxx_cf \
        --cassandra-row-key npa_nxx_key \
        --cassandra-thrift-host 127.0.0.1 \
        --cassandra-create-schema
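When several tables need to be moved, it can help to assemble this command programmatically. The Python sketch below is one possible approach rather than a DataStax-provided tool: it uses only the flags that appear in the command above, and the function name and its parameters are hypothetical. It builds the argument list without executing anything.

```python
import shlex


def build_sqoop_import_args(mysql_host, database, username, table,
                            keyspace, column_family, row_key,
                            cassandra_host):
    """Return the dse sqoop import command as an argument list.

    Only flags shown in the example command are used; no others are assumed.
    """
    return [
        "./dse", "sqoop", "import",
        "--connect", f"jdbc:mysql://{mysql_host}/{database}",
        "--username", username,
        "--table", table,
        "--cassandra-keyspace", keyspace,
        "--cassandra-column-family", column_family,
        "--cassandra-row-key", row_key,
        "--cassandra-thrift-host", cassandra_host,
        "--cassandra-create-schema",
    ]


# Rebuild the example from the text: table npa_nxx in the dev database.
args = build_sqoop_import_args(
    mysql_host="127.0.0.1", database="dev", username="root",
    table="npa_nxx", keyspace="dev", column_family="npa_nxx_cf",
    row_key="npa_nxx_key", cassandra_host="127.0.0.1",
)
command = shlex.join(args)  # shell-safe rendering for logging or review
print(command)
```

To actually run the import, the list could be passed to `subprocess.run(args, check=True)` from the directory that contains the `dse` script. That step is left out here because it requires a DataStax Enterprise installation.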
Figure 1 – Using Sqoop to move data from MySQL to DataStax Enterprise

Using Pentaho Kettle to Migrate from MySQL

Another way to migrate MySQL tables and data to Cassandra is to use one of the ETL tools on the market, such as Pentaho’s Data Integration product, also known as Kettle. Pentaho makes two editions of its ETL tool available: a free community edition and a paid enterprise edition. For core ETL tasks such as moving MySQL schema and data to Cassandra, the community edition should provide everything that is needed.

Figure 2 – Kettle’s visual interface for performing ETL operations
Pentaho’s Kettle product provides an easy-to-use graphical user interface (GUI) that allows developers to visually design their MySQL migration tasks. Unlike the Sqoop utility, which just does extract-load, Pentaho’s product allows developers to create transformation routines, from simple to sophisticated, that customize how a MySQL schema and data are moved to Cassandra. In addition, the data movement engine of Kettle is quite efficient, so medium to semi-large data volumes can be moved in a high-performance manner.

More information about Kettle and free downloads can be found at: http://kettle.pentaho.com/.

Examples of Customers Who Have Switched from MySQL

Whether it’s by using Sqoop, ETL tools, or other in-house-developed options, a move from Oracle’s MySQL to big data platforms like Cassandra and DataStax Enterprise is not difficult at all. DataStax has many customers who have made just such a switch but do not discuss it publicly. However, the following are a few customers who were kind enough to let us share their use cases.

Mahalo

Mahalo is a social media learning business that has a top 200 web ranking and experiences 12 million visits per month. Mahalo needed a database provider to manage the activity log for every single customer interaction as well as store information for their Q&A topic database.

Mahalo started out using Oracle’s MySQL. However, performance and availability issues necessitated a move to a database that could scale and handle their heavy write workload. Further, the company needed a more dynamic data model to store the variety of data that was coming in. Mahalo chose Cassandra and DataStax as their MySQL replacement.
Mahalo’s chief technology officer (CTO), Jason Burch, says: “With the Cassandra conversion completed and running smoothly, we’re now free to focus on our primary mission, knowing that we’ll be able to deliver the excellent responsiveness and capabilities that our user community has come to expect.”
Pantheon Systems

Pantheon Systems provides a cloud-based web development platform for websites made with Drupal. Pantheon needed a primary database platform that would hold all metadata information that supports their main application platform and also all of their media storage.

Pantheon began with MySQL, but found it was unable to handle its requirements of managing both structured and unstructured data. Moreover, MySQL could not scale or support Pantheon’s need for continuous availability across multiple data centers. Pantheon switched to Cassandra and DataStax, which met all of their specific requirements.

David Strauss, Pantheon’s CTO, describes the company’s use of Cassandra this way: “All the actual platform data in Pantheon is persisted primarily to Cassandra. We could wipe out pretty much everything on Pantheon, but as long as the Cassandra store is there, we have our data.”

Formspring

Formspring is a social media provider that serves as a group knowledge interaction site. Formspring has over 26 million registered members worldwide who provide more than 10 million responses daily, and the company receives over 30 million unique visitors per month.

Formspring’s initial database platform of MySQL and Amazon’s SimpleDB could not scale, so the company turned to DataStax and Cassandra instead. Cassandra proved to be the only solution Formspring evaluated that could produce the response times and continuous availability that the company’s system needed.

Kyle Ambroff, senior engineer at Formspring, says: “I can safely say that if we had tried to implement some of Formspring’s new features without using Cassandra, we would’ve had to double the size of our operations staff just because we’d be adding more single points of failure.”
Conclusion

There is no argument that Oracle’s MySQL is a good RDBMS, and one that serves well the use cases for which it was originally designed. But for IT professionals who are either planning new big data applications or have existing MySQL systems that have begun to break down under big data workloads, a move to DataStax Enterprise and Cassandra makes both business and technical sense.

Switching to a modern big data platform like DataStax Enterprise will future-proof any application, and provides confidence that the system will scale and perform well both now and into a demanding future.

For more information on DataStax Enterprise and Cassandra, visit www.datastax.com. For downloads of DataStax Enterprise (which may be freely used for development purposes), visit http://www.datastax.com/download/enterprise.

About DataStax

DataStax offers products and services, based on the popular open source database Apache Cassandra™, that solve today’s most challenging big data problems. DataStax Enterprise combines the performance of Cassandra with analytics powered by Apache Hadoop and enterprise search with Apache Solr, creating a smartly integrated big data platform. With DataStax Enterprise, real-time, analytic, and search workloads never conflict, giving you maximum performance with the added benefit of only managing a single database.

The company has over 170 customers, including leaders such as Netflix, Disney, Cisco, Rackspace, and Constant Contact, and spans verticals including web, financial services, telecommunications, logistics, and government.

DataStax is backed by industry-leading investors, including Lightspeed Venture Partners and Crosslink Capital, and is based in San Mateo, CA. For more information, visit www.datastax.com.
Appendix A – FAQ on Switching from MySQL to DataStax Enterprise/Cassandra

This appendix supplies answers to frequently asked questions about migrating from Oracle’s MySQL to DataStax Enterprise/Cassandra.

Do I lose transaction support when moving from MySQL to Cassandra?

MySQL’s InnoDB supplies ACID transaction support, whereas Cassandra provides AID transaction support. The “C” or consistency part of transaction support does not apply to Cassandra, as there is no concept of referential integrity or foreign keys in a NoSQL database. There is also no concept of commit/rollback in Cassandra. Batch operations are supported in Cassandra via the BATCH option in CQL.

What type of data consistency does DataStax Enterprise/Cassandra support?

DataStax Enterprise and Cassandra support “tunable data consistency.” This type of consistency is the kind represented by the “C” in the CAP theorem, which concerns distributed systems.

Cassandra extends the concept of “eventual consistency” in NoSQL databases by offering tunable consistency. For any given read or write operation, the client application decides how consistent the requested data should be.

Consistency levels in Cassandra can be set on any read or write query. This allows application developers to tune consistency on a per-query/operation basis depending on their requirements for response time versus data accuracy. Cassandra offers a number of consistency levels for both reads and writes.

What parts of my MySQL database cannot be migrated to DataStax Enterprise/Cassandra?

Schema, data, and general indexes may be migrated, but objects that currently cannot be migrated include:

• Stored procedures
• Views
• Triggers
• Functions
• Security privileges
• Referential integrity constraints
• Rules
• Partitioned table definitions

Do I need to use a MySQL caching layer (like memcached) with Cassandra?

No. Cassandra negates the need for extra software caching layers like memcached through its distributed architecture, fast write throughput capabilities, and internal memory caching structures. When you want more memory cache available to your cluster, you simply add more nodes and Cassandra handles the rest for you.
Is data absolutely safe in Cassandra?

Yes. First, data durability is fully supported in Cassandra, so any data written to a database cluster is first written to a commit log, in the same fashion as in nearly every popular RDBMS.

Second, Cassandra offers tunable data consistency. This means a developer or administrator can choose how strong they wish consistency across nodes to be. The strongest form of consistency is to mandate that any data modification be made on all replica nodes, with any unsuccessful attempt on a node resulting in a failed data operation. With that setting, Cassandra provides consistency in the CAP sense, in that all readers will see the same values.

How is data written and stored in Cassandra?

Cassandra has been architected for consuming large amounts of data as fast as possible. To accomplish this, Cassandra first writes new data to a commit log to ensure it is safe. After that, the data is written to an in-memory structure called a memtable. Cassandra deems the write successful once it is stored in both the commit log and a memtable, which provides the durability required for mission-critical systems.

Once a memtable’s memory limit is reached, its contents are flushed to disk in the form of an SSTable (sorted string table). An SSTable is immutable, meaning it is not written to ever again. If data contained in an SSTable is modified, the new data is written to Cassandra in an upsert fashion, and the previous data is removed automatically later, during compaction.

Because SSTables are immutable and only written once the corresponding memtable is full, Cassandra avoids random seeks and instead performs only sequential I/O in large batches, resulting in high write throughput.

A related factor is that Cassandra doesn’t have to do a read as part of a write (i.e., check an index to see where the current data is). This means that insert performance remains high as data size grows, while with b-tree based engines (e.g., MongoDB) it deteriorates.
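The write path described above (commit log, then memtable, then an immutable SSTable on flush) can be sketched as a toy model in Python. This is an illustration of the sequence only, not how Cassandra is implemented: the class and attribute names are hypothetical, lists and tuples stand in for files on disk, the memtable limit is counted in keys rather than bytes, and compaction and commit log recycling are omitted.

```python
class ToyWritePath:
    """Toy model of a log-structured write path (not real Cassandra code)."""

    def __init__(self, memtable_limit: int):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # immutable sorted snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, value))
        # 2. Apply to the memtable; no read of existing data is needed.
        self.memtable[key] = value
        # 3. Flush once the memtable reaches its (toy) size limit.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Newest data wins: check the memtable, then SSTables newest first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for stored_key, value in sstable:
                if stored_key == key:
                    return value
        return None


store = ToyWritePath(memtable_limit=2)
store.write("a", 1)
store.write("b", 2)   # memtable is full, so it is flushed to an SSTable
store.write("a", 3)   # upsert: the new value lives in the memtable
print(store.read("a"), store.read("b"), store.sstables)
```

Note that after the upsert, the old value of "a" still sits in the first SSTable. In this toy model nothing ever removes it. In a real system, superseded values are discarded when SSTables are merged.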
What kind of query language is provided in Cassandra? Is it like SQL in MySQL?

Cassandra supplies the Cassandra Query Language (CQL), which is very SQL-like. Queries are done via the standard SELECT command, while DML operations are accomplished via the familiar INSERT, UPDATE, DELETE, and TRUNCATE commands. DDL commands such as CREATE are used to create new keyspaces and column families.

Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOINs, for example.
Appendix B – Comparing MySQL and DataStax Enterprise/Cassandra

This technical appendix provides brief comparisons between Oracle’s MySQL and DataStax Enterprise/Cassandra.

General Comparisons

Feature/Function | MySQL | DataStax Enterprise/Cassandra
Platform support | Linux, Windows, Solaris, Unix, Mac | Linux, Windows, Mac
Data model | Relational/tabular | Google Bigtable
Primary data object | Table | Column family
Data variety support | Primarily structured | Structured, semi-structured, unstructured
Data partitioning/sharding model | Manual in general MySQL; automatic in MySQL Cluster | Automatic
Logical database container | Database | Keyspace
Indexes | Primary, secondary, clustered, full-text | Primary, secondary
Distribution architecture | Master/slave or synchronous replication with MySQL Cluster | Peer-to-peer with replication
CAP consistency model | Synchronous with MySQL Cluster | Tunable consistency per operation
Multi-data center support | Basic | Multi-data center and cloud, with rack awareness
Transaction support | ACID | AID (no “C” as there are no foreign keys)
Memory usage model | General caches, query cache, main memory option with MySQL Cluster | Distributed object/row caches across all nodes in a cluster
Core language | SQL | CQL
Primary query utilities | mysql command line client | CQL shell; CLI
Development language support | Many (e.g., Java, Python) | Many (e.g., Java, Python)
Large data volume support (TBs+) | Low TBs done with third-party storage engines | Native, TB-PB support
Data compression | Built into some storage engines | Built in
Analytic support | Some analytic functions | Done via Hadoop (MapReduce, Hive, Pig, Mahout)
Search support | Full-text indexes | Done via Solr integration
Geospatial support | Spatial extensions | Done via Solr integration
Logging (e.g., web, application) data support | Nothing built in | Handled via log4j
Mixed workload support | Must separate/ETL data between OLTP, analytic, search | All handled in one cluster with built-in workload isolation
Backup/recovery | Online, point-in-time restore | Online, point-in-time restore
Enterprise management/monitoring | MySQL Enterprise Monitor | DataStax OpsCenter
Datatype Comparisons

CQL datatype | MySQL datatype | Description
blob | blob | Arbitrary bytes, expressed as hexadecimal (no validation)
ascii | char | US-ASCII character string
text, varchar | text, varchar | UTF-8 encoded string
varint | numeric | Arbitrary-precision integer
int | int | 4-byte signed integer
bigint | bigint | 8-byte signed integer
uuid | None | Type 1 or type 4 UUID
timestamp | timestamp | Date plus time, encoded as 8 bytes since epoch
boolean | bit | True or false
float | float | 4-byte floating point
double | double | 8-byte floating point
decimal | decimal | Variable-precision decimal
counter | None | Distributed counter value (8 bytes long)
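During a migration, the datatype table above can double as a lookup when translating MySQL column definitions. The Python sketch below simply encodes the rows of that table that have a MySQL counterpart. The dictionary and function names are hypothetical, and the parsing is deliberately naive: it strips a length such as `(255)` and ignores modifiers like `unsigned`.

```python
# MySQL type -> CQL type, copied from the rows of the comparison table.
# uuid and counter are omitted because the table lists no MySQL equivalent.
MYSQL_TO_CQL = {
    "blob": "blob",
    "char": "ascii",
    "text": "text",
    "varchar": "varchar",
    "numeric": "varint",   # per the table; suits integer-valued columns only
    "int": "int",
    "bigint": "bigint",
    "timestamp": "timestamp",
    "bit": "boolean",
    "float": "float",
    "double": "double",
    "decimal": "decimal",
}


def cql_type_for(mysql_type: str) -> str:
    """Map a MySQL column type such as 'VARCHAR(255)' to a CQL type."""
    # Drop any length/precision suffix and normalize case.
    base = mysql_type.strip().lower().split("(")[0].strip()
    try:
        return MYSQL_TO_CQL[base]
    except KeyError:
        raise ValueError(f"no CQL mapping listed for MySQL type: {mysql_type}") from None


print(cql_type_for("VARCHAR(255)"))
print(cql_type_for("bit"))
```

Types that are missing from the table (for example, MySQL date or enum columns) raise an error here on purpose, so that each one gets an explicit decision instead of a silent default.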