2. Abstract
Alternative data persistence technologies like NoSQL emerged more than
10 years ago, but we developers hesitate to broaden our horizons and embrace
these new approaches. Why should we?
Relational databases have dominated the IT industry for a long time and have
served us very well. Everybody knows SQL and is used to the relational data
model with all its advantages and disadvantages.
But those who look beyond these borders will find a wealth of
NoSQL technologies and products.
Every product has its own properties and characteristics. How can we
differentiate them? Is it all about smart decisions, or do we have more
possibilities? We will enter the world of NoSQL and explain the different kinds
of NoSQL products, when to use them, and what polyglot persistence is all
about.
6. Michael Lehmann @lehmamic
Senior Software Engineer @Zühlke since 2012
.Net enterprise and cloud applications
Roman Kuczynski @qtschi
Senior Software Engineer @Zühlke since 2011
Data(base) architectures, BI and Big Data
17. Scaling by replication
[Diagram: the records "Roger Federer" and "N. Djokovic" each appear on several nodes]
18. Impedance mismatch using relational databases
public class BlogPost
{
    public int Id { get; set; }
    public string Content { get; set; }
    public List<string> Tags { get; set; }
}
19. Design for the relational model
public class BlogPost
{
    public int Id { get; set; }
    public List<Tag> Tags { get; set; }
}
public class Tag
{
    public int Id { get; set; }
    public BlogPost BelongsTo { get; set; }
    public string Name { get; set; }
}
BlogPost
- Id (int)
- Content (varchar)
Tag
- Id (int)
- BlogPostId (int)
- Name (varchar)
20. NoSQL databases increase productivity
var post = new BlogPost
{
    Id = 1,
    Content = "Any text content",
    // Tags is declared as List<string>, so a list (not an array) is assigned
    Tags = new List<string> { "NoSQL", "Cloud", "PolyglotPersistence" }
};
collection.Insert(post);
39. The graph data model
Node [1]: Name = ‘John’
Node [2]: Name = ‘Sara’
Node [5]: Name = ‘Joe’
Node [3]: Name = ‘Maria’
Node [4]: Name = ‘Steve’
[Diagram: the nodes are connected by four edges labelled "friend"]
55. Common data tier design
Presentation
Domain
DAL
Resources RDBMS
Search
Transactions
Caching
Blobs
Triggers
Reporting
User Interface
Relational-Object / Object-Relational
60. Resources
NoSQL Distilled
Authors: Martin Fowler, Pramod J. Sadalage
ISBN: 978-0321826626
Making Sense of NoSQL
Authors: Dan McCreary, Ann Kelly
ISBN: 978-1617291074
Links
http://nosql-database.org/
http://en.wikipedia.org/wiki/NoSQL
Do you know this? Put your hands up in the air! Almost everybody is comfortable with SQL. (Roman)
We developers are familiar with relational databases and we know what we get when we use them. We are used to:
- SQL (Structured Query Language), which is a standard
- ACID operations (transactions)
- A relational schema that implies referential integrity, with its constraints (PK, FK) to ensure data integrity
- Data consistency that must be ensured
SQL is our sacred cow (untouchable, established and accepted). (Roman)
Today relational databases such as SQL Server are a de facto standard. We choose between products (SQL Server, Oracle, MySQL...), not between technologies, for (almost) every data persistence solution. This reminds us of the Swiss Army knife that can be used for everything. But what would you say if you were building a house and the electrician arrived with nothing but a Swiss Army knife? (Michael)
Before we dive deeper, let’s introduce ourselves.
Michael has been a senior software engineer at Zühlke since 2012 and focuses on enterprise and cloud application development in .NET.
Roman has been a senior software engineer at Zühlke since 2011 and focuses on data(base) architectures and technologies, including BI and Big Data.
Agenda: for the next 40 minutes we are going to talk about:
- A short briefing on alternative NoSQL technologies
- What it means to use NoSQL databases in a polyglot persistence environment
(Both)
Of course you may ask: why should we choose anything other than a SQL database? In almost every case we are fine using SQL. What reason do we have to use another technology for which everybody has to learn something new? (Michael)
You are right to challenge our suggestion to use technologies other than SQL. But nowadays the circumstances are different from a few years ago (with cloud computing and Big Data we have..). There are some business drivers that challenge the RDBMS. We didn’t invent those business drivers. Let’s have a look at the most relevant ones. (Michael)
One of the main drivers is Big Data. Hence, our first driver is volume:
- Amount of data: from MB to TB to PB
- More users, applications or devices accessing the data
- RDBMSs have reached their physical or financial limits; we need scalable and affordable solutions
(Roman)
Our second driver is velocity:
- Increasing frequency of incoming data
- We always want to work with the “newest”, up-to-date data without delays (think of a tweet arriving too late)
- From daily data to “real time”
- The speed at which data arrives and leaves increases; in other words, the whole data lifecycle is accelerated
(Roman)
Another aspect is variability:
- Data originates from different devices, from users, or from other sources in general (mobile data, web content, sensors…)
- The data is often unstructured; the share of structured data is relatively small
(Roman)
Agility is our last driver and, besides Big Data, the second main aspect:
- Productivity of development work
- Responsiveness to change
- Time to market and low costs
(Roman)
All these business drivers challenge relational databases and put pressure on them. Let’s see what options we have to counter those problems. (Michael)
NoSQL databases emerged in the market to meet those business drivers. Hence, the characteristics of NoSQL databases address these problems. There are 2 main reasons to use NoSQL:
- Scaling
- Productivity
(Michael)
Why scaling? First of all, most NoSQL technologies are built to scale out well. Scaling out means that we build a database cluster based on (cheap) commodity hardware instead of spending all the money on a single-box solution. Hence, we gain availability, better performance due to load balancing, and capacity. (Michael)
There are two techniques for scaling. With sharding we distribute our data across different nodes: e.g. customers [a-m] on node #1 and customers [n-z] on node #2. Some NoSQL databases even provide auto-sharding, which means we can simply add new nodes to the cluster and the database will redistribute the data. (Michael)
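The range-based distribution from the example can be sketched in a few lines of C#. This is only an illustration under stated assumptions, not how any particular NoSQL product routes requests: the `ShardRouter` class, the two node numbers and the split after the letter 'm' are made up for this sketch.

```csharp
using System;

// Hypothetical router: maps a customer name to the node that owns its key range.
public static class ShardRouter
{
    // Assumption: exactly two nodes, split by the first letter of the customer name.
    public static int NodeFor(string customerName)
    {
        if (string.IsNullOrEmpty(customerName))
            throw new ArgumentException("Customer name must not be empty.", nameof(customerName));

        char first = char.ToLowerInvariant(customerName[0]);

        // [a-m] (and anything sorting before 'a') goes to node #1, the rest to node #2.
        return first <= 'm' ? 1 : 2;
    }
}

public static class Program
{
    public static void Main()
    {
        foreach (var name in new[] { "Anderson", "Miller", "Nadal", "Zimmermann" })
            Console.WriteLine($"{name} -> node #{ShardRouter.NodeFor(name)}");
    }
}
```

A fixed split like this one does not rebalance when a node is added; redistributing the key ranges is exactly the work that auto-sharding databases take over.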
There are two techniques for scaling. With replication we duplicate our data on different nodes. This gives us better failover and increases performance through parallel access. (Michael)
As developers, we want to store our in-memory object structures. Using relational databases, we have to map our objects to a relational model. Here we talk about the so-called impedance mismatch. A simple example is a list of strings.
In general, NoSQL databases are schemaless, which gives us an advantage in development productivity:
- Usually data is stored in XML or JSON format. This allows us to store our objects in a straightforward way.
- The data format is version tolerant. If we change our data structures in code (e.g. we add or remove a property), the NoSQL database does not care. In a relational database we have to update the schema and migrate the data as well.
- Schemaless does not mean having no schema. We do have an implicit schema, but it is not enforced by the database!
(Second slide: schemaless and version tolerant.) (Michael)
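Version tolerance can be illustrated with plain JSON serialization. The sketch below uses `System.Text.Json` as a stand-in for whatever serialization a concrete document database performs; the `Rating` property and the two sample documents are assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Current version of the class: "Tags" was added after the first documents were stored.
public class BlogPost
{
    public int Id { get; set; }
    public string Content { get; set; }
    public List<string> Tags { get; set; }
}

public static class Program
{
    public static void Main()
    {
        // Old document: written before "Tags" existed.
        string oldDoc = "{\"Id\":1,\"Content\":\"First post\"}";

        // Newer document: contains a property ("Rating") that this class does not know.
        string newDoc = "{\"Id\":2,\"Content\":\"Second post\",\"Tags\":[\"NoSQL\"],\"Rating\":5}";

        BlogPost a = JsonSerializer.Deserialize<BlogPost>(oldDoc);
        BlogPost b = JsonSerializer.Deserialize<BlogPost>(newDoc);

        // The implicit schema lives in the application, so the code must cope with missing data.
        Console.WriteLine(a.Tags == null ? "post 1: no tags stored" : "post 1: has tags");
        Console.WriteLine($"post 2: {b.Tags.Count} tag(s), unknown property ignored");
    }
}
```

Both documents load without any migration step. The price is that the application, not the database, has to handle the missing `Tags` value.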
Let’s talk about consistency. Scaling out, and the fact that no schema is present, both influence data consistency. First, having no schema means that the database cannot enforce data integrity. (Roman)
NoSQL databases generally don’t support ACID operations (transactions). Instead, they provide eventual consistency. (Roman)
Let’s illustrate this with an example: we want to book a hotel room and we see that the room is free. (Roman)
We book the room. On the ZH server the room is still free, because the update has not been processed there yet. (Roman)
Data is inconsistent for a short period of time. We call this the inconsistency window! As soon as the updates are processed on ZH, the data will be eventually consistent, provided that nobody else has booked the room on the ZH server before the updates were processed. (Roman)
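The inconsistency window can be made visible with a toy model of two replicas and asynchronous replication. This is a sketch only: the `Replica` and `Cluster` classes, the room name and the guest name are hypothetical, and real databases replicate in the background rather than through an explicit `Synchronize` call.

```csharp
using System;
using System.Collections.Generic;

// Toy replica: remembers which room is booked by which guest.
public class Replica
{
    private readonly Dictionary<string, string> bookings = new Dictionary<string, string>();

    public bool IsFree(string room) => !bookings.ContainsKey(room);
    public void Apply(string room, string guest) => bookings[room] = guest;
}

// Two replicas; updates are accepted locally and shipped to ZH later.
public class Cluster
{
    public Replica Local { get; } = new Replica();
    public Replica Zurich { get; } = new Replica();

    // Updates that the ZH replica has not seen yet.
    private readonly Queue<(string Room, string Guest)> pending = new Queue<(string Room, string Guest)>();

    public void BookLocally(string room, string guest)
    {
        Local.Apply(room, guest);
        pending.Enqueue((room, guest));
    }

    // Runs "later" and closes the inconsistency window.
    public void Synchronize()
    {
        while (pending.Count > 0)
        {
            var update = pending.Dequeue();
            Zurich.Apply(update.Room, update.Guest);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        const string room = "Room 12";
        var cluster = new Cluster();
        cluster.BookLocally(room, "Alice");

        // Inside the inconsistency window the replicas disagree.
        Console.WriteLine($"local free: {cluster.Local.IsFree(room)}, ZH free: {cluster.Zurich.IsFree(room)}");

        cluster.Synchronize();

        // After synchronization both replicas agree again.
        Console.WriteLine($"local free: {cluster.Local.IsFree(room)}, ZH free: {cluster.Zurich.IsFree(room)}");
    }
}
```

The model deliberately ignores the conflict case: it only works as long as nobody books the same room on ZH before `Synchronize` runs.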
BUT: what happens if someone else has booked the room on ZH before synchronization? (Roman)
To avoid conflicts:
- We wait for a commit from ZH before completing the update, or
- We define one server as the master, and only the master accepts changes.
BUT: in case of such a conflict, do we really need strong consistency? What matters more: avoiding conflicts, or avoiding the risk of losing customers? (Roman)
We may handle conflicts on the business side, for example with a discount or with spare rooms. (Roman)
To avoid conflicts, we have to accept “latency time”. We have to reconsider when strong consistency is really required. Hence, we have to balance performance against strong consistency. (Roman)
At the beginning we spoke about the Swiss Army knife. Now we want to discover the tools available to build our solution. Let’s open the toolbox. (Roman)
We have a great variety of NoSQL database products, for example Riak, MongoDB, Cassandra, HBase, Neo4j, etc. (Roman)
There are so many products that there is no way to know all these databases in detail. But we can classify them by their main characteristics. By now, 4 well-known NoSQL categories have become generally accepted in the NoSQL community. (Michael)
We start with the simplest databases, the so-called key-value stores, which were originally developed for distributed caches (e.g. for web sites):
- Data is stored and accessed by a unique key, comparable to a C# dictionary.
- The value is just a bucket (or a set) of arbitrary data, which generally cannot be queried.
- Reading and writing data is very fast.
- They are easy to scale.
Typical products: Memcached, Redis, Riak. (Michael)
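The comparison with a C# dictionary can be taken literally. The following sketch wraps a `Dictionary<string, byte[]>` in a hypothetical `KeyValueStore` class; it is an in-memory stand-in and not the API of Memcached, Redis or Riak. The key names are made up as well.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Minimal in-memory stand-in for a key-value store (hypothetical API).
public class KeyValueStore
{
    private readonly Dictionary<string, byte[]> buckets = new Dictionary<string, byte[]>();

    public void Put(string key, byte[] value) => buckets[key] = value;

    // Returns false when the key is unknown; the value itself is opaque to the store.
    public bool TryGet(string key, out byte[] value) => buckets.TryGetValue(key, out value);

    public bool Delete(string key) => buckets.Remove(key);
}

public static class Program
{
    public static void Main()
    {
        var store = new KeyValueStore();
        store.Put("session:42", Encoding.UTF8.GetBytes("{\"user\":\"lehmamic\"}"));

        if (store.TryGet("session:42", out var raw))
            Console.WriteLine(Encoding.UTF8.GetString(raw));

        // Without knowing the key there is no way to find a value by its content.
        Console.WriteLine(store.TryGet("session:99", out _) ? "hit" : "miss");
    }
}
```

Because the store only ever looks at the key, it can be partitioned and replicated easily, which is one reason these databases are fast and simple to scale.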
The most widely used NoSQL databases are document stores. Typical products: CouchDB, MongoDB, RavenDB (written in .NET!) and OrientDB (a hybrid that is also a graph database). As the name implies, these databases store the data as documents. The whole document is a serialized object tree (an aggregate), which makes this kind of database very intuitive and easy to work with. (Michael)
Here is an example of such a document.
- Generally the documents are stored in XML or JSON (BSON) format.
- These databases support queries, so we can search for the value of a given property in hierarchical documents, and we can define indexes on data fields to optimise the queries.
(Michael)
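To show what querying a property and indexing a data field mean, here is a toy document collection in C#. It is a sketch under assumptions: `PostCollection` and its methods are hypothetical and in-memory only, each document id is inserted once, and real document stores maintain their indexes very differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class BlogPost
{
    public int Id { get; set; }
    public string Content { get; set; }
    public List<string> Tags { get; set; } = new List<string>();
}

// Toy document collection with a secondary index on the "Tags" field (hypothetical API).
public class PostCollection
{
    private readonly Dictionary<int, BlogPost> documents = new Dictionary<int, BlogPost>();
    private readonly Dictionary<string, HashSet<int>> tagIndex = new Dictionary<string, HashSet<int>>();

    // Assumption: every id is inserted only once, so the index never holds stale entries.
    public void Insert(BlogPost post)
    {
        documents[post.Id] = post;
        foreach (var tag in post.Tags)
        {
            if (!tagIndex.TryGetValue(tag, out var ids))
                tagIndex[tag] = ids = new HashSet<int>();
            ids.Add(post.Id);
        }
    }

    // Query without an index: every document has to be inspected.
    public List<BlogPost> FindByTagScan(string tag) =>
        documents.Values.Where(p => p.Tags.Contains(tag)).ToList();

    // Query with the index: only the matching ids are looked up.
    public List<BlogPost> FindByTagIndexed(string tag) =>
        tagIndex.TryGetValue(tag, out var ids)
            ? ids.Select(id => documents[id]).ToList()
            : new List<BlogPost>();
}

public static class Program
{
    public static void Main()
    {
        var posts = new PostCollection();
        posts.Insert(new BlogPost { Id = 1, Content = "Any text content", Tags = { "NoSQL", "Cloud" } });
        posts.Insert(new BlogPost { Id = 2, Content = "Another post", Tags = { "Cloud" } });

        Console.WriteLine(posts.FindByTagScan("Cloud").Count);
        Console.WriteLine(posts.FindByTagIndexed("NoSQL").Count);
    }
}
```

Both query methods return the same documents; the index only changes how much data has to be touched to find them.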
Column-family stores come closest to the tables of relational databases. They are also known as wide-column databases or Bigtable databases. Typical products: Cassandra, HBase, Hypertable. (Roman)
Column-family stores are semi-structured:
- You can think of a column-family store as one huge table with lots of columns.
- These columns are organised in so-called “column families”, which are roughly comparable to tables in an RDBMS.
- Data is stored as rows and accessed by a row key and the column name.
- Not every row has to contain the same columns.
- Moreover, columns can be dynamically added to or removed from a row.
(Roman)
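The access pattern "row key plus column name" and the fact that rows may carry different columns can be modelled with nested dictionaries. The `ColumnFamily` class, the row keys and the column names below are assumptions for illustration; this is not the data model of Cassandra or HBase in detail.

```csharp
using System;
using System.Collections.Generic;

// Toy column family: row key -> (column name -> value). Hypothetical, in-memory only.
public class ColumnFamily
{
    private readonly Dictionary<string, SortedDictionary<string, string>> rows =
        new Dictionary<string, SortedDictionary<string, string>>();

    // Adds or overwrites a single column; new columns need no schema change.
    public void Put(string rowKey, string column, string value)
    {
        if (!rows.TryGetValue(rowKey, out var columns))
            rows[rowKey] = columns = new SortedDictionary<string, string>();
        columns[column] = value;
    }

    // Returns null when the row or the column does not exist.
    public string Get(string rowKey, string column) =>
        rows.TryGetValue(rowKey, out var columns) && columns.TryGetValue(column, out var value)
            ? value
            : null;

    public IEnumerable<string> ColumnsOf(string rowKey) =>
        rows.TryGetValue(rowKey, out var columns)
            ? (IEnumerable<string>)columns.Keys
            : Array.Empty<string>();
}

public static class Program
{
    public static void Main()
    {
        var users = new ColumnFamily();
        users.Put("user:1", "name", "John");
        users.Put("user:1", "email", "john@example.com");
        users.Put("user:2", "name", "Sara");
        users.Put("user:2", "twitter", "@sara"); // a column that user:1 does not have

        Console.WriteLine(string.Join(", ", users.ColumnsOf("user:1")));
        Console.WriteLine(string.Join(", ", users.ColumnsOf("user:2")));
        Console.WriteLine(users.Get("user:2", "email") ?? "(no such column)");
    }
}
```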
Last but not least, graph databases are a bit exotic. Typical products: Neo4j, InfiniteGraph, OrientDB. (Michael)
Relational databases and the NoSQL categories we have discussed so far are not strong at modelling complex relationships. Imagine you have a graph like this (see example). How would you model and query the nodes and relationships in SQL? You see there are some limitations. Queries that traverse a graph would decrease performance drastically. With an RDBMS, the data model defines how the relationships can be queried.
Graph databases take a different architectural approach and focus on the relationships in the data. Data is modelled as a graph with nodes and edges. Edges are the relationships between the nodes and can contain data as well. Special query languages like Cypher for Neo4j allow us to traverse the graph intuitively. Example: see illustration. (Michael)
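A traversal such as "friends of friends" shows why a graph model fits this kind of question. The sketch below is plain C# with an adjacency list, not Cypher and not a Neo4j API. The five names come from the graph slide, but the transcript does not preserve which nodes are connected, so the four friendships used here are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Toy undirected graph of people connected by "friend" edges.
public class FriendGraph
{
    private readonly Dictionary<string, HashSet<string>> friends = new Dictionary<string, HashSet<string>>();

    public void AddFriendship(string a, string b)
    {
        Neighbours(a).Add(b);
        Neighbours(b).Add(a);
    }

    // Returns the neighbour set of a node, creating an empty one for unknown nodes.
    private HashSet<string> Neighbours(string name)
    {
        if (!friends.TryGetValue(name, out var set))
            friends[name] = set = new HashSet<string>();
        return set;
    }

    // Friends of friends: people exactly two "friend" hops away from the start node.
    public SortedSet<string> FriendsOfFriends(string start)
    {
        var direct = Neighbours(start);
        var result = new SortedSet<string>();
        foreach (var friend in direct)
            foreach (var candidate in Neighbours(friend))
                if (candidate != start && !direct.Contains(candidate))
                    result.Add(candidate);
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        var graph = new FriendGraph();
        // Assumed edges; the slide only shows that four "friend" edges exist.
        graph.AddFriendship("John", "Sara");
        graph.AddFriendship("John", "Joe");
        graph.AddFriendship("Sara", "Maria");
        graph.AddFriendship("Joe", "Steve");

        Console.WriteLine(string.Join(", ", graph.FriendsOfFriends("John"))); // Maria, Steve
    }
}
```

In a relational model the same question needs a self-join on a friendship table for every additional hop, which is the performance problem the notes refer to.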
After this excursion, we now have a variety of databases to store our data: not only NoSQL databases, but also relational databases. Yes, they still have a right to exist, and they have a place in our tool belt. Examples:
- Facebook, eBay: Cassandra
- CERN, for the ATLAS detector of the Large Hadron Collider: Cassandra
- Forbes.com, for articles: MongoDB
- Salesforce Marketing Cloud: MongoDB
- Adobe, HP: Neo4j
RDBMSs can still be used! (Michael)
But now, which database is the best one for my application? All of the databases have their advantages and disadvantages in a given scenario. (Roman)
Do we really have to pick only one database? In any scenario, we have to take trade-offs into account. Actually, we want to use the most suitable database for every job. (Roman)
Why don’t we do this? We can use different databases in one application for different storage scenarios. Martin Fowler calls this polyglot persistence. (Michael)
Title: Use the right tool for the job. Scenario web shop:
- Caching: Redis (KV)
- Session storage: Redis (KV)
- Shopping cart: Redis (KV)
- Product catalog: RavenDB (Doc)
- Recommendation engine: Neo4j (Graph)
- Financial transactions: MS SQL Server (RDBMS)
- Reporting: MS SQL Server (RDBMS)
- Event logging: Cassandra (CF)
(Michael)
This is nice. But it introduces some new issues. (Roman)
First of all, we need people who:
- Have knowledge in developing with these databases (API, characteristics in detail, advantages and disadvantages)
- Have knowledge in operating these databases (administration, installing and upgrading, monitoring, backup and restore, performance tuning, storage management)
(Roman)
Besides the skills, polyglot persistence impacts our architecture. Using multiple databases in the same application (system) increases complexity in the code and in the architecture. We need to design the architecture to handle this complexity. (Roman)
In the past, SQL databases were used as so-called integration databases:
- Relational databases have a common, platform-independent interface: SQL.
- Multiple applications accessed the same database, so databases acted as integration platforms.
- The database was the master of the data, and the model ensured data quality (consistency, integrity).
With NoSQL databases, the circumstances are different:
- A NoSQL database cannot enforce the schema and strong consistency.
- This requires that the application becomes the master of the data and takes responsibility for schema and consistency.
(Roman)
Polyglot persistence does not fit the integration database idea: we no longer have a master database. This means we cannot ensure data quality across multiple databases if they are accessed by multiple applications. (Roman)
To overcome this problem, we need a central unit that is the master of the data. This is where application databases come into play: the application owns a database and is the master of its data. Hence, the database is accessed by exactly one application and by no other. (Roman)
In an enterprise environment we don’t have just one single application, but multiple applications using the same data. Since application databases don’t work with multiple applications, we need to extend that design! (Michael)
We all know the term SOA. A service acts as the application in the “application databases” scenario:
- The service is the master of the data and can ensure the schema and consistency.
- Multiple applications can access the service.
- They never access the data(base) directly!
(Michael)
We discussed the approach to ensuring data quality and schema with polyglot persistence. Now let’s have a look at how we can avoid a mess in our code. The key is to structure our code in layers. (Michael)
For this we use the well-known layered architecture:
- Presentation
- Business
- Resource Access (DAL)
- Resources (databases)
(Michael)
Usually, the data access layer contains all of the logic to access the data. As a result it is:
- Bloated, heavy and application-specific
- Not reusable
- Expensive to maintain
(Michael)
Example: reusability of a chair. Object-related vs. interface-related: the interface can be reused, not the object! Interfaces are reusable, objects are not. (Michael)
Especially with NoSQL databases we should aspire to small, reusable data services:
- Each one provides functionality for accessing only one database. Example: a caching service.
- Because these services provide not only data access but also (domain-independent) business logic, they are part of the business layer.
- Hence, the former bloated data access layer is split into several independent services and moved to the business layer.
- In fact, each service has its own DAL; that’s why we placed it across the layers.
- Data services are domain-independent. They don’t implement domain-specific logic!
(Michael)
We now have several independent data services:
- They can be used by multiple applications.
- Data services ensure data consistency and schema in the database they own.
- Data consistency across several data services is not guaranteed by the data services.
Example web shop: catalog, shopping cart, order system, recommendation engine, financial transactions.
A few minutes ago we talked about a central unit that ensures data consistency across several databases. To meet this requirement for our data services, we introduce business services:
- They provide the domain logic and use one or more data services.
- They are responsible for guaranteeing data consistency.
(Michael)
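The split between data services and business services can be sketched with two interfaces and one coordinating class. Everything here is hypothetical and chosen for the web shop example: the interface names, the `CheckoutService` and the in-memory implementations, which stand in for services backed by, say, a key-value store and a relational database.

```csharp
using System;
using System.Collections.Generic;

// Data services: domain-independent, each one owns exactly one database.
public interface ICartDataService   // could be backed by a key-value store
{
    IReadOnlyList<string> GetItems(string customerId);
    void Clear(string customerId);
}

public interface IOrderDataService  // could be backed by a relational database
{
    void SaveOrder(string customerId, IReadOnlyList<string> productIds);
}

// Business service: holds the domain logic and coordinates several data services.
public class CheckoutService
{
    private readonly ICartDataService carts;
    private readonly IOrderDataService orders;

    public CheckoutService(ICartDataService carts, IOrderDataService orders)
    {
        this.carts = carts;
        this.orders = orders;
    }

    public bool Checkout(string customerId)
    {
        var items = carts.GetItems(customerId);
        if (items.Count == 0)
            return false; // domain rule: empty carts cannot be ordered

        // Order first, clear second: if clearing fails, no order is lost.
        orders.SaveOrder(customerId, items);
        carts.Clear(customerId);
        return true;
    }
}

// In-memory stand-ins so the sketch runs without any database.
public class InMemoryCart : ICartDataService
{
    private readonly Dictionary<string, List<string>> carts = new Dictionary<string, List<string>>();

    public void Add(string customerId, string productId)
    {
        if (!carts.TryGetValue(customerId, out var items))
            carts[customerId] = items = new List<string>();
        items.Add(productId);
    }

    // Returns a copy so that clearing the cart later does not change a saved order.
    public IReadOnlyList<string> GetItems(string customerId) =>
        carts.TryGetValue(customerId, out var items) ? items.ToArray() : Array.Empty<string>();

    public void Clear(string customerId) => carts.Remove(customerId);
}

public class InMemoryOrders : IOrderDataService
{
    public List<string> Saved { get; } = new List<string>();

    public void SaveOrder(string customerId, IReadOnlyList<string> productIds) =>
        Saved.Add(customerId + ": " + string.Join(", ", productIds));
}

public static class Program
{
    public static void Main()
    {
        var cart = new InMemoryCart();
        var orders = new InMemoryOrders();
        cart.Add("c1", "book");
        cart.Add("c1", "pen");

        var checkout = new CheckoutService(cart, orders);
        Console.WriteLine(checkout.Checkout("c1")); // first checkout succeeds
        Console.WriteLine(checkout.Checkout("c1")); // cart is empty now
        Console.WriteLine(orders.Saved[0]);
    }
}
```

The ordering of the two calls in `Checkout` is one simple way for a business service to take responsibility for consistency across data services when no shared transaction is available; other strategies are possible.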
Relational databases give us confidence because they are established, proven and robust. But that does not mean that we should use them for everything. At the very beginning, we compared our affection for relational databases with a Swiss Army knife (one tool fits all). Now we have a toolbox full of individual tools to do our job like a professional. That means we have a variety of database technologies and products, we know their strengths, and we know how to use them. REMEMBER: use the right tool for the job! (Roman)
- NoSQL Distilled
- Making Sense of NoSQL
- http://www.datastax.com/documentation/gettingstarted/index.html
- http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/de//archive/bigtable-osdi06.pdf
- http://nosql-database.org/
- http://en.wikipedia.org/wiki/NoSQL
(Roman)