Big Data using NoSQL Technologies
1. Big Data using NoSQL Technologies
Amit Kr. Singh
Senior Developer, Ericsson
December 14, 2012
2. My Background
Part of Java and Open Source Practice Area.
Driving technology initiatives in LockBox project.
Part of System-X development team.
Contributing in JOSP Competence Development & Training.
4. Big Data
Ericsson defines:
“People, devices and things are constantly generating massive volumes of data.
At work people create data, as do children at home, students at school, people
and things on the move, as well as objects that are stationary. Devices and
sensors attached to millions of things take measurements from their
surroundings, providing up-to-date readings over the entire globe – data to be
stored for later use by countless different applications.”
5. Big Data
IBM defines:
“Every day, we create 2.5 quintillion bytes of data — so much that 90% of the
data in the world today has been created in the last two years alone. This data
comes from everywhere: sensors used to gather climate information, posts to
social media sites, digital pictures and videos, purchase transaction records, and
cell phone GPS signals to name a few. This data is big data.”
6. Big Data
Wikipedia defines:
Big data is a collection of data sets so large and complex that it becomes difficult
to process using on-hand database management tools. The challenges include
capture, curation, storage, search, sharing, analysis and visualization.
7. Big Data
Why so many definitions? I am really confused.
8. Big Data
In simple words: a set of technology advances that have made capturing and
analyzing data at high scale and speed vastly more efficient.
11. Six insights from Facebook's former
Head of Big Data
Analytics on 900M users
25PB of compressed data – 125PB uncompressed.
New technologies have shifted the conversation from “what data to store” to
“what can we do with more data”.
Simplify data analytics for end users.
More users mean data analytics systems have to be more robust.
Social networking works for Big Data.
No single infrastructure can solve all Big Data problems.
Building software is hard, but running a service is even harder.
16. The Three Vs of Big Data
Volume – big data comes in one size: XXL. Available storage often cannot handle
these volumes.
Velocity – data needs to be used quickly to maximize business benefit before
the value of the information is lost.
Variety – data can be structured, unstructured, semi-structured or a mix of all
three. It comes in many forms, including text, audio, video, click streams and log
files.
17. Big Data Technologies
Big-data technologies are usually engineered from the bottom up with two things
in mind: scale and availability. Most solutions are distributed in nature and
introduce new programming models for working with large volumes of data.
Technologies such as Not only SQL (NoSQL), characterized by their non-
adherence to the RDBMS model, are used in a wide variety of industry
applications. These technologies have the flexibility to handle Big Data.
18. Scalability
Scalability refers to the ability of an application or product to increase in size as
demand warrants. The base concept is consistent – the ability of a business or
technology to accept increased volume without a loss of performance or
disruption to the business.
Scale horizontally (scale out)
Scale vertically (scale up)
19. Scalability
Scale vertically (scale up)
Extra capacity can be obtained by adding more hardware to a specific computer
or by moving applications to larger computers – a process known as vertical
scaling. One limitation of this approach is the risk of outgrowing the capacity of
the largest computer; this will eventually affect cost. Vendor lock-in is a potential
risk, and vertically scaled solutions can become prohibitively expensive.
20. Scalability
Scale horizontally (scale out)
Adding computers in parallel can also increase capacity. This approach is known
as horizontal scaling, and Big Data technologies tend to favor it because it
supports network expansion. Systems that are built in this way are more flexible,
and because commodity computers can be operated together in parallel, the risk
associated with single-vendor solutions is reduced. Horizontal scaling also fits
naturally with cloud deployment.
21. Availability
Availability is a guarantee that every request receives a response
indicating whether it succeeded or failed.
Users want their systems (Facebook, Twitter, Telecom app, etc) to be ready to
serve them at all times. If a user cannot access the system, it is said to be
unavailable. Generally, the term downtime is used to refer to periods when a
system is unavailable.
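The cost of downtime can be made concrete with a little arithmetic. The sketch below (illustrative only; the function name and the "nines" figures are mine, not from the slides) converts an availability guarantee into expected downtime per year:

```python
# Illustrative only: relate an availability guarantee to yearly downtime.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability fraction (e.g. 0.999)."""
    return MINUTES_PER_YEAR * (1.0 - availability)

# "Three nines" (99.9%) still allows almost nine hours of downtime per year;
# "four nines" (99.99%) allows under an hour.
print(round(downtime_minutes_per_year(0.999), 1))   # ~525.6 minutes
print(round(downtime_minutes_per_year(0.9999), 1))  # ~52.6 minutes
```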
22. NoSQL
What NoSQL databases can do:
Serve as an online processing database, so that it becomes the primary
datasource/operational datastore for online applications.
Use data stored in primary source systems for real-time, batch analytics, and
enterprise search operations.
Handle “big data” use cases that involve data velocity, variety, volume, and
complexity.
Excel at distributed database and multi-data center operations.
Offer a flexible schema design that can be changed without downtime or
service disruption.
Accommodate structured, semi-structured, and non-structured data.
Easily operate in the cloud and exploit the benefits of cloud computing.
23. Is NoSQL replacing the RDBMS?
The answer is both yes and no: the choice between the two depends on the
use case.
Most NoSQL databases do not provide full ACID transaction guarantees.
Applications that depend on transaction support (banking, airlines, etc.) will
continue to work with an RDBMS, while social media applications, which mostly
deal with unstructured data, will look at alternative NoSQL solutions. However,
a hybrid architecture, where the power of both RDBMS and NoSQL can be
leveraged, may prove beneficial as well.
24. Is NoSQL replacing the RDBMS?
However many enterprises are choosing to leave some legacy RDBMS systems
in place, while directing new development towards NoSQL databases. This is
especially the case when the applications in question demand high write
throughput, need flexible schema designs, process large volumes of data, and
are distributed in nature.
Technology aside, another reason many new development and/or migration
efforts are being directed towards NoSQL databases is the high cost of legacy
RDBMS licenses. In general, NoSQL software costs a fraction of what vendors
such as IBM and Oracle charge for their databases.
25. RDBMS & Big Data
Tactics to extend the useful scope of RDBMS technology
Sharding
Denormalizing
Distributed caching
26. Sharding
If the data for an application will not fit on a single server or, more likely, if a
single server is incapable of maintaining the I/O throughput required to serve
many users simultaneously, then a tactic known as sharding is frequently
employed.
Database sharding is the process of splitting up a database across multiple
machines to improve the scalability of an application.
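A minimal sketch of hash-based shard routing (the key format and shard count here are hypothetical, and a stand-in for whatever the application layer actually does):

```python
import zlib

SHARD_COUNT = 3  # e.g. three database servers

def shard_for(key: str) -> int:
    """Map a record key to a shard index using a stable hash.
    zlib.crc32 is used (rather than Python's built-in hash) so the
    mapping is deterministic across processes and restarts."""
    return zlib.crc32(key.encode("utf-8")) % SHARD_COUNT

# Every lookup for the same user always lands on the same shard,
# so the application knows which server to query.
assert shard_for("user:42") == shard_for("user:42")
print(shard_for("user:42"), shard_for("user:43"))
```

Note that the sharding logic lives in the application, not the database, which is exactly the weakness the next slide describes.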
27. Sharding
This does work to spread the load but there are some undesirable
consequences to the approach.
When you fill a shard, you have to change the sharding strategy in the
application itself. For example, placing user profile information on one database
server, friend lists on another and a third for user generated content like photos
and blogs. The main problem with this approach is that if the site experiences
additional growth then it may be necessary to further shard a feature specific
database across multiple servers.
You lose some of the most important benefits of the relational model. You can’t
do “joins” across shards. In addition, you can’t do cross-node locking when
making updates.
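The resharding pain is easy to demonstrate. With naive modulo hashing, growing the cluster from 3 to 4 shards forces most keys to migrate to a different server (the key set below is synthetic, chosen only to make the fraction visible):

```python
import zlib

def shard_for(key: str, shard_count: int) -> int:
    """Stable hash-based routing, as an application would implement it."""
    return zlib.crc32(key.encode("utf-8")) % shard_count

keys = [f"user:{i}" for i in range(1000)]

# Growing from 3 shards to 4: count how many keys change shards.
moved = sum(1 for k in keys if shard_for(k, 3) != shard_for(k, 4))
print(f"{moved} of {len(keys)} keys must be migrated")
```

Roughly three quarters of the keys end up on a different shard, which is why adding capacity to a hand-sharded RDBMS usually means a disruptive data migration.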
28. Denormalizing
Denormalization is the process of attempting to optimize the read performance of
a database by adding redundant data or by grouping data. In some cases,
denormalization is a means of addressing performance or improving the
scalability of relational database software.
Most of the time denormalization is application-specific and needs to be
re-evaluated if the application changes.
Denormalization can increase the size of tables.
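The trade-off can be sketched with in-memory dicts standing in for tables (the table and field names are invented for illustration):

```python
# Normalized: the author's name lives in one place; reading a post needs a join.
authors = {1: {"name": "Amit"}}
posts = [{"id": 100, "author_id": 1, "title": "Big Data"}]

def read_post_normalized(post):
    # Two lookups (a "join") per read.
    return {**post, "author_name": authors[post["author_id"]]["name"]}

# Denormalized: the author's name is copied into each post row.
# Reads become a single lookup, but every copy must be updated
# if the author renames - and the table grows with the redundancy.
posts_denorm = [{"id": 100, "author_id": 1,
                 "author_name": "Amit", "title": "Big Data"}]

assert read_post_normalized(posts[0])["author_name"] == posts_denorm[0]["author_name"]
```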
29. Distributed Caching
Another tactic used to extend the useful scope of RDBMS technology is to
employ distributed caching technologies, such as Memcached. Today,
Memcached is a key ingredient in the data architecture behind 18 of the top 20
largest (by user count) Web applications, including Google, Wikipedia, Twitter,
YouTube and Facebook.
Memcached “sits in front” of an RDBMS system, caching recently accessed data
in memory and storing that data across any number of servers or virtual
machines. When an application needs access to data, rather than going directly
to the RDBMS, it first checks Memcached to see if the data is available there; if it
is not, then the database is read by the application and stored in Memcached for
quick access next time it is needed.
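This read path is the classic cache-aside pattern. A minimal sketch, with plain dicts standing in for Memcached and the RDBMS (the key and counter are hypothetical, not a real Memcached client):

```python
# Cache-aside sketch: one dict stands in for Memcached, another for the RDBMS.
cache = {}
database = {"user:1": {"name": "Amit"}}
db_reads = 0  # counts round trips to the "database"

def get(key):
    """Check the cache first; on a miss, read the database and populate the cache."""
    global db_reads
    if key in cache:
        return cache[key]      # cache hit: no database round trip
    db_reads += 1
    value = database[key]      # cache miss: go to the RDBMS...
    cache[key] = value         # ...and store the result for next time
    return value

get("user:1")  # first read: misses the cache, hits the database
get("user:1")  # second read: served from memory
assert db_reads == 1
```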
31. Distributed Caching
Memcached and similar distributed caching technologies used for this purpose
are not magic and can even create problems of their own:
Memcached was designed to accelerate the reading of data by storing it in
main memory, but it was not designed to permanently store data. Memcached
stores data in memory. If a server is powered off or otherwise fails, or if memory
is filled up, data is lost.
Yet another tier to manage. It should be obvious that inserting another tier of
infrastructure into the architecture to address some (but not all) of the failings of
RDBMS technology in the modern interactive software use case can create its
own set of problems: more capital costs, more operational expense, more points
of failure and more complexity.
32. NoSQL Technologies
Sharding, denormalizing, distributed caching and other tactics are all attempts to
paper over one simple fact: RDBMS technology is a forced fit for modern
interactive software systems. Vendors of RDBMS technology have little
incentive to disrupt a product line generating billions of dollars for them annually.
Instead, a few application developers at Google (Bigtable) and Amazon (Dynamo)
took the initiative and developed the first NoSQL database technologies.
33. NoSQL Characteristics:
No schema required. Data can be inserted in a NoSQL database without first
defining a rigid database schema. As a corollary, the format of the data being
inserted can be changed at any time, without application disruption. This
provides immense application flexibility, which ultimately delivers substantial
business flexibility.
Auto-sharding. A NoSQL database automatically spreads data across servers,
without requiring applications to participate. Servers can be added or removed
from the data layer without application downtime. Most NoSQL databases also
support data replication, storing multiple copies of data across the cluster, and
even across data centers, to ensure high availability and support disaster
recovery.
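The "no schema required" point can be sketched with a plain list standing in for a document collection (the field names and values are invented for illustration):

```python
# A list stands in for a NoSQL document collection: no schema is declared up front.
collection = []

# Early records have one shape...
collection.append({"user": "amit", "email": "amit@example.com"})

# ...and a later release can add fields with no ALTER TABLE and no downtime.
collection.append({"user": "ravi", "email": "ravi@example.com",
                   "phone": "+91-00000-00000", "tags": ["beta"]})

# Old and new documents coexist; readers handle missing fields explicitly.
for doc in collection:
    print(doc["user"], doc.get("phone", "no phone on record"))
```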
34. NoSQL Characteristics:
Distributed query support. “Sharding” an RDBMS can reduce, or eliminate in
certain cases, the ability to perform complex data queries. NoSQL database
systems retain their full query expressive power even when distributed across
hundreds or thousands of servers.
Integrated caching. To reduce latency and increase sustained data throughput,
advanced NoSQL database technologies transparently cache data in system
memory. This behavior is transparent to the application developer and the
operations team, in contrast to RDBMS technology, where a caching tier is
usually a separate piece of infrastructure that must be developed against,
deployed on separate servers, and explicitly managed by the operations team.
35. Research activities in Big Data
The White House has recently announced a national "Big Data Initiative" for
improving the ability to extract knowledge and insights from large and complex
collections of digital data. This initiative will help the US government in scientific
discovery, environmental and biomedical research, education, and national
security.
NASA is working on a number of innovative approaches to advancing Big Data,
including the Lunar Mapping and Modeling Activity.