2. Before we start our journey, a bit about a bit, a byte, and lots of bytes.
• A bit (b) is short for binary digit, after the binary code (1 or 0) that computers use to store and process data.
• Binary means base 2, just as decimal means base 10.
• A byte (B) is the basic unit of computing, used to encode an English letter or number in computer code. One byte is equal to 8 bits.
Unit           Size
Bit (b)        1 or 0
Byte (B)       8 bits
Kilobyte (KB)  1,000 bytes (2^10 bytes)
Megabyte (MB)  1,000 KB (2^20 bytes)
Gigabyte (GB)  1,000 MB (2^30 bytes)
Terabyte (TB)  1,000 GB (2^40 bytes)
Petabyte (PB)  1,000 TB (2^50 bytes)
Exabyte (EB)   1,000 PB (2^60 bytes)
Zettabyte (ZB) 1,000 EB (2^70 bytes)
Yottabyte (YB) 1,000 ZB (2^80 bytes)
• One page of typed text is roughly 2 KB.
• All books catalogued in the US Library of Congress total around 15 TB.
• Google processes about 1 PB every hour.
• Monthly internet data flows at around 21 EB.
• The total amount of information in existence is around 1.2 ZB.
• A YB is currently too big to imagine (as per The Economist).
• The International Bureau of Weights and Measures sets the names of the prefixes.
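The prefix ladder above is easy to mechanize. Here is a minimal sketch (the function name and structure are my own, not from any library) that walks a byte count up the decimal-prefix ladder:

```python
# Sketch: convert a raw byte count into a human-readable size using the
# decimal (SI) prefixes from the table above. Purely illustrative.
PREFIXES = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_size(num_bytes: float, base: int = 1000) -> str:
    """E.g. 2_000 -> '2.00 KB'. Pass base=1024 for the binary ladder."""
    for prefix in PREFIXES:
        if num_bytes < base or prefix == PREFIXES[-1]:
            return f"{num_bytes:.2f} {prefix}"
        num_bytes /= base

print(human_size(2_000))         # a page of typed text: 2.00 KB
print(human_size(15 * 1000**4))  # the Library of Congress estimate: 15.00 TB
```

Note that the slide's table mixes the decimal definitions (1,000x steps) with the binary ones (2^10 steps); the `base` parameter lets you pick either convention.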
3. A perfect storm of forces is conspiring to generate a lot of data.
• Data storage costs are falling ($/TB over time)…
• …data-creating devices are growing (# of hosts over time)…
• …data processing costs are falling ($/GFLOPS over time)…
• …connectivity is growing (degree of connectivity over time)…
• …data moving costs are falling ($/Mbps over time)…while…
• …performance expectations are rising (speed of response over time).
(Figure: six trend charts converging on "Big Data": a large volume of data of rich variety at various speeds.)
Please note that the slopes of the various lines differ, but they are directionally correct.
4. Almost everything is instrumented, which means data is being generated in various formats, at various speeds, and in various volumes.
• Structured data (tables, records)
• Semi-structured data (XML and similar standards)
• Complex data (hierarchical or legacy sources)
• Event data (messages)
• Unstructured data (human language, audio, video)
• Social media data (blogs, tweets, social networks)
• Web logs and click streams
• Spatial data (long/lat, GPS)
• Machine-generated data (sensors, RFID, devices, server logs)
• Scientific data (genomes, proteomics, astronomy)
(Figure: the three Vs: Volume, Velocity, Variety.)
5. Now all this data is pure cost unless it is transformed into information from which insights can be drawn and the right actions taken to create or protect value.
• The information value chain depicts the various stages in the journey of data from its creation to its use:
Data → Information → Insights → Decisions → Action → Value
• At each stage of the value chain, the right mix of business processes, human skills, and technology capabilities is needed.
• Relational database management systems (RDBMS) date back to the early 1970s. RDBMS have worked well for transactional and structured data because this type of data can be stored in table form, with relationships between and amongst the tables. The query technology for RDBMS was developed at IBM (in San Jose) and was initially called SEQUEL (Structured English Query Language); it is now called SQL.
• As more of the data generated shifts from structured to other formats, the traditional methods of managing data become impractical.
• So here is what has happened in the management of data over time.
– Vertical scaling…bigger RDBMS machines…more disk space, more horsepower, big data centers.
– New methods, collectively called horizontal scaling, arrived as vertical scaling reached its limits from a data-volume standpoint…so came Massively Parallel Processing (MPP) machines.
– But then came unstructured data (variety) and streaming data (velocity), so what was needed was a whole new way to manage data…Big Data (BD).
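To make the "tables with relationships" idea concrete, here is a small sketch using Python's built-in sqlite3 module. The schema, table names, and data are invented for illustration; the point is that structured data maps naturally onto relations and joins:

```python
# Structured data in the relational model: two tables linked by a key,
# queried with SQL. Uses Python's built-in sqlite3; schema is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# The relationship between the tables is expressed as a join.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Ada', 15.0), ('Bob', 12.5)]
```

This is exactly the kind of workload RDBMS excel at; the trouble begins when the data no longer fits a fixed schema.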
6. How do RDBMS really work (for the most part)?
• Multiple interfaces
• Slow…disk drives need time to read and write
• Sequential
• Indexing is a big challenge
• Schema is not flexible
Data is generated in multiple channels → Data is stored in databases → Data is aggregated in data warehouses → Data is analyzed in analytical applications → Information is reported
• So the solution is to remove all these boxes (no pun intended) and get the analytics as close as possible to the data. Hence you hear terms like in-database analytics (analytics moving into the d/b) or in-memory analytics (the d/b moving into memory).
Data is generated via multiple channels → Data is stored, aggregated and analyzed on a single platform → Information is reported
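The contrast between "ship the data to the analytics" and "ship the analytics to the data" can be sketched in a few lines. This uses Python's built-in sqlite3; the table and figures are invented for illustration:

```python
# Illustrative contrast: aggregate in the application vs. in the database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (channel TEXT, value REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("web", 1.0), ("web", 2.0), ("mobile", 3.0)])

# Old pattern: move every row into the application, then analyze.
totals = {}
for channel, value in conn.execute("SELECT channel, value FROM events"):
    totals[channel] = totals.get(channel, 0.0) + value

# In-database pattern: push the aggregation into the engine; only the
# (much smaller) result set crosses the boundary.
in_db = dict(conn.execute(
    "SELECT channel, SUM(value) FROM events GROUP BY channel"))

print(totals == in_db)  # True
```

With three rows the difference is invisible; with billions of rows, moving only the aggregates instead of the raw data is the whole game.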
7. RDBMS cannot scale because their intrinsic constraints run up against a humbling rule: you cannot have everything in life, and you have to choose.
• RDBMS rely on the ACID principle
– Atomicity: all or nothing
– Consistency: every transaction takes the d/b from one valid state to another without impairing referential integrity
– Isolation: other operations cannot access data while a transaction is midstream
– Durability: the ability to recover from system failure
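The "all or nothing" of atomicity is easy to demonstrate with any transactional engine. A minimal sketch using Python's built-in sqlite3 (the schema and the simulated crash are invented for illustration):

```python
# Atomicity sketch: a transfer that fails midway is rolled back, leaving
# the database in its prior consistent state. Uses built-in sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # Debit one side of the transfer...
    conn.execute("UPDATE accounts SET balance = balance - 60 "
                 "WHERE name = 'alice'")
    # ...then simulate a crash before the matching credit is applied.
    raise RuntimeError("crash mid-transaction")
except RuntimeError:
    conn.rollback()  # atomicity: the half-done transfer is undone

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Guaranteeing this on one machine is routine; guaranteeing it across many machines is exactly where the CAP trade-off below bites.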
• Vertically scaled RDBMS do honor the ACID principle, but horizontally scaled RDBMS (MPP machines) cannot fully do so. This is the territory of the CAP Theorem, which says that a distributed RDBMS system can guarantee at most two of the following three:
– Consistency, which means every node sees the same data at the same time.
– Availability, which means a node failure does not prevent the surviving nodes from completing the task.
– Partition tolerance (the distributed part), which means the system continues to operate despite arbitrary message loss.
• The two bullets above mean that as you scale an RDBMS system you run into a wall…actually a cap!
8. Therefore, RDBMS are not good at performing all types of analysis.
• We need scalable database models that are not dependent on a fixed data schema.
(Figure: vertical scaling, horizontal scaling, and schema-agnostic scaling, mapped against volume, velocity, and variety growth; hence the need for a new data architecture.)
9. The rich variety of data has intruded, making data management a painnus posteriorus*.
• While the volume and velocity of the data are growing rapidly, it is the growing variety of data that is a complexity multiplier in the management of all these bits.
• RDBMS and MPP approaches have exhausted the ability of current architectures to process the torrent of bits flowing.
• Hence arrived what I call Big Data Architecture (BDA).
• BDA does not replace existing investments in data management; BDA complements them, so there is no need to rip-and-replace; it is more insert-and-augment.
(Figure: Volume vector…bad; Velocity vector…badder; Variety vector…baddest.)
• BDA started in companies that had BD,
essentially internet companies like Yahoo,
Google, Facebook, Amazon, Twitter, LinkedIn
that needed web-scale solutions to their data
problems. They built this from scratch
because there was nothing commercially
available.
• This revolution was called NOSQL (Not Only SQL).
• The "NO" means that it is a technology that works in addition to SQL, not instead of it.
• NOSQL databases were organically developed…they are essentially schema-agnostic…meaning that some of the constraints of SQL databases are deliberately relaxed.
*: painnus posteriorus is a contemporary acute discomfort of lower thoracic induced by unrelenting bit storms
10. NOSQL solves the complexity, volume and speed constraints of an SQL design by using four different data models.
• Key-value stores are a schema-less model for storing data.
• Big table clones are compressed, high-performance database systems based on the Google File System.
• Document databases are a method for storing semi-structured data.
• Graph databases use graph structures (nodes, edges, etc.) that provide index-free lookups.
NOSQL models:
• Key-value stores (based on Amazon Dynamo): Memcached, Dynamo, Voldemort, Tokyo Cabinet
• Big table clones (based on Google BigTable): HBase, Cassandra, HyperTable, AzureTS
• Document databases (based on Amazon Dynamo): Lotus Domino, CouchDB, MongoDB, Riak
• Graph databases (based on graph theory): AllegroGraph, VertexDB, Neo4J, Active RDF
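The key-value model is the simplest of the four, and a toy version makes the schema-agnostic point vividly. This sketch is purely illustrative (the class and key names are my own, and a real store would partition and replicate the data):

```python
# A toy key-value store: the store only understands get/put/delete on
# opaque values, so any shape of data can live under a key. No schema.
class KeyValueStore:
    def __init__(self):
        self._data = {}          # real systems: partitioned, replicated

    def put(self, key, value):
        self._data[key] = value  # no schema to validate against

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "tags": ["admin"]})  # a document
store.put("page:/home", "<html>...</html>")               # a raw blob
print(store.get("user:42")["name"])  # Ada
```

Notice that the two values stored have completely different shapes; an RDBMS would force both through a declared schema, a key-value store does not care.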
11. BDA is actually very effective.
• Yahoo tested BDA by calculating Pi to the 2,000,000,000,000,000th digit.
• It used 1,000 computers and the calculation took 23 days. This means 23,000 computing days.
• Using an RDBMS on a single PC, it would have taken about 500 years, which is roughly ~182,621 computing days. That is an ~87% improvement (using a very rough back-of-the-envelope calculation).
• So yes, BDA works.
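The back-of-the-envelope figure above checks out; the inputs are the slide's own (23 days on 1,000 machines versus ~500 years on one PC):

```python
# Verify the rough speedup arithmetic claimed above.
cluster_days = 1_000 * 23        # 23,000 computing days on the cluster
single_pc_days = 500 * 365.2425  # ~182,621 computing days in 500 years
improvement = 1 - cluster_days / single_pc_days
print(f"{improvement:.0%}")  # 87%
```

(Measured in wall-clock time rather than total computing days, the gain is far larger still: 23 days versus 500 years is a speedup of several thousand times.)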
12. BDA works by breaking a problem into pieces, analyzing each piece separately, and then aggregating the results into a single response.
• Hadoop is an instance of NOSQL that has two main parts: MapReduce and HDFS.
• MapReduce means mapping a problem to worker nodes and then aggregating (reducing) the results.
• HDFS (Hadoop Distributed File System) is the file management system that makes MapReduce work.
(Figure: a master node splits the problem into pieces 1…n for worker nodes in the map phase, then aggregates their results into a single response in the reduce phase.)
Example uses:
• Google searches
• Amazon recommendations
• PayPal real-time fraud detection
• Credit card unauthorized charges
• Loopt
• Directions from office to bar/pub…nearest vs. cheapest
• Genomics searching (needle-in-a-haystack)
• Zynga gaming
• Facebook Friends
• LinkedIn People-you-may-know (PYMK)
• GPS directions (as you drive)
• …
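The map/reduce pattern itself fits in a few lines. This is a single-process sketch of the idea, not Hadoop: each "worker" maps its piece independently, then the partial results are reduced into one answer (a word count, the classic teaching example):

```python
# Minimal map/reduce sketch: map each piece independently, then reduce
# the intermediate results into a single response.
from collections import Counter
from functools import reduce

def map_phase(piece: str) -> Counter:
    """Each 'worker' counts words in its own piece of the problem."""
    return Counter(piece.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    """Aggregate two partial results into one."""
    return a + b

pieces = ["big data is big", "data wants to be big"]  # problem, split up
partials = [map_phase(p) for p in pieces]             # map (parallelizable)
result = reduce(reduce_phase, partials, Counter())    # reduce
print(result["big"])  # 3
```

Because each `map_phase` call touches only its own piece, the map work can be farmed out to as many worker nodes as you have; Hadoop adds the scheduling, fault tolerance, and distributed storage (HDFS) around this same shape.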
13. What does the BDA landscape look like?
• It depends on what the need is, but here is a simple breakdown of the various elements. This is only illustrative.
• Data presentation: visualization / mobile / R
• Data processing: Hadoop (batch); S4, Storm (streaming); job tracker and task tracker
• Data query: Pig, Hive
• Processing scheduler: Azkaban, Oozie
• Database: Voldemort, Cassandra, HBase
• Data collection: Kafka, Flume, Scribe
• Coordination: ZooKeeper
• Displaying and monitoring logs: Chukwa
14. BDA does not mean you need to throw away your investments in traditional data analytics infrastructure.
• BDA works alongside the existing investments made by companies…not rip-and-replace!
(Figure: BDA feeding into the traditional BI infrastructure's reporting & distribution layer.)
15. Even NOSQL is getting challenged, but for now we got-to-dance-with-them-what-brung-you.
• Zynga needs an additional 1,000 servers every week for its data needs.
• Every search string you send to Google is divided and sent to 700-1,000 servers so that you can get your response back in microseconds and thus not waste a few seconds in which you could have destroyed civilization.
• YouTube serves 1 billion videos every day.
• 2.5 billion photos are uploaded each month to Facebook.
• ~150,000 zombie computers are created every day (used in botnets for sending spam).
• At the beginning of 2009 there were 187 million web sites; at the end of 2009 there were 234 million. That is 25% growth.
16. And what is next?
Big Data + Context + Interactivity =
20. New skills you should consider in the world of Big Data
– Cultivate expertise, but be a strong generalist
– Develop and grow relationships and networks
– Develop communication skills
– Refine presentation skills
– Read up, a lot
– Monitor the competition
– Understand the business, I mean really understand it
– Embrace ambiguity*
– Love the edge
– Step outside your comfort zone, frequently
– If you have the appetite, read a book or two on statistics
– Think laterally, which just means do not be afraid to connect the dots
* At a minimum, learn to accept ambiguity