The Age of Exabytes: Tools & Approaches for Managing Big Data
WRITTEN BY AUDREY WATTERS
Contents
Introduction: The Rise and Scope of Big Data
Innovations in Storage
Storage: At the Chip Level
Storage: At the Data Center Level
Storage: Virtualization and the Cloud
Storage: Big Data, New Databases
Speed: Big Data, Real-Time
The Demand for Big Data Analytics
Accessing the Data
Via the API
Over the Network
Use Cases
Distributed Computing with CouchDB at CERN
Real-Time Retail Analytics
Millions of Farmvilles Mean Petabytes of Data Daily: How Zynga Handles Social Gaming Big Data
The Big Data Marketplace
Bigger Data and a Better Response: Earthquake Detection & Crisis Response
Conclusion
This premium report has been brought to you courtesy of HP Networking. As you
explore networking solutions for your enterprise, don’t miss HP’s helpful resource
located at the end of this report, HP’s FlexFabric, which explores the next-generation,
highly scalable data center fabric architecture.
ReadWriteWeb | The Age of Exabytes | 1
Introduction: The Rise and Scope of Big Data
To bytes, the basic unit of computing, we have rapidly added
new prefixes as the development of computer technology has
expanded the units of storage. From kilobytes (1000 bytes),
we’ve moved on to megabytes (1000 KB), gigabytes (1000 MB),
and terabytes (1000 GB) of data to “big data,” petabytes (1000
TB), exabytes (1000 PB), zettabytes (1000 EB), and to the as yet
unfathomable yottabyte (1000 ZB). This year, estimates put the
amount of information in existence at 1.27 zettabytes. One page
of typed text, by comparison, is roughly 2 kilobytes of data,
while all the books catalogued in the U.S. Library of Congress
total around 15 terabytes. Dwarfing that is the approximately 1
petabyte of data processed per hour by Google.1
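The prefixes above lend themselves to a quick sanity check. The following is a minimal sketch that uses only the figures quoted in this introduction; the constant and function names are illustrative and do not come from any library.

```python
# Decimal prefixes as used in the text: each step is a factor of 1000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
BYTES_PER = {name: 1000 ** (i + 1) for i, name in enumerate(PREFIXES)}


def to_bytes(amount, prefix):
    """Convert an amount such as (15, "tera") to a byte count."""
    return amount * BYTES_PER[prefix]


page = to_bytes(2, "kilo")                   # one typed page, per the text
library_of_congress = to_bytes(15, "tera")   # catalogued books, per the text
google_per_hour = to_bytes(1, "peta")        # processed per hour, per the text

pages_in_library = library_of_congress // page
libraries_per_hour = google_per_hour / library_of_congress
print(f"{pages_in_library:,} typed pages fit in 15 TB")
print(f"{libraries_per_hour:.1f} Libraries of Congress processed per hour")
```

On these figures, the Library of Congress holds the equivalent of 7.5 billion typed pages, and Google processes roughly 67 times that volume every hour.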
These amounts of data, while almost mind-boggling, are nonetheless growing at an
exponential rate. Eight years ago, there were only around 5 exabytes of data online.2 Just two years ago,
that same amount of data passed over the Internet in a single month. And recent estimates put
monthly Internet data flow at around 21 exabytes.3
Certainly, some industries, such as science and finance, have long had to wrestle with storing and
processing massive amounts of data. But even there, the need for more speed and more storage has
grown. Walmart, for example, must handle more than 1 million customer transactions per hour. The
process of decoding the human genome required the computing power to analyze 3 billion base pairs —
something that took 10 years the first time it was done in 2003, but can now be achieved in one week.4
Clearly, to meet these sorts of needs, computing power and storage have improved substantially, a
marker of Moore’s law, which is popularly taken to mean that the processing power and storage capacity
of computer chips double, or their prices halve, roughly every 18 months. And the technology has in turn facilitated this
explosion of data. But that’s only part of the picture.
1 http://www.economist.com/node/15557421
2 http://www.readwriteweb.com/archives/the_coming_data_explosion.php
3 http://www.pcmag.com/article2/0,2817,2361820,00.asp
4 http://www.economist.com/node/15557443
The data that is being generated today isn’t just “big,” it’s different, and much of it is unstructured. Older
collections of data are now being digitized, such as the efforts of Project Gutenberg to digitize and
archive the world’s literary works. And many more people than ever before have access to technology
tools. The UN estimates that there are 5 billion mobile phone subscriptions worldwide
(although many people have more than one, so that doesn’t quite mean that mobile phones have
so completely saturated the world market of 6.8 billion people).5 Billions of people use the Internet,
and with the rise of digital literacy and of social networking, more and more people are creating and
uploading more and more data. There are 500 million registered Facebook users, for example, sharing
3.5 billion pieces of content weekly and uploading 2.5 billion photos every month, which Facebook
in turn serves up at a rate of about 1.2 million photos per second.6
With the increase in mobile device use in particular, human data creation has soared. Add to that
the input from radio-frequency identification (RFID) and wireless sensors — the 35-some-odd billion
devices connected to the Internet that are a source of information that is predicted to outpace the
generation of data from humans — and clearly data gathering has become ubiquitous.7
This explosion of data, in both its size and form, causes a multitude of challenges for both people
and machines. No longer is data something accessed by a small number of people. No longer is the
data that’s created simply transactional information. And no longer is the data predictable, whether
in how or when it’s written, or in who or what is going to read it. Furthermore, much of this data is
unstructured, meaning that it does not clearly fall into a schema or database. How can this data move
across networks? How can it be processed? The size of the data, along with its complexity, demands
new tools for storage, processing, networking, analysis and visualization.
This report will survey some of the developments underway to address these challenges: the
challenges of computing in the exabyte era.
5 http://www.physorg.com/news185467439.html
6 http://www.readwriteweb.com/cloud/2010/08/how-facebook-scales-with-open.php
7 http://www.readwriteweb.com/archives/cisco_futurist_predicts_internet_of_things_1000_co.php
Innovations in Storage
STORAGE: AT THE CHIP LEVEL
Gordon Moore, the co-founder of Intel, predicted in a research paper in 1965 that “the number of
transistors incorporated in a chip will approximately double every 24 months.” Moore’s Law, as it’s known,
is generally accepted by the computer industry, which has seen steady growth in the processing power and storage
capacity of computer chips. Many analysts, however, predict that data is now being created
at a pace that will exceed Moore’s Law.
This poses a challenge to chip-makers who are researching new storage and storage reduction
technologies. After all, there are physical limitations to the miniaturization of transistors, a point that
some predict could be reached by 2020. So while Moore’s Law has driven the computer industry for over
40 years, if storage capacity and processing power are to continue growing, innovations must occur not just in
terms of dimensions and scaling but in terms of alternate computing mechanisms and logic devices.
Hewlett Packard, for example, has reported advances in the design of a new class of diminutive switches
that would be capable of replacing transistors and could help shrink computer chips closer to
the atomic scale.8 The devices, known as memristors, or memory resistors, are modeled along the lines
of biological systems. These are purportedly simpler than today’s semiconducting transistors, can store
information even in the absence of an electrical current and can be used for both data processing and
storage applications.9 Researchers also say they have devised a new method for storing and retrieving
information from a vast three-dimensional array of memristors, something that could allow designers to
stack switches beyond the limitations of two-dimensional scaling.
A different approach is being taken by researchers at IBM, Intel, and others, who are investigating a
type of storage called “phase-change memory.” PCM offers high performance along with low power
consumption, combining the best attributes of NOR, NAND and RAM — fast read and write speed, non-
volatility, bit-alterability and good scalability, for example — within a single chip. Unlike flash memory
technology, for example, PCM allows stored information to be switched from one to zero or zero to one
without a separate erase step. And unlike RAM, PCM does not require a constant energy supply.10
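The difference an erase step makes can be illustrated with a toy model. This is only a sketch under stated assumptions: it models flash the way NAND-style parts generally behave (programming can only clear bits, and a block erase is needed to set them again), and both classes and their names are hypothetical rather than any vendor's interface.

```python
class FlashBlock:
    """Toy flash block: programming can only clear bits (1 -> 0)."""

    def __init__(self, size=8):
        self.cells = [1] * size      # the erased state is all ones
        self.erase_count = 0

    def erase(self):
        self.cells = [1] * len(self.cells)
        self.erase_count += 1

    def write(self, bits):
        # If any cell would have to go 0 -> 1, the whole block is erased first.
        if any(old == 0 and new == 1 for old, new in zip(self.cells, bits)):
            self.erase()
        self.cells = [old & new for old, new in zip(self.cells, bits)]


class PcmBlock:
    """Toy phase-change block: every bit can be flipped in place."""

    def __init__(self, size=8):
        self.cells = [0] * size
        self.erase_count = 0         # stays 0: there is no erase step

    def write(self, bits):
        self.cells = list(bits)


first = [1, 0, 1, 0, 1, 0, 1, 0]
second = [0, 1, 0, 1, 0, 1, 0, 1]
flash, pcm = FlashBlock(), PcmBlock()
for data in (first, second):
    flash.write(data)
    pcm.write(data)
print(flash.cells == pcm.cells, flash.erase_count, pcm.erase_count)
```

Both blocks end up holding the same data, but the toy flash block needed an erase to get there while the toy PCM block did not.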
8 http://www.nytimes.com/2010/04/08/science/08chips.html
9 http://www.nature.com/nature/journal/v464/n7290/full/nature08940.html
10 http://www.enterprisestorageforum.com/technology/features/article.php/3862741/Phase-Change-Memory-The-Next-Big-Thing-in-Data-Storage.htm
And earlier this year, researchers at the Tyndall National Institute in Cork, Ireland, announced they had
created the world’s first junction-less transistor. Current transistors are based on junctions, which are
formed by placing two pieces of silicon with different polarities side-by-side. Controlling the junction
allows the current in the device to be switched on and off. The new transistor technology uses a control
gate around a silicon nanowire that can tighten around the wire to the point of closing down the passage
of electrons without the use of junctions or doping.11
As researchers pursue different solutions to the question of building computer chips with better
processing and storage capabilities, they must address not just performance, but cost and power
consumption.
STORAGE: AT THE DATA CENTER LEVEL
The impact of Moore’s Law does not occur simply at a chip level, of course. The increase in computer
power at lower cost has, in part, spurred this data explosion, which in turn has demanded the building
of more computers, more servers, more data centers. So at the other end of the spectrum from the
innovations happening to storage at the chip level are the massive data centers that house thousands of
chips on thousands of servers.
While computing power has increased and the cost of chips has fallen, the cost of building and powering
data centers has increased dramatically. An analysis of Facebook’s spending posits that the company will
spend about $50 million this year on data centers — a figure that has more than doubled since similar
estimates for 2009.12 No longer is the bulk of the expense of those facilities merely a question of large and
powerful equipment. (In fact, those figures from Facebook do not include equipment.) Rather, it is this
equipment’s skyrocketing demand for electricity, for both power and cooling.
According to some calculations, for every watt of server power used at a well-managed data center,
an additional watt is consumed by the chillers, air handlers, and so on. But in many cases the energy
consumed is much higher.13 According to Greenpeace, at current growth rates data centers and
telecommunication networks will consume about 1,963 billion kilowatt-hours of electricity in 2020,
more than triple their current consumption and more than the current electricity consumption of France,
Germany, Canada and Brazil combined.14
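The rule of thumb and the projection above reduce to simple arithmetic. The following is a minimal sketch; the function name and the 500-watt server are illustrative assumptions, and the only sourced figures are the one-watt overhead and the Greenpeace projection.

```python
def facility_power(it_watts, overhead_per_it_watt=1.0):
    """Total draw when each watt of server power needs extra watts for cooling."""
    return it_watts * (1.0 + overhead_per_it_watt)


# A hypothetical 500 W server in a "well-managed" facility (1 W overhead per W).
total = facility_power(500)
ratio = total / 500   # facility power divided by server power

# Projection quoted in the text: 1,963 billion kWh in 2020, "more than triple"
# current use, so current use must be below one third of that figure.
projected_2020_kwh = 1_963e9
current_upper_bound_kwh = projected_2020_kwh / 3
print(total, ratio, round(current_upper_bound_kwh / 1e9))
```

In other words, the well-managed case doubles the electricity bill of the servers themselves, and the projection implies current consumption of less than about 654 billion kilowatt-hours.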
Energy consumption is prompting the search for more efficient ways of powering and cooling.
Data centers are being located in areas near alternative sources of energy and cooling; Google, for example, recently
announced a new center in Finland that will be cooled by sea water. Other facilities are
experimenting with using offset heat to warm nearby offices. Some researchers are investigating ways
that data centers can utilize energy from the heat to fuel cooling mechanisms, for example.15 And others
are building new and different containers for the servers so that they are less capital-intensive and can be
powered and cooled more efficiently.
11 http://www.gizmag.com/tyndall-national-institute-first-junctionless-transistor/14531/
12 http://www.datacenterknowledge.com/archives/2010/09/16/facebook-50-million-a-year-on-data-centers/
13 http://www.electronics-cooling.com/2007/02/in-the-data-center-power-and-cooling-costs-more-than-the-it-equipment-it-supports/
14 http://www.readwriteweb.com/cloud/2010/09/greenpeace-demands-facebook-un.php
15 http://www.readwriteweb.com/cloud/2010/08/turning-a-data-center-into-its.php
STORAGE: VIRTUALIZATION AND THE CLOUD
One of the factors that has contributed to the explosion of data is the increasing adoption of
virtualization. Virtualization allows companies to take advantage of greater storage and processing
capabilities without having to run their own physical machines. Virtualization, and the cloud computing built on it, has
created many opportunities for businesses to leverage elastic computing to do things otherwise
not possible because of the costs of building and maintaining their own hardware infrastructure.
Although it’s common practice for many companies to move to dedicated data centers once they
reach a certain size, many companies are running quite sizable businesses on public clouds. Playfish,
for example, one of the largest social gaming companies, runs its operations with Amazon Web
Services.16
Cloud computing facilitates the speed with which new companies and new processes can be set up, as
new servers can be launched and scaled with ease. As cloud computing allows for scaling to happen
horizontally and not just vertically, it has, along with other developments in distributed computing,
provided new ways of thinking about how data can be stored and processed.
STORAGE: BIG DATA, NEW DATABASES
It’s no surprise that as data has grown, databases have had to adapt. One example of the innovation
occurring in recent years is the number of new databases that break from the relational database
management system (RDBMS) model. The latter has a long history, dating back to the 1970s. In a
relational database, data is stored in the form of tables, as are the relationships among the data. This
system has worked well to handle transactional and structured data.
But as the amount of information, the kind of information, and the number of users accessing the
information have grown, the relational database has faced some challenges. With new data comes new
storage demands. And the traditional RDBMS is not optimized for the kind of environment that big
data and cloud computing have created — one that’s elastic and distributed.
Traditional RDBMS software, such as MySQL, can handle huge amounts of data but often requires
extensive knowledge to manage. MySQL in particular is well known by many developers and has
remained the data storage choice for many people. But a growing number of “NoSQL” — “Not Only
SQL” — alternatives have been developed in the last year or so. These databases are designed to be
Web-scale. They can be characterized as non-relational, distributed and horizontally scalable. Many of
them are open source.
Examples of NoSQL databases include CouchDB, MongoDB, Membase, and Redis.
Perhaps because the name contains “No,” these new technologies have met with skepticism
from those who do not want to abandon the relational database. Often, though, it’s not a choice
between one or the other: many businesses operate with a combination, keeping some data
in an RDBMS and other data, better suited to it, in a NoSQL datastore.
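The contrast between the two models can be sketched in a few lines. This is an illustration under stated assumptions rather than a benchmark: SQLite stands in for a relational database, a plain dictionary of JSON strings stands in for a document-style NoSQL store, and the table and field names are invented for the example.

```python
import json
import sqlite3

# Relational side: a fixed schema, with the relationship expressed through keys.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 19.99), (11, 1, 5.00);
""")
row = db.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
""").fetchone()

# Document side: one self-describing record, with no schema declared up front.
document_store = {}
doc = {"_id": "customer:1", "name": "Ada",
       "orders": [{"total": 19.99}, {"total": 5.00}],
       "tags": ["newsletter"]}          # an extra field needs no migration
document_store[doc["_id"]] = json.dumps(doc)

loaded = json.loads(document_store["customer:1"])
doc_total = sum(order["total"] for order in loaded["orders"])
print(row[0], row[1], round(row[2], 2), round(doc_total, 2))
```

The relational version answers the question with a join across tables, while the document version keeps everything about the customer in one record that can be stored on, or moved to, any node.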
16 http://www.readwriteweb.com/cloud/2010/08/playfish-shows-how-games-as-a-.php
Speed: Big Data, Real-Time
The storing of exabytes of data is only part of the challenge,
as the demands aren’t merely to be able to warehouse big
data, but to be able to process and analyze it. Furthermore, the
demands for read and write access are often real-time.
Just as it requires better storage, big data requires better processing power,
something accomplished at the level of the processor and up through the system. With the advent of
networking, one of the ways in which computational power is increased is distributed computing.
That is, processing is not necessarily done on a single powerful mainframe computer, but is instead
distributed across a number of computers, or nodes, in clusters. With distributed computing, a problem is
divided into many tasks, each of which is solved by one computer.
According to one report, for example, an ordinary Google search query involves between 700 and
1,000 servers, all so that a response can come back in under half a second.17
To perform tasks like this, Google has built MapReduce. MapReduce is a framework for processing
huge datasets by using a large number of computer nodes applied to certain kinds of distributable
problems. In this way, computational processing can occur on structured or unstructured data.
The advantage of MapReduce is that it allows for distributed processing of the map and reduction
operations. The terms “map” and “reduce” refer to steps the tool takes to distribute, or map, the input
for parallel processing, and then reduce, or aggregate, the processed data into output files. In other
words, during the map step, a master node takes the input and chops it up into small sub-problems,
then distributes those to worker nodes. In the reduce step, the master node then takes all the answers
to the sub-problems and combines them to get the answer to the problem it was originally trying to
solve.
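The map and reduce steps described above can be sketched with a word count, the customary first example. This is a minimal single-machine illustration and not Google's implementation: a thread pool stands in for the worker nodes, and all function names are chosen for this example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: a worker turns its chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group intermediate pairs by key so each word's counts end up together."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_group(word, counts):
    """Reduce step: combine all partial answers for one key."""
    return word, sum(counts)


def word_count(chunks, workers=4):
    # Threads stand in for the worker nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return dict(reduce_group(w, c) for w, c in shuffle(mapped).items())


chunks = ["big data big tools", "data at web scale", "big scale"]
counts = word_count(chunks)
print(counts["big"], counts["data"], counts["scale"])
```

Because each chunk is mapped independently, adding workers speeds up the map step without changing the result, which is the property that lets the approach spread across many machines.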
Some have posited that MapReduce is inefficient, but a large server farm like those operated by
Google can reportedly use MapReduce to sort a petabyte of data in only a few hours. And the
MapReduce framework has greatly influenced the development of other new tools to
handle big data.
17 http://news.cnet.com/8301-10784_3-9955184-7.html
Another important tool recently developed to handle large amounts of data is Hadoop. Derived
from MapReduce, Hadoop is an open source project that, like MapReduce, handles large files across
multiple machines. Hadoop consists of two key services: MapReduce and a data-storage system called
the Hadoop Distributed File System (HDFS). A key feature of Hadoop is that for effective scheduling
of work, every filesystem should provide location awareness: the name of the rack where a worker
node is. Hadoop applications can use this information to run work on the node where the data is, and,
failing that, on the same rack/switch, so as to reduce backbone traffic. The filesystem uses this when
replicating data, keeping different copies of the data on different racks with the goal of reducing the
impact of a rack power outage or switch failure. Even if these occur, the data may still be readable.
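Location awareness can be illustrated with a toy scheduler. The sketch below is not Hadoop's actual placement or scheduling policy; the cluster layout, node names and both functions are hypothetical, and they only mirror the two ideas in the paragraph above: spread copies across racks, and run work as close to the data as possible.

```python
# Hypothetical cluster layout: node name -> rack name.
TOPOLOGY = {"n1": "rack-a", "n2": "rack-a", "n3": "rack-b",
            "n4": "rack-b", "n5": "rack-c"}


def place_replicas(writer_node, copies=2):
    """Keep one copy on the writer's node and the rest on other racks."""
    chosen = [writer_node]
    used_racks = {TOPOLOGY[writer_node]}
    for node in sorted(TOPOLOGY):
        if len(chosen) == copies:
            break
        if TOPOLOGY[node] not in used_racks:
            chosen.append(node)
            used_racks.add(TOPOLOGY[node])
    return chosen


def pick_worker(replica_nodes, idle_nodes):
    """Prefer a node holding the data, then one on the same rack, then any."""
    for node in idle_nodes:
        if node in replica_nodes:
            return node, "node-local"
    replica_racks = {TOPOLOGY[n] for n in replica_nodes}
    for node in idle_nodes:
        if TOPOLOGY[node] in replica_racks:
            return node, "rack-local"
    return idle_nodes[0], "off-rack"


replicas = place_replicas("n1")
print(replicas)
print(pick_worker(replicas, ["n2", "n3"]))
print(pick_worker(replicas, ["n5", "n2"]))
print(pick_worker(replicas, ["n5"]))
```

With copies on two racks, the loss of either rack leaves one readable copy, and only the last call in the example has to pull data across the backbone.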
To illustrate: Hadoop was recently utilized to calculate the 2,000,000,000,000,000th digit of pi, more
than doubling the record of the previous longest calculation. Using a cluster of 1000 computers at
Yahoo, it took 23 days to calculate, something that would have taken over 500 years on a standard PC.
Rather than calculating every digit, Hadoop allowed the computers to work with a formula that turned
a complex equation for pi into a small set of mathematical steps. And then, in the end, the formula
returned just one specific piece of pi, that record-breaking digit (which is, incidentally, “0”).18
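The idea of computing one far-off digit without the digits before it can be shown on a small scale. The report does not say which formula the Yahoo team used, so the sketch below substitutes the well-known Bailey-Borwein-Plouffe (BBP) formula as an assumption; it yields hexadecimal rather than decimal digits, and this floating-point version is only reliable for modest positions.

```python
def _series(j, n):
    """Fractional part of the sum over k of 16**(n - k) / (8k + j)."""
    total = 0.0
    for k in range(n + 1):
        denom = 8 * k + j
        # Modular exponentiation keeps the numbers small for large n.
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digit(n):
    """Hex digit of pi at position n after the point (n = 0 is the first)."""
    x = (4 * _series(1, n) - 2 * _series(4, n)
         - _series(5, n) - _series(6, n)) % 1.0
    return "0123456789abcdef"[int(x * 16)]


# Each position is computed independently, so they could run on separate nodes.
print("".join(pi_hex_digit(n) for n in range(8)))
```

Because no call depends on the result of another, the positions can be handed out to as many machines as are available, which is what makes this kind of problem a natural fit for a cluster.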
But Hadoop and MapReduce are batch processes, and as such can have high latency. At the scale of
big data, speed is assessed in terms of performance — the speed with which a system answers a query.
But just as important is the idea of “speed to insight,” that is, the amount of time it takes for analysts to
glean insights from these massive data sets.
18 http://www.bbc.co.uk/news/technology-11313194
The Demand for Big Data Analytics
“Success” in big data isn’t simply a matter of building and
implementing better storage or processing tools. Success
involves being able to gain insights from the big data, and
to gain them quickly. But the scale of the data does make search,
analysis, and visualization challenging, even more so with
the demands of real-time.
Analytics have often accompanied data warehousing for sectors like finance, retail, and research.
But just as big data creates challenges for databases and processing, it also poses new problems for
analytics. Traditional databases struggle with the complexity and poor performance that result from
trying to express complex analytics in SQL. So until recently, many advanced analytics were handled
outside the database. In other words, analytics procedures and models were run on statistical analysis
platforms — and so optimizations to the database wouldn’t necessarily speed up the analysis.
Furthermore, data needed to be copied and moved from the data warehouse to a statistical platform.
Between the constraints of disk speed and network bandwidth, moving big data out of a warehouse
can be slow, a delay compounded by the time it takes a statistical platform to process the data.
These challenges have been so severe that in many cases, the depth of the analysis is compromised.
This has occurred when big data is reduced — via sampling, for example — to smaller subsets for
computation, meaning that critical insights may be overlooked. Furthermore, developers have been
forced to spend a significant amount of time modifying complex analytics in order to fit with the
limitations of traditional databases. Arguably, traditional business intelligence applications are not
designed to handle the amount or the complexity of the data, nor are they necessarily built to handle
real-time reporting. As a result, the quality of the analytics suffers.
Rather than relying on reports about past events, analytics should be based on real-time data. And rather
than arriving in periodic reports created by statisticians, this information needs to
be open to constant, on-demand analysis.
Big data analysis is changing in part due to in-database analytics, as database vendors like Aster Data
begin to add analytics to their feature lists. These vendors now support a range of analytic
queries that can be written in or converted to SQL, as well as those written in C/C++, Java, Python, Perl,
R, and other languages, and run inside their databases.
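The principle of in-database analytics can be sketched with SQLite, which lets a Python class be registered as an aggregate function. This is an analogy under stated assumptions and not how Aster Data's product works: the table, the data and the `Variance` class are invented for the example, while `create_aggregate` is part of Python's standard `sqlite3` module.

```python
import sqlite3


class Variance:
    """User-defined aggregate that runs inside the database engine."""

    def __init__(self):
        self.n, self.total, self.total_sq = 0, 0.0, 0.0

    def step(self, value):
        self.n += 1
        self.total += value
        self.total_sq += value * value

    def finalize(self):
        if self.n == 0:
            return None
        mean = self.total / self.n
        return self.total_sq / self.n - mean * mean


db = sqlite3.connect(":memory:")
db.create_aggregate("variance", 1, Variance)
db.execute("CREATE TABLE sales (store TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("north", 10.0), ("north", 14.0),
                ("south", 7.0), ("south", 7.0)])

# Only one small row per store leaves the database, not the raw records.
result = dict(db.execute(
    "SELECT store, variance(amount) FROM sales GROUP BY store"))
print(result)
```

The raw rows never have to be copied to a separate statistical platform, which is the bottleneck described earlier in this section.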
In addition to demands to deliver complex analyses on big data, there is also increased interest
in visualizations. And again, as with database and analytic technologies, many of the existing
tools have not been designed to handle massive quantities of data. Efforts like Caltech’s Large
Data Visualization Initiative are seeking to develop multiresolution visualization and modeling
technologies.19
The ability to perform analytics on big data in near-real-time will become increasingly important for
organizations, and the market opportunities are substantial for companies and data scientists who can
provide these services.
19 http://www.cacr.caltech.edu/
Accessing the Data
VIA THE API
Moving large volumes of data around can be difficult for all the reasons explained above. The
requirements for moving data have necessitated development on a couple of levels: in terms of
networking and in terms of the API.
APIs aren’t necessarily designed to solve a company’s big data problem. Nonetheless, they can be
used in a number of ways to offer developers access to all or part of a company’s data. And
as companies generate and store more data, and as data becomes a more important commodity,
having an API becomes more and more important. An API allows companies to open access to this
information not simply to internal analyses and processes, but to third-party developers as well.
Having an API has become “BizDev 2.0”. In other words, in a Web-oriented world, it’s the way business
development is done. APIs facilitate business-to-business relations by opening data and systems
to business partners. And having an API makes new queries possible (if not easier), enhancing
information discovery for companies.
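What "all or part of a company's data" might look like behind an API can be sketched as a single request handler. Everything here is hypothetical: the dataset, the field names and the `handle_query` function are invented for illustration, and a real service would sit behind an HTTP server with authentication and rate limits.

```python
import json

# Hypothetical dataset that a company keeps internally.
DATASET = [
    {"id": 1, "city": "Austin", "followers": 120},
    {"id": 2, "city": "Geneva", "followers": 45},
    {"id": 3, "city": "Austin", "followers": 980},
    {"id": 4, "city": "Cork", "followers": 310},
]
PUBLIC_FIELDS = ("id", "city")   # partners see only part of each record


def handle_query(city=None, limit=2, offset=0):
    """Answer one API call with a filtered, paginated JSON response."""
    matches = [r for r in DATASET if city is None or r["city"] == city]
    page = matches[offset:offset + limit]
    body = {
        "total": len(matches),
        "results": [{k: r[k] for k in PUBLIC_FIELDS} for r in page],
    }
    return json.dumps(body)


response = json.loads(handle_query(city="Austin"))
print(response)
```

The caller gets the answer to a specific question in a few hundred bytes, without the company shipping, or the developer downloading, the entire dataset.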
OVER THE NETWORK
The amount of data that is being generated taxes network capabilities, even with the best broadband
infrastructure. With a T1 (1.544Mbps) Internet connection, it would take approximately 82 days to
upload one terabyte of data. Even at 10Mbps, it would take almost two full weeks to do so.20
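These transfer times can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that the link runs at 80% utilization and that a terabyte is counted as 2^40 bytes. With those assumptions the arithmetic lands on the quoted figures; at full utilization and a decimal terabyte, the T1 figure would come out closer to 60 days.

```python
def days_to_transfer(num_bytes, link_bits_per_second, utilization=0.8):
    """Days needed to push num_bytes over a link at the given utilization."""
    seconds = num_bytes * 8 / (link_bits_per_second * utilization)
    return seconds / 86_400


ONE_TERABYTE = 2 ** 40           # binary terabyte, an assumption (see above)
t1 = days_to_transfer(ONE_TERABYTE, 1.544e6)
ten_mbps = days_to_transfer(ONE_TERABYTE, 10e6)
print(f"T1: {t1:.1f} days, 10 Mbps: {ten_mbps:.1f} days")
```

The same function makes it easy to see why, at petabyte scale, shipping physical disks can be faster than any ordinary network connection.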
But it isn’t just the size of the data that makes portability a problem. It’s also the rapidly increasing
number of machines that are connecting to the Internet. In August 2010, wireless analyst Chetan
Sharma reported on figures for the U.S. wireless data market, noting that mobile phone subscription
penetration had crossed 95% at the end of the second quarter of 2010. Excluding those aged 5 and
under, this means that the mobile penetration for the U.S. is now past 100%.21 But the increase in new
mobile phone subscriptions is only part of the picture. Outpacing these new human subscriptions for
the same quarter were those of “connected devices.” Even as the U.S. nears full penetration of mobile
devices, an array of other devices and everyday objects are coming online, via sensors, RFID chips —
the “Internet of Things.”
The pressures from more devices coming online are leading governments and organizations to rethink
how Internet bandwidth, wireless spectrum and Internet addresses are allocated and managed.
20 http://aws.amazon.com/importexport/
21 http://chetansharma.com/usmarketupdateq22010.htm
Use Cases
DISTRIBUTED COMPUTING WITH COUCHDB AT CERN
Scientific research has long had to wrestle with capturing, storing, managing and analyzing massive
amounts of data, but the rise of big data has taxed even the systems designed to study the intricacies
of genomes, weather patterns, outer space, and so on.
One such facility is CERN, the European Organization for Nuclear Research. Situated on the Franco-
Swiss border, CERN is the world’s largest particle physics laboratory and the site of the Large Hadron
Collider, a global scientific project that researches particle collisions using the world’s largest and
most powerful particle accelerator. The LHC produces an enormous amount of data — around 15
petabytes a year. And when the LHC was in its planning stages, CERN’s IT department quickly realized
that that amount of data was more than a data center — and perhaps even the Geneva power grid —
could handle. Instead of one large data warehouse facility, they opted for a grid computing solution,
distributing the collider data to a dozen or so data centers. CERN’s grid consists of 100,000 processors
at 140 scientific institutions in 33 countries.22
One of the LHC experiments is the Compact Muon Solenoid. In order to manage the roughly 10
petabytes of data it collects, CERN announced that it plans to deploy the NoSQL database CouchDB.23
This particular experiment requires a database solution that not only can handle large amounts of data
— often without metadata — but can distribute the data quickly in an environment in which incoming
database connections are frequently impossible. CouchDB is specifically designed for distributed
environments, and one of its key benefits is its replication and syncing features. Furthermore, the
researchers have pointed to the speed with which they can prototype tools using CouchDB.
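CouchDB's replication is driven over HTTP, which suggests how a site that cannot accept incoming connections might still stay in sync: it asks its own CouchDB to pull from a remote database. The sketch below only builds such a request and does not send it. The `/_replicate` endpoint and its `source`, `target` and `continuous` fields are part of CouchDB's HTTP API, but the host names, database names and helper function are hypothetical, and nothing here reflects CERN's actual configuration.

```python
import json
from urllib.request import Request


def build_pull_replication(local_couch, remote_db, local_db, continuous=True):
    """Build (but do not send) a request asking the local CouchDB to pull."""
    body = {"source": remote_db, "target": local_db, "continuous": continuous}
    return Request(
        url=f"{local_couch}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical hosts and database names, for illustration only.
req = build_pull_replication(
    local_couch="http://localhost:5984",
    remote_db="http://central.example.org:5984/cms_jobs",
    local_db="cms_jobs",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Since the connection is opened from inside the site, only outbound access is needed; a real deployment would also have to handle credentials and make sure the target database exists.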
REAL-TIME RETAIL ANALYTICS
Big data is poised to deliver tremendous insights about consumers’ spending patterns. Retailers have
long tracked when people spend and what they buy. After all, past shopping behavior is the best way
to predict future purchases. But marketing efforts, as the term “mass marketing” implies, have been
imprecise. Now, an incredible amount of information can be gathered about consumers’ shopping
habits: how they browse online, where they shop, when they shop, what brands they buy with what
frequency. And rather than just general demographic information gleaned after-the-fact — knowing,
for example, that a certain coupon worked well with women in their 40s — companies can drill down
into an individual consumer’s profile, and be able to serve them specifically targeted offers in real-time.
22 http://www.readwriteweb.com/archives/cern_officially_unveils_its_gr.php
23 http://www.readwriteweb.com/enterprise/2010/08/lhc-couchdb.php
For example, as Akamai’s network has grown to encompass more than 450 brands and multi-channel
Internet retailers, it has run into challenges delivering the right ad at the right time to the right
audience. Akamai must deal with up to 75 million daily events, and as its core business value relies on
being able to data-mine that information for advertisers, it needs to be able to analyze data quickly.
With the growing number of users, profiles, and transactions increasing the number of models that must be run for
these records, Akamai found that daily reporting was being delayed by up to 20 hours. Akamai recently
moved its database to Aster Data to take advantage of the company’s nCluster in order to reduce
analytics time.24
MILLIONS OF FARMVILLES MEAN PETABYTES OF DATA DAILY: HOW
ZYNGA HANDLES SOCIAL GAMING BIG DATA
One part of social networking that has seen a meteoric rise is social gaming. Some 65 million
people play Zynga’s online games every day. According to Zynga CTO Cadir Lee, 10% of the world’s
population has played a Zynga game. That’s millions of Web browsers open to millions of farms and
millions of frontiers. Players take turns; they tend crops; they send gifts. They buy millions of objects and
upgrades. Zynga says its technology supports 3 billion neighbor connections throughout its games.
And all told, it moves around 1 petabyte of data daily, using a combination of its own data centers and
a hybrid public/private cloud.
It’s a mind-boggling amount of data. And it’s a new kind of data — it’s more than simply transactional
data. And it’s accessed in many ways by many millions of users. This has required not simply massive
server resources (the company says it adds as many as 1,000 new servers every week to accommodate
traffic), but also the development of a new sort of database management system.
Zynga has been a major contributor to the open source Membase project, taking some of the concepts
of Memcached — low cost, high performance, schema-less caching — in order to develop a database
that works with similar speed, flexibility and simplicity.
Zynga needs to be able to serve up all this data not only to its millions of users. It also has to be able to
undertake analytics on the gameplay in order to, for example, design engaging and viral games and to
ascertain the points at which players are willing to purchase virtual goods.
24 http://www.asterdata.com/customers/customer-casestudies.php
THE BIG DATA MARKETPLACE
The amount of data being produced — by science, governments and social networks — has given rise
to a number of companies that are specifically geared towards the storage, sale, and analysis of data.
For example, Infochimps, a startup based out of Austin, Texas, describes itself as a marketplace for data:
“A site to find, sell, or share any dataset in the world.” Infochimps makes a variety of datasets available,
including massive data scraped from Twitter. (A recent scrape contains data about 35 million users, 500
million tweets, and 1 billion relationships between users). Some of the datasets are available for free,
and some for a price. Infochimps also makes some of the data available via an API, in lieu of sending
an entire dataset.25 Factual is another startup that is offering access to massive datasets, in this case
geolocation data, alongside an API and other tools for building geolocation applications.26
BIGGER DATA AND A BETTER RESPONSE: EARTHQUAKE DETECTION &
CRISIS RESPONSE
Although big data is often touted for its scientific and commercial implications, it has also becoming
an important tool for humanitarian purposes, as responses to recent natural disasters have
demonstrated. Open data advocates and developers have formed groups like CrisisCommons and
projects like OpenStreetMap in order to build tools that serve the public good. The World Bank, for
example, has made a substantial amount of its data open and has encouraged people to build tools that
help people understand the information and respond better to natural disasters and other crises.
25 http://www.infochimps.com
26 http://www.factual.com
21. Conclusion
We marvel at the fact that today our smartphones have far
more RAM than our first personal computers did. But with
these phones, PCs, and with other connected devices, we
are generating almost unfathomable amounts of data, and
generating a demand, in turn, for ever more storage. The
average person is uploading over 15 times more data to the
Internet today than they did just three years ago.27 And the
information uploaded by humans is dwarfed by the Internet of
Things, the networking of everyday objects.
The explosion in data is creating challenges and prompting innovation in computer storage and
processing, in terms of software, hardware and data center architecture. The desire to be able to glean
insights from all this data is also set to be a boon for analysts and statisticians. And it’s creating many
opportunities for new companies who can deliver technology products and services to help solve
some of the challenges associated with big data.
And there are plenty of challenges. Moore’s Law has so far proven accurate — processing power has
increased and costs of manufacturing computer chips have gone down. But the cost of powering the
machines has soared. And when you are handling data on an exabyte scale, the energy costs to power
and cool machines — particularly those in the massive data centers — are substantial.
Beyond power consumption, the amount of data being generated also taxes network infrastructure. The
Internet struggles to maintain speeds and bandwidth even as broadband and wireless continue their
penetration into new areas.
We have only begun to develop the tools to manage and analyze all this data. As the majority of this
data is unstructured, it has often remained beyond the scope of analysis. As the data is classified,
questions of interoperability are raised — how can we structure and classify this information so it is
usable within companies and across industries?
27 http://www.readwriteweb.com/archives/the_coming_data_explosion.php
22. But some people are cautious about the race to create and network all this data — to make this data
available and useful — particularly when it comes to personal information. How will organizations
ensure that data is kept private and secure? What sorts of controls will people have over the data they
create, over the data their personal objects create?
As we continue generating almost inconceivable amounts of information, it is clear that the data
explosion will bring about challenges for businesses and for IT departments. Big data will be a
problem that all organizations will need to address, whether “big” is on the scale of terabytes or
exabytes of data. As companies increasingly look for solutions to their big data problems, this will in
turn create opportunities for others to develop technologies and practices to best store, manage and
analyze big data.
23.
24. HP FlexFabric
Virtualize network connections and capacity—From the edge to the core
An HP Converged Infrastructure innovation primer
25. Table of contents
Data center networking dynamics 3
Introducing HP FlexFabric 3
HP FlexFabric benefits 4
The key attributes of HP FlexFabric 5
The FlexFabric evolution path 6
Deliver “networking as a service” to the Converged Infrastructure 6
26. Data center networking dynamics
The fundamental nature of data center computing is rapidly changing. The traditional model of
separately provisioned and maintained server, storage, and network resources is constraining data
center agility and pushing budget envelopes to the limit. IT organizations recognize that these
static pools of isolated resources are being underutilized, a problem that can be exacerbated when
dedicated infrastructure or computer systems are used to support different classes of data center
workloads. One response has been for IT organizations to adopt virtualization and blade
technologies, which enable a more flexible and highly utilized infrastructure. These new, more
scalable technologies can be dynamically provisioned to meet continuously evolving business
requirements. At the same time, these technologies apply new pressures to the multiple networks in
the data center, further worsening spend issues, and they increase the burden on the IT teams that
support them:
• A proliferation of virtual machines is driving much more frequent changes to network
configurations.
• Data center network processes must be coordinated through multiple IT teams and are too
time-consuming.
• Increases in server utilization require more network bandwidth per server.
• Traditional hierarchical network designs cannot scale nor provide the performance, low latency,
availability, and quality of service demanded by a virtualized data center.
• Blade technology is further escalating the number of connections to be managed and increasing
bandwidth density.
Network teams are faced with a race to build out data center network capacity and to effectively
provision connectivity at an increasing speed. To keep pace, IT organizations need a network
architecture that is more coherent, flexible, and agile. But they don’t want to give up the
stability, high availability, and security offered by the proven compute and storage networks
currently installed in their data centers.
HP is creating a new balance by combining some of the best, new, standards-based technologies with
a streamlined, modular architecture that fully optimizes virtualized resources, while meeting
business requirements for low total cost of ownership, faster time-to-service, and critical
requirements for reliability, IT governance, and compliance.
Introducing HP FlexFabric
HP FlexFabric is the next-generation, highly scalable data center fabric architecture of an HP
Converged Infrastructure. With FlexFabric, you can provision your network resources efficiently and
securely to accelerate deployment of virtualized workloads. With highly-scalable platforms and
advanced networking and management technologies, FlexFabric network designs are simpler, flatter,
and easier to manage and grow over time. This open architecture uses industry standards to simplify
server and storage network connections while providing seamless interoperability with existing core
data center networks. FlexFabric combines intelligence at the server edge with a focus on
centrally-managed connection policy management to enable virtualization-aware networking and
security, predictable performance, and rapid, business-driven provisioning of data center resources.
27. HP FlexFabric overview
HP FlexFabric brings together a highly-scalable, high performance, secure network infrastructure
with comprehensive management and policy-driven connectivity provisioning integrated into a data
center converged infrastructure.
[Figure: HP FlexFabric overview diagram, linking servers and “FC-SAN” storage through the server
edge, interconnect, and backbone. Labeled components:]
• VM Edge Access: Flexible virtual I/O, hypervisor agnostic, emerging VEPA standard support
• Intelligent Server Access (Server Edge): Flexible form factors, pragmatic storage-server I/O
consolidation, future-proofed for convergence, optimized for data center workload mobility and
utilization
• High performance Layer 2/Layer 3 Interconnect: Predictable, high-performance, high-bandwidth,
existing Layer 3 core-compatible, designed to fully exploit workload virtualization
• Highly-available data center Backbone: Carrier-class routing and wide-area connectivity
• FlexFabric Security: Virtualization-integrated security; high capacity, high performance,
highly-available threat management
• FlexFabric Management: Multi-site, multi-vendor network resource management and “Days to
minutes” rapid, dynamic, policy-driven resource provisioning, data center integration
• Converged Infrastructure/Matrix Operating Environment: Data center management and orchestration;
integrate management and administration with converged infrastructure
FlexFabric can enable your IT organization to build a wire-once data center that responds to
application and workload mobility, and provides resource elasticity. You can move your network
connections with your workloads as you migrate them across or between data centers. Also, the
fabric can stretch and reclaim pools of resources to meet rapidly changing needs. High-performance
threat management tools unify physical and virtual security into a common, extensible framework.
Dynamic provisioning capabilities fully exploit virtualized connections to achieve new levels of
data center efficiency and accelerate time-to-service. The FlexFabric management and provisioning
tools help align the fabric with governance policies and service-level agreements (SLAs), while
reducing the cost of operations.
HP FlexFabric benefits
• Improved business agility, faster time-to-service and higher resource utilization by dynamically
and securely scaling capacity and provisioning connections to meet virtualized application demands
“on the fly”
• Breakthrough cost reductions by converging and consolidating server, storage, and network
connectivity onto a common fabric with a flatter topology and fewer switches
• Predictable performance and low latency to support some of the most demanding application
workloads
• Modular, scalable, industry standards-based platforms and multi-site, multi-vendor management
tools to connect and manage thousands of server and storage devices using industry-standard
building blocks
• Investment protection for existing Layer 3 core systems with seamless compatibility and support
for open standards
• Flexibility to manage and administer server, storage, and network resources in any organizational
model, from completely separate to fully integrated, while consistently enforcing governance,
security and SLA policies
• Removal of costly and time-consuming change management processes, while reducing the number of
error-prone or conflicting configuration steps
• Support for a wide range of data center deployment models
FlexFabric delivers true “networking-as-a-service” to the various consumers of connectivity within
the data center and accelerates deployment of applications and services. It provides a unified
connectivity infrastructure across servers, storage, and networking that dynamically adapts to the
demands of the heavily virtualized and more flexible data center architectures of tomorrow, while
meeting increasing pressures for price/performance and time-to-service.
28. The key attributes of HP FlexFabric
By radically simplifying and flattening network designs and using emerging data center networking
standards, HP FlexFabric creates a more robust, flexible, and efficient data center network
infrastructure. Rather than relying on a traditional hierarchical networking architecture,
FlexFabric offers a flatter data center topology with edge intelligence, designed to complement the
intelligent virtualized network interfaces offered by the latest HP data center servers and storage
systems. This flat fabric interconnect is more fungible and provides superior network performance
and quality of service.
To manage the FlexFabric network, you can design and centrally manage fully-virtualized network
connections and resources that allow for dynamic provisioning from the edge to the core and support
for application mobility, enabling connections to move with workloads as they migrate across the
fabric. This allows resources to be created, moved, and scaled from centralized connection pools
“on the fly,” putting to work an integrated resource and provisioning management toolset.
To secure the FlexFabric network, a virtualization-integrated security framework provides business
continuity with unified, high performance physical/virtual server network security architecture.
This framework enables seamless threat management and leverages a global threat intelligence
network to block bad traffic in virtual and physical environments.
FlexFabric is designed to support a much wider set of data center architectures, workloads, and
requirements than is otherwise possible with traditional data center networking approaches. It
supports specialized back office, cloud, web, or high-performance computing models. Instead of
locking organizations into a proprietary end-to-end solution, FlexFabric gives them the flexibility
to incrementally deploy a heterogeneous data center network that meets their workload needs and
protects existing investments.
Predictable performance supports diverse workloads
A highly scalable, flat network domain enables HP FlexFabric to deliver flexible provisioning,
ultra-low latency, high performance, and fast workload mobility. The architecture provides
breakthrough cost structures by removing networking layers and complexity, and applying new
technologies including higher speed Ethernet links, active load balancing, and link aggregation
within the server edge and advanced multi-switch virtualization and management in the interconnect.
Multiple server edge and interconnect switches can be virtualized and managed as single logical
devices with improved utilization, high availability, scalability, and flexibility to handle
virtualized workloads with very high throughput. Capacity can be dynamically scaled or divided.
FlexFabric networks are designed to meet the security, resiliency, and reliability requirements
expected in today’s data center.
Open and standards-based for investment protection
FlexFabric is designed to interoperate with existing third-party Layer 3 core switches to protect
existing investments and enable smooth network migration. This standards-based approach removes the
risk of vendor lock-in and lets your organization incrementally deploy a FlexFabric network without
disruptive forklift upgrades. You can mix and match existing operational processes with new
approaches using industry-leading HP products to coordinate IT teams. Finally, this approach helps
your organization manage the high purchase, support, and operations costs associated with
proprietary environments.
Pragmatic deployment of new technologies
HP FlexFabric utilizes the latest emerging industry standards, including higher speed Ethernet
links, Virtual Ethernet Port Aggregation (VEPA), Fibre Channel over Ethernet (FCoE), and Converged
Enhanced Ethernet (CEE). The CEE standard enables Ethernet to deliver a “lossless” transport
technology with congestion management and flow control features needed in storage environments.
Leveraging FCoE today, FlexFabric server edge platforms allow for sensible storage-server I/O
consolidation with assured compatibility with existing Fibre Channel Storage Area Networks
(FC-SANs). This allows users to reduce cost and complexity without jeopardizing business
continuity. HP is championing many of these and other emerging standards in the IEEE and other
organizations, to give users a data center fabric that protects their technology investments
instead of proprietary approaches that can cause organizational disruption and wholesale equipment
replacement.
Data center-integrated management and provisioning for business agility
With management and provisioning integrated down to the component level, including networking and
virtual I/O, HP is revolutionizing data center provisioning and operation. Comprehensive network
resource management tools allow users to administer networks across multiple sites and against a
combination of HP and multi-vendor platforms
29. from a single pane of glass. Integrated FlexFabric provisioning capabilities reduce time to
service and the chance of costly errors while accelerating IT alignment with business demands and
goals.
FlexFabric enables administrators to centrally define connection and network policies that can be
dynamically matched to workloads and provisioned “on the fly” from pools of available resources.
The FlexFabric model allows a “design once, replicate many” approach to provisioning that is
optimized for workload mobility, streamlines network provisioning, and reduces the number of
error-prone or possibly conflicting configuration steps that make change management time-consuming
and costly.
FlexFabric removes a major barrier to automation and orchestration: the “all-or-nothing”
proposition organizations face with other data center management frameworks. Designed to support a
wide range of IT organizational models, FlexFabric offers interfaces designed specifically for each
operator type found in IT teams. Network administrators can provision resources in advance and make
them available to server and storage teams to utilize instantly when needed, saving time and
speeding service.
FlexFabric management integrates seamlessly across the entire spectrum of HP data center management
systems to streamline the activities of your data center IT teams without requiring extensive
overhauls of organizational structure and processes. This powerful system can automate and
coordinate network services with application deployment, and free up data center administrators
from repetitive operational activities that drain IT budgets.
FlexFabric provides open interfaces for third-party functionality that integrates application
delivery and virtualization engines. Finally, FlexFabric management is fully integrated with
industry-leading IT orchestration and management systems from HP, giving your IT staff
unprecedented control that spans networks, servers, applications, and even physical plant
attributes.
The FlexFabric evolution path
Deliver “networking as a service” to the Converged Infrastructure
FlexFabric is more than just an aspirational model of the ideal data center network. Users can
deploy networks today that deliver on the FlexFabric value proposition, aggressively or
incrementally, in keeping with overall technology and business objectives. This evolutionary and
flexible approach to data center deployment across the infrastructure puts real user needs,
investment protection, and business continuity at the top of the list of principles guiding our
vision for a Converged Infrastructure network.
Today: A network foundation for FlexFabric agility
First introduced in 2006, Virtual Connect technology is a key enabler of an integrated, data
center-aligned network, and delivers against foundational HP FlexFabric principles by providing
some of the simplest, most flexible ways in the world to provide high-performance, secure server
connectivity. With reduced complexity, improved agility, and reduced cost, Virtual Connect
radically simplifies network infrastructure and provisioning without disrupting “upstream” network
operations.
HP Virtual Connect virtualizes server edge I/O, enabling server administrators to provision Local
Area Network (LAN) and Storage Area Network (SAN) resources in advance, and then enable them when
needed. Virtual Connect enables server administrators to move workloads and virtual machines, or
add, move, or replace servers transparently to LANs and SANs in minutes without having to engage
LAN and SAN administrators.
Attacking head-on the expensive proliferation of Ethernet connections caused by increased network
capacity requirements for virtual machines, HP Virtual Connect FlexFabric modules and adapters can
reduce sprawl at the edge by 95%. Virtual Connect FlexFabric modules provide up to four physical
connections for each network port, with the unique ability to fine-tune bandwidth to adapt to
virtual server workload demands on the fly.
The system administrator can now define the hardware personalities of these connections as FlexNICs
to support only Ethernet traffic or as FlexHBAs that combine Ethernet and Fibre Channel or iSCSI
protocol support. Each connection has 100 percent hardware-level performance and provides the I/O
connections needed to take full advantage of multi-core processors and to support more virtual
machines per physical server. Each server can support many more connections (up to 40) with less
investment in expensive network equipment on the server, in the enclosure and in the corporate
network.
The bandwidth of each connection can be fine-tuned and adapted with 100 Mb increments up to 10 Gb
as workload demands change. The server comes with 10 Gb capability built into it, ready for today’s
investments in 10 Gb networks and converged fabric technologies like Fibre Channel over Ethernet.
Virtual Connect FlexFabric modules allow users to take advantage of edge convergence by providing
Fibre Channel over Ethernet (FCoE)
30. downlinks to the blades while maintaining standard and proven Ethernet LAN, Fibre Channel SAN,
and iSCSI external connections with their associated IT practices. This allows system
administrators to simplify enclosure infrastructure and lower costs by combining Ethernet, Fibre
Channel, and iSCSI protocols over one wire and managing them from a single management application
and interface. For any virtual server environment, Virtual Connect FlexFabric modules and adapters
are simply some of the most affordable, flexible, and power-efficient solutions available from any
blade portfolio.
For organizations preferring a traditional server edge implementation, network management and
design methodology, HP offers scalable blade-based switching. For users looking to achieve high
levels of server connectivity consolidation and top-of-rack switch platforms that deliver high
performance, advanced multi-switch virtualization, and flexible connectivity, options like FCoE
that provide cost-effective storage-server I/O consolidation and 1 Gb to 10 Gb migration are
available. With the 6120 series of blade switches or the A5820 series of fixed and semi-modular
top-of-rack switches, users have multiple ways to incrementally deploy a FlexFabric server edge
that are in keeping with traditional network designs.
Complementing the FlexFabric Server Edge offering, HP offers a complete portfolio of
enterprise-class interconnect and backbone platforms that deliver aggregation, core switching, and
enterprise routing functionality. These platforms are built on cutting-edge technology and provide
industry-leading performance, lower power consumption, and lower TCO with a unified switch
operating system that lets users build simpler, flatter networks with comprehensive management.
Complete feature functionality and mission-critical high availability mean that users can deploy a
wide variety of designs to accommodate existing Layer 3 core investments or to radically simplify
the network in collapsed aggregation/core designs. Advanced multi-switch virtualization
technologies allow users to build cost-effective, large Layer 2 aggregation layers ideally suited
for large-scale virtualization installations. With a continued commitment to open standards-based
interoperability, users can easily integrate proven third-party data center applications and
technologies, and avoid vendor lock-in. These data center networking products include the A-series
of switches and routers, such as A6600/A8800 enterprise routers and the industry’s highest
performance A12500 series switches.
HP provides powerful tools for managing large-scale FlexFabric networks both in advanced Virtual
Connect-based and traditional network server edge deployments. With HP Virtual Connect Enterprise
Manager, users can manage the setup and migration of up to 16,000 Virtual Connect-based servers
from a single pane of glass. As the foundation for comprehensive network resource management across
the entire enterprise network, Intelligent Management Center (IMC) lets users manage an entire
multi-site, multi-vendor network, edge to core, from a single management console.
Securing the FlexFabric is a set of tools that brings threat management for both virtual and
physical networking together into a single, enterprise-class architecture. The HP TippingPoint
Secure Virtualization Framework lets users leverage highly scalable appliance-based Intrusion
Prevention Systems (IPS) to comprehensively secure VM-to-VM as well as inter-server and
inter-network traffic from a common IPS infrastructure. Combined with a wide range of security
subscription services that leverage a global threat intelligence network to block bad traffic in
virtual and physical environments, users can provide continuity as they scale out server
virtualization deployments.
Tomorrow: A new model for deploying networking as a service
With a vision toward provisioning of network connectivity and resources completely synchronized in
an end-to-end data center orchestration layer, HP has developed the Data Center Connection Manager
(DCM) appliance as a proof-point for how networking can be enabled to accelerate deployment of
virtualized server workloads.
HP Data Center Connection Manager begins to implement the HP FlexFabric dynamic provisioning
vision. DCM allows network architects to preconfigure server connection policies that are enforced
at the network edge through common RADIUS and DHCP standards. Virtual and physical server
interfaces are individually associated or subscribed to connection profiles from a pool of
resources by the server administrator at build time, allowing rapid, secure provisioning and
workload mobility without the repetitive manual tasks and turnaround time associated with
provisioning today. These policies can drive events directly to the HP BSA Network Automation
software product suite, enabling deep levels of dynamic automation to provision firewalls or
application delivery controllers in response to server provisioning, de-provisioning or
configuration changes. These capabilities give network administrators the power to deploy, manage,
and evolve server connectivity flexibly, quickly, and in line with business policy and demands.
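As a rough illustration of the Virtual Connect bandwidth partitioning described earlier (up to four
connections per network port, each tunable in 100 Mb increments up to 10 Gb), the following Python
sketch models a single port. It is not HP software or an HP API: the Port class, its method names,
and the connection names are hypothetical, and the rule that all connections on a port share its
10 Gb capacity is an assumption made for this example.

```python
from dataclasses import dataclass, field

PORT_CAPACITY_MB = 10_000  # one 10 Gb port, as stated in the text
INCREMENT_MB = 100         # bandwidth is tuned in 100 Mb steps
MAX_CONNECTIONS = 4        # up to four connections per network port


@dataclass
class Port:
    """Hypothetical model of one 10 Gb port split into named connections."""
    allocations: dict = field(default_factory=dict)  # connection name -> Mb

    def allocated(self) -> int:
        """Total bandwidth currently assigned on this port, in Mb."""
        return sum(self.allocations.values())

    def free(self) -> int:
        """Bandwidth still unassigned on this port, in Mb."""
        return PORT_CAPACITY_MB - self.allocated()

    def set_bandwidth(self, name: str, mb: int) -> None:
        """Create or resize a connection, enforcing the limits above."""
        if mb <= 0 or mb % INCREMENT_MB != 0:
            raise ValueError(f"bandwidth must be a positive multiple of {INCREMENT_MB} Mb")
        if name not in self.allocations and len(self.allocations) >= MAX_CONNECTIONS:
            raise ValueError(f"a port supports at most {MAX_CONNECTIONS} connections")
        # Assumption: connections on a port share its total capacity.
        others = self.allocated() - self.allocations.get(name, 0)
        if others + mb > PORT_CAPACITY_MB:
            raise ValueError("allocation would exceed the port's 10 Gb capacity")
        self.allocations[name] = mb


# Example: divide one port among four connections, then rebalance.
port = Port()
port.set_bandwidth("management", 500)
port.set_bandwidth("vm_traffic", 4000)
port.set_bandwidth("storage", 4000)
port.set_bandwidth("migration", 1500)
print(port.free())  # 0: the port is fully allocated

port.set_bandwidth("migration", 1000)   # shrink one connection...
port.set_bandwidth("vm_traffic", 4500)  # ...and give the freed 500 Mb to another
```

Resizing an existing connection only has to fit alongside the other connections, which is what lets
capacity move from one workload to another without tearing anything down first.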
32. If you liked this report, check out our other reports:
Guide to Online Community Management
Edited by Marshall Kirkpatrick, May 2009
Our first premium report for businesses comes in two parts: a 75 page collection of case studies,
advice and discussion concerning the most important issues in online community; and a companion
online aggregator that delivers the most-discussed articles each day written by experts on
community management from around the Web.
http://www.readwriteweb.com/reports
The Real-Time Web and its Future
Edited by Marshall Kirkpatrick
Real-time Web technologies and applications have the potential to change everything, at a real-time
pace. If you are a CTO, work in development, marketing or you are planning your next website or
mobile application upgrade, you need to know about the real-time Web.
http://www.readwriteweb.com/reports
Augmented Reality for Marketers and Developers: Analysis of the Leaders, the Challenges and the
Future
Written by Chris Cameron
AR offers a new paradigm for high impact, high value customer experience. Decrease your AR
development time to market by learning from the first wave of early adopters to this new
technology. In this ReadWriteWeb Premium Report we profile successful companies and their campaigns
as well as development lessons learned.
http://www.readwriteweb.com/reports