School of Computer Science and Statistics
TRINITY COLLEGE
What is the future of the RDBMS in the Enterprise?
Stuart Clancy
Edward Fitzpatrick
Degree: BSc (Hons) Information Systems
Date: 11th April 2011
A Dissertation submitted to the University of Dublin in partial fulfilment of the
requirements for the degree of BSc (Hons) Information Systems
Date of Submission: 11th April 2011
Declaration
We declare that the work described in this dissertation is, except where otherwise stated,
entirely our own work, and has not been submitted as an exercise for a degree at this or any
other university.
Signed:___________________
Stuart Clancy
Date of Submission:
Signed:___________________
Edward Fitzpatrick
Date of Submission:
Permission to lend and/or copy
We agree that the School of Computer Science and Statistics, Trinity College may lend or
copy this dissertation upon request.
Signed:___________________
Stuart Clancy
Date of Submission:
Signed:___________________
Edward Fitzpatrick
Date of Submission:
Acknowledgements
We would like to acknowledge and thank Ronan Donagher, our project supervisor, and Diana
Wilson, the acting course director, for their support, guidance and understanding throughout
our research project.
We would also like to acknowledge the unfailing support of our families, who have
encouraged us throughout the years of our study; our employers and work colleagues, who
have been patient and flexible with working arrangements in order to allow us to complete
our studies; and close friends who on occasion are called upon to provide a welcome
distraction and perspective.
Signed:___________________
Stuart Clancy
11th April 2011
Signed:___________________
Edward Fitzpatrick
11th April 2011
Abstract
Managing data and information has been a feature of human activity since the first
acknowledged symbols were etched onto stones by Neolithic humans. Since the emergence
of the Internet, the data available as a resource to man and machine has been growing rapidly.
This dissertation looks at what this means for the traditional relational database management
system (RDBMS). It asks if there is a future for the RDBMS in enterprise information system
architecture. It also examines the early developmental years of RDBMS in order to gain an
insight into why it has enjoyed relative longevity within a rapidly changing technology
environment. New types of database and data management systems are discussed, such as
NoSQL and other open source non-relational DBMS’ such as Hadoop and Cassandra. The
data volume and data type problem is absorbed into various sections under the umbrella term
‘Big Data’. Utility companies and social networking sites, two sectors where the
management of large data volumes is a growing concern, are examined in the two case
studies. A separate chapter on the research methodology chosen by us is included. It provides
the necessary balance between subject matter and method as set out in the initial
requirements.
Keywords:
Relational Theory, DBMS, RDBMS History, NoSQL, Hadoop, Cassandra, Database Market,
Big Data, Research Methodology.
Table of Contents
Abstract
List of Figures
List of Tables
List of Abbreviations
Chapter One - Introduction
1.1 The Research Question
1.2 Document Roadmap
Chapter Two - Literature review, findings and analysis
2.1 Introduction
2.2 RDBMS
2.2.1 History of the RDBMS
2.2.2 Main Features of ‘true’ RDBMS
2.2.3 IBM, Ellison and the University of California, Berkeley
2.3 New Databases
2.3.1 Features of NoSQL Databases
2.3.2 Hadoop
2.3.2.1 Components of Hadoop
2.3.3 Cassandra
2.4 The market for RDBMS’ and Non-Relational DBMS’
2.4.1 Introduction
2.4.2 RDBMS Market
2.4.2.1 Vendor Offerings
2.4.4 Open Source Databases
2.4.4.1 Non-RDBMS Market
2.5 Case Studies
2.5.1 Case Study 1 - Utility Companies and the Data Management challenge
2.5.1.1 Introduction
2.5.1.2 Utilities
2.5.1.3 Smart Grid - The ESB case
2.5.1.4 The Data Volume Problem
2.5.1.5 How one utility company is meeting the data volume challenge
2.5.1.6 What is the ESB doing?
2.5.1.7 Conclusion
2.5.2 Case Study 2 - Social Networks - The migration to Non-SQL database models
2.5.2.1 Facebook Messages
2.5.2.2 Twitter - The use of NoSQL databases at Twitter
Chapter Three - Research Methodology
3.1 Introduction
3.2 The strategy adopted for researching the question
3.3 A Theoretical Framework
3.4 Research Design
3.5 Methodology - A Qualitative Approach
3.6 Methods
3.6.1 Method - Analytic Induction
3.6.2 Method - Content Analysis
3.6.3 Method - Historical Research
3.6.4 Method - Case Study
3.6.5 Method - Grounded Theory
3.7 Ethics Approval
3.8 Audience
3.9 Significance of research
3.10 Limitations of the research methodology
3.11 Conclusion
Chapter Four - Conclusions, Limitations of Research and Future Work
4.1 Introduction
4.2 Conclusions
4.2.1 RDBMS
4.2.2 New DB’s
4.2.3 Market
4.2.4.1 Case Study 1 - Utility Companies
4.2.4.2 Case study 2 - Social Networks
4.3 Future Research
4.3.1 NoSQL
4.3.2 Case Studies
4.3.3 Business Intelligence
4.3.4 Research Methodology
4.4 Limitations of the Research
4.5 Final thoughts
REFERENCES
APPENDIX 1
List of Figures
Figure 2.1 - A simplified DBMS
Figure 2.2 - Overview of a generic Smart Grid
Figure 2.3 - ESB proposed implementation of Advanced Metering
Figure 2.4 - Smart Meters transaction rate
Figure 2.5 - Smart Meters data size
Figure 2.6 - Sources of Smart Grid data with time dependencies
List of Tables
Table 2.1 - Impact of unstructured data on productivity
Table 2.2 - Example of redundant rows in a database
Table 3.1 - Key concepts in Qualitative and Quantitative research methodologies
Table A.1 - Edgar Codd’s original relational model terms
List of Abbreviations
ACID - Atomicity, Consistency, Isolation and Durability.
ACM - Association for Computing Machinery.
BA - Business Analytics.
BASE - Basically Available, Soft state, Eventual consistency.
BI - Business Intelligence.
BSD - Berkeley Software Distribution.
CA - Computer Associates.
CAP - Consistency, Availability and Partition tolerance.
CIS - Customer Information System.
CODASYL - Conference on Data Systems Languages.
CRM - Customer Relationship Management.
DBMS - Database Management System.
DMS - Distribution Management System.
DW - Data Warehousing.
ERM - Enterprise Relationship Management.
GB - Gigabyte.
GBT - Google Big Table.
GFS - Google File System.
GIS - Geographical Information System.
HA - High Availability.
HDFS - Hadoop Distributed File System.
IA - IBM’s Information Architecture.
IBM - International Business Machines.
ISV - Independent Software Vendor.
IT - Information Technology.
KB - Kilobyte.
MB - Megabyte.
MDM - Meter Data Management.
MPL - Mozilla Public Licence.
MPP - Massively Parallel Processing.
MR - MapReduce.
MVCC - Multi-Version Concurrency Control.
NoSQL - ‘No’ SQL or, more often, ‘Not Only’ SQL.
OEM - Original Equipment Manufacturer.
OLAP - Online Analytical Processing.
OLTP - Online Transaction Processing.
OMS - Outage Management System.
OS - Operating System.
OSI - Open Source Initiative.
PB - Petabyte.
PDC - Phasor Data Concentrators.
PLM - Product Life-cycle Management.
RDBMS - Relational Database Management System.
SCADA - Supervisory Control and Data Acquisition.
SOA - Service Oriented Architecture.
SQL - Structured Query Language.
TB - Terabyte.
Chapter One - Introduction
Humans have been storing information outside of the brain since prehistory, probably since
before the first consistent markings were made on a bone found in Bulgaria and dated to
more than a million years ago. Certainly they have done so since the later Neolithic clay
calculi bearing symbols representing quantities and the cave paintings at Lascaux over
17,000 years ago, through to the invention of the moveable type printing press and
eventually the first computers. Since the emergence of the information age over the last
fifty years or so, the amount of data transferred and stored in
computers has grown rapidly. Research from the International Data Corp (IDC) in 2008 puts
that growth at 60% per annum (The Economist, 2010).
An added complexity is that executive strategies now have business intelligence for
competitive edge as a key goal. Data management systems that for many years have been the
old reliable workhorse toiling away in the back end somewhere are once again playing a key
role in driving business growth. The question is, are they still capable of carrying out this new
and challenging task? This dissertation asks that question and, more specifically: what is the
future for the Relational Database Management System (RDBMS) in the Enterprise?
The data volume problem now has a name: ‘Big Data’. Its nascence coincides with the growth
of the Internet. Alternative solutions to the traditional RDBMS for dealing with ‘Big Data’
soon followed. Many of these solutions are based either on massively parallel processing
(MPP, a.k.a. distributed computing) or on flipping the row store of the RDBMS into a column
store system. More recently, MPP solutions are being positioned not as alternatives to but as
complements of the RDBMS (Stonebraker et al., 2010). Add to this mix a dynamic data
management market where vendors are acquiring new technology, merging with each other,
adopting open source and creating hybrid stacks in an effort to gain advantage in a market
forecast to grow to $32 billion by 2013 (Yuhanna, 2009).
1.1 The Research Question
Time was taken to frame our research question carefully so as to provide a clear path of
exploration on the subject. The subject could have been framed as a predictive hypothesis
such as “The future for the RDBMS in the Enterprise is looking bright”, or as a contrary
statement, “The end is nigh for the RDBMS”. We chose to frame our research as an open-ended question to
allow for a broad exploration of the subject with no preconception of the outcome. The
broadness of scope, however, is necessarily tempered by restricting our research to those
organisations defined as enterprises. There is a difficulty here, as there is no overarching
definition of an enterprise organisation. It is nevertheless necessary to provide some clearly
defined boundaries around the term. For this dissertation an enterprise is defined not by size
or function alone.
Enterprises for us are organisations where the scale of control is large. They include
companies with a large number of customers and employees, as well as companies that
control a large infrastructure or several functional units. Enterprises have one top-level
strategy to which all other functional units are aligned. The last point is an important
characteristic of an enterprise for our dissertation as it applies to decision making for
acquiring information management systems.
The presence of the word 'future' is central to locating the research in an exploratory and
intuitive research domain. It prompts looking into the past in an attempt to explain the present
and predict the future. It forces an open mind and questioning approach. It enables the
creation of new ideas which are either taken on or set aside for another time. The chapters
and sections are set out below in an attempt to follow this map in the view that the journey is
the objective rather than the destination.
1.2 Document Roadmap
In writing this dissertation a balance was sought between addressing the issues raised by the
initial question and the research methodology chosen. The bulk of this dissertation therefore
centres on those two areas. In this chapter we introduce the concept of our research and
explain why we find it interesting. The research question is explained and the objective is put in
context. Chapter two contains the literature review. The chapter begins with an outline of
RDBMS, its features and history of development. Particular attention is given to the role of
IBM in the development of RDBMS. The chapter moves on to discuss new databases and
data management systems. A section on the DBMS market follows and presents an overview
of the current vendor offerings. The market section does not attempt a comparison of
available systems as this work was carried out in greater detail by others more expert than us.
Throughout the dissertation we refer the reader to such work where it is not feasible for us to
reproduce it.
Two case studies are included for the benefit of putting the research question in a practical
context. The two areas chosen involve contrasting enterprises. On one hand there is the
relatively long established utilities sector and on the other the new phenomenon of social
networking and its associated companies. Even though they operate in widely different
markets generating different types of data, they both share similar problems when it comes to
managing large amounts of data. Likewise, both are trying to get to grips with extracting
value out of data for competitive edge.
Chapter three discusses the research methodology we chose. It deserves a chapter to itself in
view of the objective of this dissertation. The chapter begins with an introduction on research
theory. It then moves to a discussion on our research strategy. A research framework is
introduced as a model of our strategy. The different methodologies available are outlined and
our chosen option is explained. Next, a group of related research methods are outlined and
the reason for their selection is stated. Short sections on ethics approval, audience and the
significance of the research follow before a final section on the limitations of our chosen
research methodology closes the chapter.
The final chapter attempts to pull together the conclusions and findings from all the
previous sections. Relevant research threads and ideas not covered in sufficient detail in the
dissertation are mentioned. The last sections present a summary of the limitations of the
overall research and our final concluding thoughts.
Chapter Two - Literature review, findings and analysis
2.1 Introduction
In this section the focus is on RDBMS. The intention is to provide an overview of its defining
features. It is not an in-depth technical analysis of the RDBMS; we would instead refer the
reader to better papers on the subject, such as those published in the Communications of the
Association for Computing Machinery (ACM), to which we refer several times. It also sets
out the background to the development of RDBMS. Within that context an interesting
discovery is made with respect to IBM’s initial role in the development of database
management systems. For the purpose of exploring the question on the future of RDBMS
some associated concepts are discussed such as data types, ‘true’ RDBMS, and whether or
not the past can teach us something about the future.
2.2 RDBMS
Databases
It is unfortunate that in the realm of Information Technology (IT) acronyms are not always self-
explanatory. Many such acronyms don’t travel outside of their specific domain very well.
Take for example DQDB or Distributed Queue Dual Bus; outside of the world of high speed
networks this may seem to be a very efficient urban transport vehicle. Luckily the term
RDBMS contains within itself the individual components which define it: a system (S)
composed of a database (DB) where information is stored by creating relationships (R)
between data elements and which can be managed (M) by users. It is helpful at this point to
explain the hierarchy, at least, of each of these components.
Throughout this dissertation data (singular: datum) and information are taken to be
classifications of entities stored in a system. Data is the lowest in the sense of the taxonomy
data - information - knowledge - wisdom (sometimes called understanding), but not lower in
real value; a single digit integer may be enough data to invoke the required wisdom to make
an important decision. For the purpose of simplicity, data here means a binary entry (such as
yes or no, 1 or 0), or a nominal entry (such as dog, 470, Smith, XRA9000 etc.). An analogy
from biology might see data as the molecules which make up a cell of information. The word
‘molecules’ is carefully suggested instead of ‘atoms’ given that ‘atomicity’ has particular
significance for relational databases. Permitting an extension of the analogy would see a body
of knowledge built from the cells of information. It would be unwise to stretch the analogy
further to address wisdom. Unhelpfully, the words ‘data’ and ‘information’ are often
interchangeable terms in research literature. Some examples of this are the concepts ‘Big
Data’ and ‘unstructured data’ for what really ought to be called information. For this reason
and for the purpose of consistency this dissertation will hold with the literature and consider
the two terms as one except where a distinction is required.
A database has been defined in a number of sources as a “collection of related data or
information” (Bocij et al. 2006, p. 153; Elmasri and Navathe, 1989, p. 3).
The Oxford English dictionary defines a database as a “structured set of data held in a
computer” (OED). However, the Cambridge Advanced Learner’s online dictionary (2011)
definition is perhaps closer to a contemporary definition:
“A large amount of information stored in a computer system in such a way that it can
be easily looked at or changed”.
It is noted that the definition in the later online edition of the Cambridge (2011) does not have
any explicit reference to relational, structured or organised data. This looser definition
reflects the changing nature of data management as newer types and bigger volumes of data
are being captured.
Finally, a definition from the business world expands on the above, mentioning different
types of data and hinting at the issues regarding scale:
A database is “a systematically organized or structured repository of indexed information
(usually as a group of linked data files) that allows easy retrieval, updating, analysis, and
output of data. Stored usually in a computer, this data could be in the form of graphics,
reports, scripts, tables, text, etc., representing almost every kind of information.” (Business
Dictionary, 2011).
Structured and unstructured data.
The last definition above alludes to unstructured data. Unstructured data is data in the form of
text (words, messages, symbols, emails, sms texts, reports) or bitmaps (images, graphics). A
good example of the growing relevance of unstructured information is a Facebook page
containing images, short messages, links, and chunks of text that can be altered at any time.
Structured data by contrast is any data “that has an enforced composition to the atomic data
types” (Weglarz, 2004). Atomicity is the characteristic of a stored entity that is not divisible
(Elmasri and Navathe, 1989, p. 41). Atomicity is a key necessity for defining structured data
and is what relational databases rely on to make relationships. A database designer can decide
on the exact rules for the structured data and the level of atomicity required. As an aside, it is
often this small amount of flexibility in the design of the data model which is responsible for
the creation of many ‘bad’ databases. Structured data is data that is consistent, unambiguous
and conforms to a predefined standard. Structured data will be examined in more detail later
under the section discussing RDBMS. A third type is semi-structured data. This is data held
in a standard format such as forms, spreadsheets and XML files. This type of data can be
parsed by computer programs more easily than unstructured data due to the data generally
being located in a fixed and known place, even if the data itself is not atomic.
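To make the distinction concrete, here is a minimal Python sketch (with invented data and field names, for illustration only) contrasting the three forms:

    import xml.etree.ElementTree as ET

    # Structured: atomic values with an enforced composition (fixed fields, fixed types).
    structured = {"surname": "Thomas", "grade": 82}

    # Semi-structured: a standard container format (here XML); easy to parse because
    # each value sits in a fixed, known place, even if the value itself is not atomic.
    semi = "<student><name>William Thomas</name><note>did well overall</note></student>"
    print(ET.fromstring(semi).findtext("name"))   # -> William Thomas

    # Unstructured: free text; recovering the grade requires semantic analysis.
    unstructured = "William performed well this term, scoring in the low eighties."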
The problem of structured versus unstructured data types can be stated using the example of
two schools. One school grades students in the traditional way by giving a numerical grade
following examination. Another school does not give numerical grade to students, preferring
a method whereby students are furnished with a qualitative report on their overall
performance. The former is structured data as the meaning of a grade of 82% is consistent in
the context of the school's grading system. It can be easily recorded, measured, and compared
to other grades internally or from other schools using the same system. The report format
however is unstructured and comparison with a numerical grading system is not so easy.
Gleaning relevant information from a text report is complex and involves semantic analysis
with or without the help of technology.
What does this mean for enterprises?
Eighty percent of information relevant to business is unstructured and is mostly in textual
form (Langseth in Grimes, 2011). Seth Grimes, an analytics expert with the Alta Plana
Corporation, has previously investigated this claim. He concludes that even if the origins of
the 80% figure are elusive (Grimes traces it back as far as the 1990s), experience supports the
claim (Grimes, 2011). Patricia Selinger (IBM and ACM Fellow), who has worked on query
optimisation for 27 years, puts unstructured data in companies at about 85% (Selinger, 2005).
Even assuming a
lower figure than 80% for unstructured data in larger enterprises, where much information is
in structured forms held in traditional transaction based databases, there is still the problem of
how to leverage competitive advantage out of the nuggets of information buried in the rich
seams of unstructured data. Businesses are realising that the chances of extracting valuable
wisdom from traditional data stores using stale analysis methods and tools are diminishing
and that new ideas are needed.
Unstructured data is growing faster than structured data, according to the "IDC Enterprise
Disk Storage Consumption Model" 2008 report, “while transactional data is projected to
grow at a compound annual growth rate (CAGR) of 21.8%, it's far outpaced by a 61.7%
CAGR prediction for unstructured data” (Pariseau, 2008).
Kevin McIsaac (2007) of Computerworld magazine puts it into perspective:
“Unfortunately business is drowning in unstructured data and does not yet have the
applications to transform that data into information and knowledge. As a result staff
productivity around unstructured data is still relatively low.”
McIsaac gives examples of the impact of unstructured data on productivity, citing research
from various sources. Table 2.1 below summarises those impacts:
Time/Volume | Impact on productivity | Research Source
9.5 hours per week | Average time an office worker spends searching, gathering and analysing information (60% of that on the Internet) | Outsell
10% of working time | Time professionals in the creative industry spend on file management | GISTICS
600 e-mails per week | Sent and received by a typical business person | Ferris Research
49 minutes per day | Time an office worker spends managing e-mail; longer for middle and upper management | ePolicy Institute
Table 2.1 - Impact of unstructured data on productivity.
Where are the joins?
It seems that a reappraisal of what a database is or needs to do is well under way. If this is so,
then this reappraisal logically extends to the database management system. Structured data
can be joined to other structured data to form concatenations of information using a query
language based on mathematical operations. Things get a little more ‘fuzzy’ with
unstructured data. Stock market analysts might like to try querying online media sources
for all posts where the word ‘oil’ is used, but only in the context of the recent crisis in Libya.
How unstructured and unrelated data is to be stored in the system and how meaningful
information can be retrieved back out of that same system are questions many organisations
are now asking – but, similar questions were asked before and the past may hold some
lessons for us.
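The toy Python sketch below (the posts are invented) shows how crude a purely keyword-based attempt at such a query is; a real system would need semantic analysis to catch paraphrases and exclude false matches:

    # Find posts mentioning 'oil', but only in the context of Libya. Keyword
    # co-occurrence is a blunt stand-in for understanding context.
    posts = [
        "Oil prices spike as the crisis in Libya deepens.",
        "Grandma's secret is a spoonful of olive oil in the dough.",
    ]
    hits = [p for p in posts if "oil" in p.lower() and "libya" in p.lower()]
    print(hits)  # matches only the first post - and would still miss paraphrases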
A DBMS
In its simplest definition a DBMS is a set of computer programs that allows users to create
and maintain a database (Elmasri and Navathe, 1989, p. 4). Bocij et al. (2006, p. 154) expand
on this definition a little: “One or more computer programs that allow users to enter, store,
organise, manipulate and retrieve data from a database.”
(Source: Elmasri and Navathe, 1989 p. 5)
Figure 2.1 - A simplified DBMS
Figure 2.1 above shows the key components of a data management system. A detailed
description of each of the components of the system is not necessary for our purpose but
briefly they are:
• Application programs with which users can interact with the stored data.
• Software programs for processing and accessing the stored data.
• A high-level declarative language interface for executing commands (commonly
known as a query language).
• A repository for storing data.
• A store of information related to the data for classifying or indexing purposes (meta-data).
• Hardware suitable for each of the above functions.
• Users (including database administrators and designers).
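As a minimal illustration of how these components fit together, the Python sketch below uses the embedded SQLite engine as a stand-in DBMS (the table and data are invented for illustration):

    import sqlite3

    # The repository for storing data (an in-memory database here).
    con = sqlite3.connect(":memory:")

    # The application program (this script) uses the high-level declarative
    # language interface (SQL) to define, store and retrieve data.
    con.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("INSERT INTO student (name) VALUES ('William Thomas')")
    print(con.execute("SELECT name FROM student").fetchall())

    # The meta-data store: SQLite keeps schema information in sqlite_master.
    print(con.execute("SELECT name, type FROM sqlite_master").fetchall())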
2.2.1 History of the RDBMS
To understand why newer types of databases and data management systems are emerging and
taking hold it seems reasonable to explore why RDBMS’ came into existence, as well as their
usefulness and relative longevity.
The 1960’s BC (Before Codd)
Data management systems existed before Edgar Codd, while at IBM, wrote his seminal paper
published in 1970 called “A Relational Model of Data for Large Shared Data Banks”. Codd’s
paper presented a new database model and hence introduced the world of database
management to relational theory (Codd, 1970). In his paper Codd discusses the limitations of
the existing hierarchal and network data systems and introduces a query language based on
relational algebra and predicate calculus.
In a later important paper he described 12 rules for a relational database management system
(Codd, 1985). Systems that satisfy all 12 rules are rare. In fact, it is argued that no truly
relational database systems existed in wide commercial production even a decade after
Codd’s vision (Don Heitzmann in Thiel, 1982), and even up to more recently (Anthes, 2010).
A brief description of the two data management systems (whose limitations Codd
addressed) is a useful precursor to a broader description of relational DBMS’.
Hierarchal Data Models
Hierarchal data models are similar to tree-structured file systems in that the data is stored as
parent-child relationships. Codd asserted that the hierarchal and network based DBMS’ were
not data models at all in comparison to his more formalised Relational model (Codd, 1991). For
simplicity the word ‘model’ is maintained for the data structure of all systems under
discussion here. The model made sense to organisations that were naturally hierarchal in
nature - a legacy of Henri Fayol and his 14 management principles, popular in the 1960’s and
still used in organisations today (Stoner and Freeman, 1989; Tiernan et al., 2006). A
hierarchal data model can be presented as a tree-structure of parent-child relationships or as
an adjacency list. For example: a root entity with no parent might be SCHOOL; STUDENT is a
child of SCHOOL; GRADE is a child of STUDENT. STUDENT is also a child of COURSE.
In this type of structure data can be replicated many times in different branches of the tree, a
relationship of ‘one to many’ or 1:N. A ‘modified preorder tree traversal' algorithm is used to
number each entity on the way down through the tree-structure (left value) and again on the
way back up to the root (right value), thus making the query operations more efficient in
navigating around the data (Van Tulder, 2003).
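A small Python sketch of the idea, using the SCHOOL-STUDENT-GRADE example above (the function is our own illustrative implementation, not Van Tulder's code):

    # The tree as an adjacency list: each entity maps to its children.
    tree = {"SCHOOL": ["STUDENT"], "STUDENT": ["GRADE"], "GRADE": []}

    def mptt(node, counter=None, out=None):
        """Assign a left value on the way down and a right value on the way up."""
        counter = counter if counter is not None else [0]
        out = out if out is not None else {}
        counter[0] += 1
        left = counter[0]
        for child in tree[node]:
            mptt(child, counter, out)
        counter[0] += 1
        out[node] = (left, counter[0])
        return out

    print(mptt("SCHOOL"))
    # {'GRADE': (3, 4), 'STUDENT': (2, 5), 'SCHOOL': (1, 6)}
    # All descendants of STUDENT now satisfy a simple range test
    # (left > 2 and right < 5), with no recursive navigation needed.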
The first hierarchal DBMS was developed by IBM and North American Aviation in the late
1960’s (Elmasri and Navathe, 1989 p. 278). IBM imaginatively called it Information
Management System (IMS) and Frank Hayes dates its roll out to 1968 (Hayes, 2002). Elmasri
and Navathe cite McGee (1977) for a good overview of IMS (1989, p. 278).
Network Data Models
As can be seen in the hierarchal data model above, a child may logically have many parents. A
STUDENT, for instance, can take more than one MODULE in any COURSE YEAR. In a
hierarchal structure the same STUDENT would appear under each of the MODULE trees. In
other words many students can take many modules. The Network data model was a further
development of the hierarchal model to address the issue of managing ‘many to many’ (M:N)
relationships. The Conference on Data Systems Languages (CODASYL) defined the network
model in 1971 (Elmasri and Navathe, 1989).
Where the underlying principle of the hierarchal model was parent-child tree structures, in a
network model it is set theory. Records are classified into record types and given names.
These records are sets of related data. Record types are akin to tables in a relational database
model. The intricacies of set theory are beyond the scope of this dissertation; however, it
suffices to say that complex data combinations can be achieved by nesting record types
within other record types – data sets as members of other data sets. If this were possible in a
relational database it would be like having tables within tables within tables.
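A rough Python sketch of the M:N idea follows; the record types and data are invented, and Python sets only approximate CODASYL owner/member sets:

    # Two record types, each stored once (no duplication under each 'parent').
    students = {"s1": "Alice", "s2": "Bob"}
    modules = {"m1": "Databases", "m2": "Networks"}

    # Each module 'owns' a set of member students; a student may belong to many
    # such sets, giving a many-to-many relationship.
    enrolled = {"m1": {"s1", "s2"}, "m2": {"s1"}}

    # Navigation from owner to members, the network-model style of access:
    print(sorted(students[s] for s in enrolled["m1"]))  # ['Alice', 'Bob']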
The earliest work on a network data model was carried out by Charles Bachman in 1961
while working for General Electric. His work resulted in the first commercial DBMS, called
the Integrated Data Store (IDS). The system was cumbersome and was eventually
redeveloped for IBM mainframes by an IDS customer, the BF Goodrich Chemical Company,
into what was called IDMS (Hayes, 2002). With Bachman on board as a consultant, IDMS was
eventually commercialised by Cullinane/Cullinet Software in the 1980’s. Cullinet was bought
by Computer Associates (CA) in 1989, and IDMS remains a current CA offering for mainframe
database management today. Charles Bachman received the Turing Award in 1973
for his pioneering work in developing the first commercially available data management
system, for being one of the founders of CODASYL and for his work on representation
methods for data structures (Canning in Bachman, 1973).
The 1970’s
The Adabas DBMS was developed in the early 1970s by Software AG. It has an interesting feature of
relevance to this dissertation. Adabas was designed to run on mainframes for enterprises with
large data sets and requiring fast response times for multiple users. One of its main features is
that it indexes data using inverted-list type indexing.
Adabas also features a data storage address convertor which avoids data fragmentation. Data
fragmentation can occur when a record is updated with additional data. The record is now too
large to be stored in the original location. The data can be moved to a new location but the
indexes still expect the data to be in the same place so they also have to be updated. The
address convertor does this. The alternative as used by other systems is data fragmentation;
part of the data is stored in the original location with a pointer to where the remainder is
stored. Fragmentation and pointer methods, however, require additional processing and hence give
slower response times. The problem of using pointers in systems predating RDBMS instead
of storing data directly (in tuples as is done in RDBMS) is referred to by IBM’s Irv Traiger
(in McJones, 1997 pp. 16-17).
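The Python sketch below illustrates the indirection principle behind such an address convertor; the structures and values are invented for illustration and do not reflect Adabas internals:

    storage = {0x10: "record A, version 1"}   # physical location -> stored data
    converter = {"rec_A": 0x10}               # logical record id -> physical location
    index = {"A": "rec_A"}                    # indexes hold only stable logical ids

    # The record grows and must move to a new, larger location:
    storage[0x20] = "record A, version 2 (now larger)"
    del storage[0x10]
    converter["rec_A"] = 0x20                 # one update; the index is untouched

    print(storage[converter[index["A"]]])     # -> record A, version 2 (now larger)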
According to Curt Monash, Adabas’ inverted-list indexing is the favoured method for
searching textual content. New ideas regarding the management of text (unstructured data)
have, according to Monash, “at least the potential of being retrofitted to ADABAS, should the
payoff be sufficiently high” (Monash, Dec 8 2007).
Edgar Codd and the birth of the Relational Model
Codd’s text ‘The Relational Model for Database Management’ of 1990 (version 2, 1991)
brings together the ideas set out in his previous papers regarding the Relational Data Model for
managing databases. In it he places his model as solidly based on two areas of mathematics:
Predicate Logic and Relational Theory. In order for the maths to work effectively, there are
four essential concepts associated with the relational model: domains, primary keys, foreign
keys and no duplicate rows. In particular, the importance of domains was not fully
understood or adopted by later commercial versions of his RDBMS (Codd, 1991, p. 18).
Also, the two early prototypes, IBM’s System R and Michael Stonebraker’s INGRES at the
University of California, Berkeley, were not concerned with the need to address the issue of
duplicated rows. The designers of both systems felt that the additional processing required to
eliminate duplicate rows was unnecessary given their relatively benign presence (Codd,
1991, p. 18). Codd’s purer model based on mathematical principles gave way to
the more pragmatic needs of the commercial world.
2.2.2 Main Features of ‘true’ RDBMS
The main features of a Relational DBMS as proposed by Codd distinguish a ‘true’
Relational DBMS from other DBMS’. Based on his earlier paper setting out his 12 rules
(Codd, 1985), they are summarised as follows:
• Database information is values only and ordering is not essential (meta-data, while
required, should not be of concern to the everyday user; pointers are not used).
• Data management is not dependent on position within the structure (contrast with the
Hierarchal and Network models).
• Duplicate rows are not allowed.
• Information should be capable of being moved without impact on the user.
• Three level architecture of the RDBMS – base relations, storage, views (derived
tables).
• Declarations of domains as extended data types.
• Column description should be akin to the domain it belongs to (i.e. a good naming
convention).
• Each base relation (R-Table) should have one and only one primary key column,
where null value entries are not allowed.
• RDBMS must allow one or more columns to be assigned as foreign keys.
• Relationships are based on comparing values from common domains.
This last point is crucial to understanding Codd’s intention. Only values from common
domains can be properly compared – currency with currency, euro with euro, date with date,
integer with integer etc. The basis for this lies with the nature of the mathematical operators
used in the system. Consistency of data types and strict rules are therefore vital for the
effective operation of the system. Herein lies one of the difficulties presented to designers of
commercial versions of Codd’s RDBMS. Users of data management systems are presented
with real world scenarios where consistency is not always practical. It would be ridiculous to
ask members of a social networking site to use standard forms for communicating so that the
DBMS could store the relevant information appropriately. Even closer to the relational
database world a transaction record could be created for a person called William Thomas as
follows:
Instance | Surname | Forename | Address | DOB | ID | Order No
1 | Thomas | William | 22, Greenview Street | 12/06/1945 | 1234 | 104
2 | Thomas | Bill | 22 Greenview St. | 12/06/1945 | 1365 | 104
3 | Thomas | William H. | 22, Greenview Street | 12/06/1945 | 3456 | 104
Table 2.2 – Example of redundant rows in a database
As can be seen in this simple example above, the database treats these as three distinct and
unique records, even though the intention is that only one record for this person should exist.
The result impacts on the size, processing speed and integrity of the system. Techniques to
address such problems (primarily data normalisation) were developed almost from the
beginning, in the early 1970’s by Codd and later by Raymond Boyce and Codd (Elmasri and
Navathe, 1989, p. 371). Database normalisation is beyond the scope of this dissertation,
however the salient point (and the reason for our initial hypothesis) is that the nature and
amount of unstructured data flowing in the electronic ether has pushed RDBMS and its
associated control and optimisation processes to the limits of their capabilities.
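A short Python/SQLite sketch (table and data invented for illustration) shows why such rows slip through: to the DBMS each row is simply a distinct set of values, so even a uniqueness constraint helps only where the values match exactly:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE customer (
        surname TEXT, forename TEXT, dob TEXT,
        UNIQUE (surname, forename, dob))""")
    con.execute("INSERT INTO customer VALUES ('Thomas', 'William', '1945-06-12')")

    try:  # an exact duplicate is caught by the constraint...
        con.execute("INSERT INTO customer VALUES ('Thomas', 'William', '1945-06-12')")
    except sqlite3.IntegrityError:
        print("exact duplicate rejected")

    # ...but 'Bill' passes: the system cannot know it is the same person.
    con.execute("INSERT INTO customer VALUES ('Thomas', 'Bill', '1945-06-12')")
    print(con.execute("SELECT COUNT(*) FROM customer").fetchone()[0])  # -> 2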
Debashish Ghosh of Anshin Software, while advocating the merits of non-relational models,
nevertheless puts it fairly:
“A relational data management system (RDBMS) engine is the right tool for handling
relational data used in transactions requiring atomicity, consistency, isolation, and
durability (ACID). However, an RDBMS isn’t an ideal platform for modelling
complicated social data networks that involve huge volumes, network partitioning,
and replication”. (Ghosh, 2010)
The above discussion is intended to provide an important distinction between Edgar Codd’s
original theory of a relational data management system and subsequent versions developed
for the commercial enterprise market (the mainframe computer market at that time). The
importance of the mathematical principles (relational algebra and calculus) behind Codd’s
ideas, and of the operations based upon those principles, should not be underestimated. They
are key to understanding why Codd persisted at the time in pushing for a full and true
implementation of his model, and they may also explain why he stepped back from the
first experiments in commercialising his ideas (Chamberlin and Blasgen in McJones, 1997,
p. 13). Brevity here forces us to move on to look at two of the earliest commercial versions of
RDBMS which, by no accident, are also the two market leaders today.
As an aside, Appendix 1 presents a useful comparison of the key terms from Codd’s original
intended meaning and their relationship to other systems.
2.2.3 IBM, Ellison and the University of California, Berkeley
IBM
One artefact cited several times in this section on the history of data management systems is a
transcript from a reunion meeting in 1995 of some of the original IBM research employees,
who during the 1970s and 1980s were at the coal face of data management development. The
article edited by Paul McJones is entitled “The 1995 SQL Reunion: People, Projects, and
Politics” (McJones, 1997). What at first seems like the convivial reminiscences of middle-aged
ex-IBM colleagues in fact turns out to be a rather more interesting illumination of the
context around the timelines for the development of some of the most important ideas to
emerge, as well as the historically important players and products from the realm of database
management. Some of the key people attending the reunion and contributing to the discussion
are: Donald Chamberlin, Jim Gray, Raymond Lorie, Gianfranco Putzolu, Patricia Selinger,
and Irving Traiger. All are IBM and ACM Fellows and award winners for their work. Jim
Gray, a fellow Berkeley graduate and mentor to Michael Stonebraker, was given the ACM
Turing Award in 1998 for his work on transaction processing (ACID) (Stonebraker, 2008).
Patricia Selinger was awarded the ACM Edgar Codd Innovation Award for her work in query
optimisation. Their contributions were vital to the features of the commercial RDBMS which
have ensured its longevity thus far, and possibly for many years yet.
IBM and System R
Midway through the 1970s IBM’s San Jose based research lab began working on a project
called System R. Like many IBM research projects at the time it came out of different task
groups working on related areas such as data language, data storage, optimisation, concurrent
users, and system recovery. System R was relational based and combined work from various
groups. System R as a commercial RDBMS was installed in the Pratt & Whitney Aircraft
Company in Hartford, Connecticut in 1977, where it was used for inventory control. However,
IBM was not yet interested in releasing it as a fully featured product. At that time the big IBM
cash cow was IMS (its mainframe hierarchal model DBMS mentioned earlier), and the
research focus was on a project called Eagle – a replacement for IMS with all the new
features of recent discoveries. With the pressure off, the System R developers plugged away,
aiming it towards the lower midrange product line (Jolls in McJones, 1997, p. 31). Two
things happened at the time which brought the focus back on System R and getting
it ready for market (McJones, 1997, pp. 33-34). Firstly, IBM was starting to lose ground to
new minicomputers (Gray in McJones, 1997, p. 20) and secondly the Eagle project was
hitting a wall. System R, unlike Eagle, was relational and already pitched towards the smaller
computer range. The System R star did not shine for long and it was replaced by DB2 in the
early 1980’s. IBM fully embraced the relational DBMS with DB2 Release 2 around 1985 (Miller
in McJones 1997, p. 43). DB2 is IBM’s current offering and is mentioned again under the
section on the RDBMS market.
The Birth of SQL
Around the same time that System R was being developed, the language research team
at IBM, Relational Data Systems (RDS), took on Codd’s two mathematically based languages
for data management, relational algebra and relational calculus. By their own admission they
found these mathematical notations too abstract and complex for general use. They developed
a notation which they called SQUARE (Specifying Queries as Relational Expressions)
(Chamberlin in McJones, 1997, p. 11).
SQUARE had some odd subscripts so a regular keyboard could not be used. RDS further
developed it to be closer to common English words. They called the new version Structured
English Query Language or SEQUEL. The intention was to make interaction with databases
easier for non-programmers. However, its biggest impact came later when Larry Ellison (co-founder
and CEO of Oracle) read the published IBM papers on SEQUEL and realised that
this query language could act as an intermediary between different systems (Chamberlin in
McJones, 1997, p. 15). It was the RDS team at IBM who renamed it SQL following a
trademark challenge to the term SEQUEL from an aircraft company (McJones, 1997, p. 20).
INGRES
In parallel with the work going on at IBM, the University of California at Berkeley had a
project developing a system called INGRES (short for Interactive Graphics Retrieval
System). Michael Stonebraker, who was at Berkeley in 1972, was developing a query language
called QUEL. Stonebraker knew fellow Berkeley graduates at IBM San Jose and, more
importantly, knew of their work. INGRES used QUEL, whereas IBM and Larry Ellison’s
project at Software Development Laboratories (later Oracle) used SQL. Subsequent offspring
of the INGRES family are Sybase and Postgres (post-Ingres). Incidentally, Microsoft
struck a deal with Sybase to use their code for their new extended operating system.
Although the Sybase people had been brought up in the QUEL tradition under Stonebraker,
Microsoft preferred SQL. They eventually fell out, and Microsoft, who now owned the Sybase
code, ended up developing Microsoft SQL Server (Gray in McJones, 1997, p. 56).
Oracle
In 1977 Larry Ellison, Bob Miner and Ed Oates founded Software Development Laboratories
(SDL), the precursor to Oracle Corporation. SDL based its system on a technical paper in an
IBM journal (Oracle History, 2011). That was Edgar Codd’s 1970 seminal paper setting out
his model for a RDBMS (Traiger in McJones, 1997). SDL’s first contract was to
develop a database management system for the Central Intelligence Agency (CIA) - the
project was called ‘Oracle’. SDL finished that project a year early and used the time to
develop a commercial RDBMS, putting together the work done by IBM research on relational
databases and, as mentioned above, on the query language SEQUEL. While Ellison and SDL
benefited from the work done at IBM, they still had to do
all the coding. The resulting product was faster and a lot smaller than IBM’s System R. The
first officially released version of Oracle was version 2 in 1979.
Brad Wade jokes about Edgar Codd’s influence on Oracle, on Codd being made an IBM
Fellow in 1976: “It’s the first time that I recall of someone being made an IBM Fellow for
someone else’s product” (Wade in McJones, 1997, p. 49).
It appears that many new enterprises sprang from the well of knowledge existing at IBM
during the 1970’s and 1980’s. Had the IBM research units not had so much talent, or not
allowed publication of key papers at the time, the database world might look very different
today. Patents on software were prohibited by IBM and also, in fact, by US Supreme Court
rulings until 1980 (Bocchino, 1995). According to Franco Putzolu, IBM Research at that time and up
until 1979 were “publishing everything that would come to mind” (in McJones, 1997, p. 16).
Mike Blasgen argues that the outside interest in the published research was one reason why
the corporate machine of IBM began to notice some of the lesser research projects (in
McJones, 1997 p. 16).
It is hoped that the above overview gives the reader some understanding of the related threads
that developed out of Charles Bachman’s initial work on data management systems, through
IBM via Edgar Codd and out into the wide world via IBM research department’s open
attitude to sharing knowledge, from which Larry Ellison’s Oracle benefited greatly. Berkeley
also played its role in providing a common alma mater for young enthusiastic developers
to discuss ideas. It is an interesting irony that when we think of ‘open source’ we envision a
recent phenomenon; however, IBM during the 1970’s would appear to have been a little
more open, for whatever reasons, than is usually credited to them.
2.3 New Databases
This section will explore the new DB’s that have emerged over the past decade, and the
impact these DB’s will have on the database market as a whole.
What are ‘New DB’s’?
Traditional databases rely on a relational model in order to function. That is, they follow a set
of rigid rules to ensure the integrity of the data in the database. Most RDBMS models follow
the set of rules originally outlined by Edgar Codd (1970).
New NoSQL database models don’t follow all of the rules set down by Codd. While
RDBMS’ models follow the set of properties called ACID, as previously stated, NoSQL
database models do not. They follow other sets of database properties, including BASE
(Basically Available, Soft state, Eventual consistency) (Cattell, 2011) and CAP (Consistency,
Availability and Partition tolerance).
Why the development of NoSQL model databases?
The development of NoSQL databases was a result of the evolution of the World Wide Web
and the desire of individuals and organizations to generate data, large amounts of it
(White, 2010, p. 2). Having collected that data, organizations then had to extract value from it
in order to be successful in whatever field they participated.
The problems organizations faced in extracting value from that data were twofold:
1. As storage capacities increased, the means of transferring data to and from the drive(s)
did not keep up. Twenty years ago a hard drive could store 1.3 GB of data, and the
entirety of that data could be read at 4.4 MB per second, in about five minutes. Today
1 TB hard drives are the norm, but transfer speeds are only about 100 MB per second,
so reading a full drive now takes roughly thirty times longer than it did then (White,
2010, p. 3); a short sketch after this list reproduces the arithmetic.
A means of getting around this bottleneck was the introduction of disk arrays,
whereby data could be written to and read from multiple disks in parallel. The drawback to
this was the possibility of hardware failure, whereby a disk or machine would fail and
the data be lost (White, 2010, p. 3). Redundancy (the various options of RAID being the
most famous examples) solved some of these problems, but not all (Patterson, 1988).
2. The second problem is that relational database models, with their inbuilt consistency
requirements, are unable to access data quickly enough when the data is spread across
multiple disk drives. RDBMS systems may not be able to allow a query to access
certain data if that data is already in use by another program or user
(Chamberlin, 1976).
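The back-of-the-envelope Python sketch below reproduces the arithmetic of the first problem, using the figures cited from White (2010), and shows how reading from many disks in parallel restores the balance:

    def full_read_minutes(capacity_mb, rate_mb_per_s):
        return capacity_mb / rate_mb_per_s / 60

    print(full_read_minutes(1_300, 4.4))        # older drive: about 5 minutes
    print(full_read_minutes(1_000_000, 100.0))  # 1 TB drive: about 167 minutes

    # Spread the same terabyte over 100 disks read in parallel - at the cost of
    # having to tolerate the failure of any one of them:
    disks = 100
    print(full_read_minutes(1_000_000 / disks, 100.0))  # under 2 minutes per disk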
2.3.1 Features of NoSQL Databases
In order for a database to be considered a NoSQL database, it first must not comply with the
entirety of the ACID properties. Amongst the features that define NoSQL databases are
Scalability, Eventual Consistency and Low Latency (Dimitrov, 2010). A key feature of
NoSQL databases is a “shared-nothing” architecture. This means databases can replicate and
partition data across multiple servers. In turn, this allows the databases to support a large
number of simple read/write operations per second (Cattell, 2011).
Scalability
With traditional RDBMS systems, a database was usually required to scale up, that is, to switch
over to a newer, larger capacity machine, if the database was to expand capacity (Cattell, 2011).
One of the features designed into some NoSQL databases is their ability to scale to large data
volumes without losing the integrity of the data. With NoSQL, as systems are required to
expand with an influx of additional data, they scale out by adding more machines to the data
What is the future of the RDBMS in the Enterprise?
Page 21
cluster. With this scaling, NoSQL systems can process data at a faster speed than RDBMS, as
they are capable of spreading the workload of the processing over numerous machines
(Cattell, 2011).
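To make the scale-out idea concrete, the following minimal Python sketch (node names and key format are hypothetical, not from any particular product) shows how hash partitioning spreads keys over a cluster, so capacity is added by adding machines rather than by buying a bigger one:

    import hashlib

    def node_for(key, nodes):
        # Hash the key and map it onto one of the available nodes.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return nodes[int(digest, 16) % len(nodes)]

    nodes = ["node-a", "node-b", "node-c"]       # hypothetical three-machine cluster
    print(node_for("customer:1042", nodes))      # the key lives on exactly one node

    nodes.append("node-d")                       # scaling out: add a machine
    print(node_for("customer:1042", nodes))      # keys are re-spread over four nodes

Production systems in the Dynamo tradition typically use consistent hashing rather than the simple modulo above, so that adding a node moves only a small fraction of the keys.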
Eventual Consistency
Eventual Consistency was pioneered by Amazon using the Dynamo database. The purpose of
its introduction was to ensure High Availability (HA) and scalability of the data. Ultimately,
data that is fetched for a query is not guaranteed to be up-to-date, but all updates to the data
are guaranteed to be propagated to all copies of the data on all nodes of the cluster eventually
(Cattell, 2011).
This ensures that databases remain accessible to programs and individuals who wish to read or
modify data, without the constraint of being locked out of a database or data field while the
data is being updated or read, as is the case with RDBMS models.
Low Latency
Latency is an element of the speed of a network. It refers to any number of delays that
typically occur in the processing of data (Mitchell, no date). In the case of NoSQL databases,
it means that queries can access the data and return answers more quickly than RDBMS
because the data is distributed across multiple nodes of a cluster, instead of one machine.
This results in a faster response time. Causes of high latency in traditional RDBMS model
databases include the seek time of hard disks (Mitchell, no date), the speed of the network
links between machines, and poorly written queries (Stevens, 2004; Souders, 2009).
NoSQL database models
Unlike the relational model, NoSQL data models vary from system to system. For storage
purposes, NoSQL databases fall into a number of data model categories, which are listed below:
Key-value Stores
Databases that have this model use a single key-value index for all the data. These systems
provide persistence mechanisms as well as additional functions such as replication, locking,
transactions and sorting. NoSQL databases such as Voldemort and Riak use Multi-Version
Concurrency Control (MVCC) for updates. They update data asynchronously, so they cannot
guarantee consistent data (Cattell, 2011).
Key-value store databases can support traditional SQL-style functionality, such as insert,
delete and lookup operations (Cattell, 2011).
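As an illustration only (a toy, in-memory Python sketch; real key-value stores such as Voldemort or Riak add persistence, replication and MVCC versioning on top), those operations amount to a single index over all the data:

    class KeyValueStore:
        """Toy key-value store: one key-value index over all the data."""

        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value          # insert or overwrite

        def get(self, key):
            return self._data.get(key)       # lookup is by key only

        def delete(self, key):
            self._data.pop(key, None)

    store = KeyValueStore()
    store.put("user:17", {"name": "Ada", "city": "Dublin"})
    print(store.get("user:17"))              # {'name': 'Ada', 'city': 'Dublin'}
    store.delete("user:17")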
Document Stores
This model supports more complex data than key-value stores. Document stores can support
secondary indexes and multiple types of documents per database. Databases using this model
include Amazon’s SimpleDB and CouchDB.
Document Store databases provide a querying mechanism for the data they contain using
multiple attribute values and constraints (Cattell, 2011).
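A hedged Python sketch of that querying mechanism (the document fields below are invented for illustration): each document is a bag of attributes, and a query filters on several attribute values at once:

    documents = [
        {"type": "order",   "customer": "Ada",   "total": 120},
        {"type": "order",   "customer": "Grace", "total": 80},
        {"type": "invoice", "customer": "Ada",   "total": 120},
    ]

    def query(docs, **constraints):
        # Return every document satisfying all attribute constraints.
        return [d for d in docs
                if all(d.get(k) == v for k, v in constraints.items())]

    print(query(documents, type="order", customer="Ada"))
    # [{'type': 'order', 'customer': 'Ada', 'total': 120}]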
Extensible Record Stores
Influenced by Google’s Bigtable, Extensible Record Store databases consist of rows and
columns, which are scaled across multiple nodes. Rows are split across nodes by ‘sharding’
the primary key. This means that querying a range of values does not have to go to every
node. Columns are distributed over multiple nodes by using ‘column groups’. These allow the
database customer to specify which columns are best stored together, with the added
advantage that such groups can be queried faster, as the data most relevant to a query is
likely close at hand: e.g., name and address (Cattell, 2011).
The most famous examples of an Extensible Record Store database, apart from Google’s
proprietary Bigtable, are HBase and Cassandra. Additional databases that use the model are
Hypertable, sponsored by Baidu (Hypertable, 2011), and PNUTS (Yahoo Research, 2011).
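A minimal sketch of the row-sharding idea (the shard boundaries and node names here are hypothetical): rows are range-partitioned on the primary key, so a query over a key range only has to visit the nodes that own that range:

    import bisect

    boundaries = ["g", "p"]       # node-0 owns keys < "g"; node-1 keys < "p"; node-2 the rest
    nodes = ["node-0", "node-1", "node-2"]

    def node_for_row(primary_key):
        # bisect finds which key range, and therefore which node, owns the row.
        return nodes[bisect.bisect_right(boundaries, primary_key)]

    print(node_for_row("adams"))             # node-0
    print(node_for_row("miller"))            # node-1
    # A scan over keys "a".."f" touches only node-0, not the whole cluster.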
Graph Databases
A graph database maintains one single structure – a graph (Rodriguez, 2010). A graph is a
flexible data structure that allows for a more agile and rapid style of development (Neo4J,
2011).
A graph database has three main attributes:
1. Node – the entity in which a data item is stored
2. Relationship – a label attached to the data item, determining which data in the same
or another node the original data is related to
3. Property – an attribute of the data. (Neubauer, 2010)
The purpose of graph databases is to quickly determine the relationships between different
items of data. Examples of graph databases include the Neo4j database and Twitter’s
FlockDB, which is used to join up the tweets between those who post them and all of their
followers (Weil, 2010).
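The idea can be sketched in a few lines of Python (names and relationships invented, loosely following the FlockDB follower example): the database is a set of (node, relationship, node) edges plus per-node properties, and queries walk the relationships:

    # Edges: (source node, relationship, target node); properties are kept per node.
    edges = [
        ("alice", "FOLLOWS", "bob"),
        ("carol", "FOLLOWS", "bob"),
        ("bob",   "POSTED",  "tweet-42"),
    ]
    properties = {"tweet-42": {"text": "hello"}}

    def related(node, relationship):
        # All nodes pointing at `node` via `relationship`, e.g. a user's followers.
        return [src for (src, rel, dst) in edges
                if rel == relationship and dst == node]

    print(related("bob", "FOLLOWS"))         # ['alice', 'carol']
    print(properties["tweet-42"])            # the Property attribute: {'text': 'hello'}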
2.3.2 Hadoop
Hadoop/MapReduce
Hadoop is a distributed data storage and processing framework originally developed by Doug
Cutting at Yahoo (White, 2010, p. 9), modelled on Google’s proprietary infrastructure,
including the Google File System and MapReduce (Apache, 2011).
Throughout its short history, developers have added components that allow Hadoop to
process the data that it collects more efficiently.
Hadoop contains a number of components that allow the system to scale to large clusters of
machines, without impacting the overall integrity of the data stored on those machines. The
main component of Hadoop is MapReduce.
MapReduce is a framework for processing large datasets that are distributed across multiple
nodes/servers. The ‘map’ part of the framework takes the original input data and partitions
it, distributing the partitions to different nodes. The individual nodes can then, if
necessary, redistribute the data again to other sub-nodes. MapReduce then applies the map
function in parallel to every item in the dataset, producing a list of key-value pairs
(White, 2010, p. 19). The ‘reduce’ part of the framework then collects all pairs that share a
common key, combines their values, and returns a single output per key. The reduce
function, in effect, removes duplication within the system, allowing queries to return results
more speedily (White, 2010, p. 19).
Hadoop is designed for distributed data, with a dataset split between multiple nodes, if
necessary. If MapReduce must query data that is located on multiple nodes, then the map
function will map all the data for the query that is located on a single node, and return the
result. It will do the same query on all nodes that the relevant data is located on. The reduce
function will then take all those map results and reduce them down to single values, again to
return the query result(s) (White, 2010, p. 31).
Both functions are oblivious to the size of the dataset that they are working on. As such, they
can remain the same irrespective of whether the dataset is large or small. Additionally, if you
double the input data, a job will take twice as long; however, if you also double the size of the
cluster, the job will run as fast as the original one (White, 2010, p. 6).
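The canonical illustration of these two functions is a word count, sketched below in single-process Python (in Hadoop the map calls would run in parallel on different nodes, with the framework grouping the pairs by key in between):

    from collections import defaultdict

    def map_fn(document):
        # Map: turn one input item into a list of (key, value) pairs.
        return [(word, 1) for word in document.split()]

    def reduce_fn(key, values):
        # Reduce: collapse all values seen for one key into a single output.
        return key, sum(values)

    inputs = ["big data big clusters", "big data"]

    grouped = defaultdict(list)
    for doc in inputs:                       # the map phase
        for key, value in map_fn(doc):
            grouped[key].append(value)       # the 'shuffle': group pairs by key

    print(sorted(reduce_fn(k, v) for k, v in grouped.items()))
    # [('big', 3), ('clusters', 1), ('data', 2)]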
HDFS
HDFS is the file system that allows Hadoop to distribute data across multiple
nodes/machines. HDFS stores data in blocks, in similar fashion to other file systems. However,
while other file systems use small blocks, HDFS by default uses large (64 MB) blocks. This
is to reduce the number of seeks that Hadoop must make in order to answer a query, speeding
up the process (White, 2010, p. 43).
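A back-of-the-envelope calculation (assumed figures: a 10 ms disk seek and a 100 MB/s transfer rate, in line with the drive speeds quoted earlier) shows why large blocks help; the bigger the block, the smaller the share of time lost to seeking:

    seek_time = 0.010                        # seconds per seek (assumed)
    transfer_rate = 100e6                    # bytes per second (assumed, ~100 MB/s)

    for block in (4, 64, 128):               # block sizes in MB
        transfer = (block * 1024**2) / transfer_rate
        overhead = seek_time / (seek_time + transfer)
        print(f"{block:>4} MB block: {overhead:.1%} of time spent seeking")

    #    4 MB block: 19.3% of time spent seeking
    #   64 MB block: 1.5% of time spent seeking
    #  128 MB block: 0.7% of time spent seeking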
2.3.2.1 Components of Hadoop
HBase
Based on Google’s Bigtable, HBase was developed by Chad Walters and Jim Kellerman at
Powerset. The purpose of the development of HBase was to give Hadoop a means of storing
large quantities of fault-tolerant data. It can also sit on top of Amazon’s Simple Storage
Service (S3) (Wilson, 2009). HBase was developed from the ground up to allow databases to
scale just by adding more nodes – machines – to the cluster that HBase/Hadoop is installed
on. As it does not support SQL, it can do what an RDBMS cannot: host data in
sparsely populated tables located on clusters made from commodity hardware (White, 2010,
p. 411). The structure of HBase is designed with a ‘master node’, which has control of any
number of ‘slave nodes’, called Region Servers. The master node is responsible for assigning
regions of the data to the region servers, as well as being responsible for the recovery of data
in the event of a region server failing (White, 2010, p. 413). In addition to this setup, HBase
is designed with fault tolerance built in – HBase, thanks to HDFS, creates three different
copies of the data spread across different data nodes (Dimitrov, 2010).
Hive
Hive is a scalable data processing platform developed by Jeff Hammerbacher at Facebook
(White, 2010, p. 365). The purpose of Hive is to allow individuals who have strong SQL
skills to run queries on data that is stored in HDFS.
When querying the dataset, Hive converts SQL queries into MapReduce jobs, together with
custom commands that allow it to target different partitions within the HDFS dataset,
allowing users to query specific data within the Hadoop cluster (White, 2010, p. 514). This
allows Hive to provide users with the traditional query model of older RDBMS
environments within the newer distributed NoSQL database environments.
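The idea can be sketched as follows (an illustrative Python reduction only; Hive’s real query planner is far more involved, and the table and column names here are invented). An SQL aggregate such as SELECT page, COUNT(*) FROM hits GROUP BY page maps naturally onto map and reduce steps:

    from collections import Counter

    hits = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]

    def map_fn(row):
        return (row["page"], 1)              # emit (GROUP BY key, count of 1)

    counts = Counter()
    for row in hits:                         # map over every stored row
        key, value = map_fn(row)
        counts[key] += value                 # reduce: sum the counts per key

    print(dict(counts))                      # {'/home': 2, '/about': 1}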
2.3.3 Cassandra
Cassandra is a fault-tolerant, decentralised database that can be scaled and distributed across
multiple nodes (Apache, 2011; Lakshman, 2008). Developed by Avinash Lakshman at
Facebook (Lakshman, 2008), Cassandra is now an open source project run by the Apache
Foundation (Apache, 2011).
Initially designed to solve a search indexing problem, Cassandra was designed to scale to
very large sizes across multiple commodity servers. Additionally, the ability to have no single
point of failure was built into the system (Lakshman, 2008). Since Cassandra was designed to
scale across multiple servers, it had to overcome the possibility of failure at any given
location within each server, such as the possibility of a drive failure.
To guard against such a possibility, Cassandra was developed with the following functions:
Replication
Cassandra replicates data across different nodes when it is written to. When data is
requested, the system accesses the closest node that contains the data. This ensures
that data stored using Cassandra maintains High Availability (HA), one of the core
attributes of a NoSQL database. Once data is written to a server, a duplicate copy of
the data is then written to another node within the database (Lakshman, 2008).
Eventual Consistency
Cassandra uses BASE to determine the consistency of the database. Data remains
accessible to users: one individual can be reading the data on one node while, at the same
time, another individual is making changes to another copy of the data on a different node.
As the data is replicated, newer versions of the data sit on one node while older versions are
still active on other nodes (Apache wiki, 2011).
Users of Cassandra can also choose the level of consistency, allowing writes to add
or edit data on a single copy of the data in one node or, if required, on all copies
of the data across all nodes (Apache wiki, 2011).
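A hedged sketch of this tunable consistency (illustrative only; this is not Cassandra’s actual client API): a write acknowledged after one replica is fast but leaves stale copies to converge later, while a write to all replicas is consistent but slower:

    replicas = [{}, {}, {}]                  # three copies of the data, one per node

    def write(key, value, consistency="ONE"):
        if consistency == "ALL":
            targets = replicas               # wait for every copy: consistent, slower
        else:
            targets = replicas[:1]           # acknowledge after one copy: fast;
                                             # the other replicas converge later
        for replica in targets:
            replica[key] = value

    write("meter:7", 42, consistency="ONE")
    print([r.get("meter:7") for r in replicas])   # [42, None, None] until propagation

    write("meter:7", 43, consistency="ALL")
    print([r.get("meter:7") for r in replicas])   # [43, 43, 43]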
Scalability
Data stored in Cassandra is scalable across multiple machines. Such elasticity
is possible because Cassandra allows additional machines to be added to the cluster
when required (Apache, 2011).
2.4 The market for RDBMS’ and Non-Relational DBMS’
2.4.1 Introduction
This section gives an overview of the current market for both relational databases and
newer non-relational databases. It investigates traditional vendor database offerings as well
as the proliferation, over the past few years, of a number of community-developed open
source database offerings.
The literature review for determining the current market for both traditional relational
databases and ‘future’ non-relational databases utilised a variety of sources, including
Internet search queries to find relevant research material, as well as the University of
Dublin (DU) library facilities to access academic and commercial research to which DU
has access.
2.4.2 RDBMS Market
Today, many executives want business growth to be based on data-driven decisions. As such,
data analytics has become a valuable tool in Business Intelligence (BI). Many of the top-
performing companies use analytics to formulate future strategies and to guide the
implementation of day-to-day operations (LaValle et al, 2010). However, organisations are
gaining more and more data without the means of extracting value from it (LaValle et al,
2010). This has resulted in a requirement for companies to adopt enterprise solutions that
can give an overview of the data being generated, using Online Analytical Processing
(OLAP) databases.
The Database Management Systems market is split into two segments: OnLine Transaction
Processing (OLTP) and OLAP / Data Warehousing (DW). The RDBMS options available
from vendors in the market will generally target one of these two segments.
The OLTP market targets clients that require fast query processing and the maintenance of
data integrity in multi-access environments, and whose business model is measured by the
number of transactions per second that the database can handle. In an OLTP model database,
there is an emphasis on detailed, current data, with the schema used to store the data
normalised to the entity model BCNF (Datawarehouse4u, 2009).
OLAP databases are characterised by a low volume of transactions and are primarily
designed for data warehousing. As such, they are particularly useful for data mining,
whereby applications access the data to give an overview of current trends and business
performance, and to provide an informational advantage. OLAP databases are therefore
increasingly seen as important for making Business Intelligence (BI) decisions (Feinberg,
Beyer, 2010).
2.4.2.1 Vendor Offerings
Within the enterprise database market, the industry is dominated by a few big corporations
which include Oracle, IBM, Microsoft, Sybase and Teradata. Many of the database offerings
from these firms operate in the Data Warehousing sector, which contains most of the market
for enterprise database management systems. While the big players will have comprehensive
database offerings for their clients, the market is currently being disrupted by new entrants
who are targeting niche areas, either focusing on performance issues related to their
offerings or on single-point offerings (Feinberg, Beyer, 2010).
Oracle
According to Gartner, Oracle is currently the No. 1 vendor of RDBMS’ worldwide (Gartner
in Graham et al, 2010), with a 50% share of the market for the year 2010 (Trefis, 2011). They
are forecast to improve this figure to 60% by 2016, driven by their sales of the Exadata
hardware platform. Leveraging the high-end Exadata servers in conjunction with
Oracle’s database software is expected to result in more efficient and faster Online
Transaction Processing (Graham et al, 2010).
Currently, Oracle generates 86% of revenues from its database software portfolio, with 8%
from its hardware portfolio. The future strategy of the company is to have clients purchase
complete systems – hardware and software – thus leveraging the power of the Exadata system
to get the most out of Oracle’s database technology. The result will be an increase in Oracle’s
revenues and its market share (Crane et al, 2011).
IBM
IBM is one of the main vendors in the market, and is the only vendor that offers its clients
an Information Architecture (IA) spanning all systems, including OLTP, DW, and the
retirement of data (Optim tapes) (Henschen, 2011a). IBM’s main offering in the RDBMS
market is the DB2 database. DB2 runs on a number of platforms, including Unix, Linux and
Windows OS. DB2 can also run on the z/OS platform, where it is used to deploy applications
for SOA, CRM, DW and operational BI.
IBM’s RDBMS solutions are ranked No. 2 behind Oracle worldwide (Finkle, 2008); however,
they are slowly losing market share to Microsoft and Oracle, due both to uncompetitive
pricing for their database and to the greater functionality found in rival offerings.
Recently, IBM acquired Netezza (Evans, 2011), a company that provides a DW appliance
called TwinFin to clients. TwinFin is a purpose-built appliance that integrates servers, storage
and database into a single managed system (Netezza, 2011a). The reason IBM acquired
Netezza is the expected increase in revenues that Netezza will generate from its portfolio
(Dignan, 2010), as well as a lack of overlap in the customer base between IBM’s current
client list and that of Netezza (Henschen, 2011b). Additionally, the acquisition fits in with
IBM’s overall business analytics strategy, as IBM has marked BI as the key driver for IT
infrastructure needs (Gartner, 2010).
Microsoft
SQL Server from Microsoft is a complete database platform designed for applications of
various sizes. It can be deployed on normal servers as well as in the ‘cloud’, allowing
clients to scale SQL Server to their respective needs. Purely a software player, Microsoft
requires hardware partners to deploy its database offerings (Mackie, 2011).
Microsoft, however, finds itself under greater threat from low-cost or ‘free’ open source
alternatives such as MySQL and PostgreSQL, as it operates primarily in the low-end and
mid-market segments (Finkle, 2008). If its clients start looking at alternative options, SQL
Server may not be priced competitively enough to hold them against open source RDBMS
offerings.
SAP/Sybase
Sybase, recently acquired by SAP, has three main business areas: OLTP using the Sybase
ASE database, Analytic Technology using Sybase IQ, and, interestingly, Mobile Technology
(Monash, 2010). This deal was required by SAP as it was coming under increasing pressure
due to Oracle’s recent acquisition of Sun Microsystems, which gave Oracle a stronger focus
on integrated products based around databases, middleware and applications (Yuhanna,
2010).
The deal between SAP and Sybase gives both companies a lot of synergies: SAP finally
acquires an enterprise-class database in the form of Sybase IQ, and can now offer its
hundreds of client companies a database with a columnar store and advanced compression
capabilities (Yuhanna, 2010).
The acquisition of Sybase also gives SAP a differentiator from its peers in the form of a
mobile offering. Sybase has a number of mobile products for enterprises, including the
Sybase Unwired Platform and iAnywhere Mobile Office suite. These technologies allow
companies to connect mobile devices to a number of back-end data sources (Sybase, 2011).
SAP now has the ability to offer its applications embedded in Sybase mobile platforms, using
the synergy between the two to improve its competitive advantage and expand to other
markets (Yuhanna, 2010). Indeed, efforts are now being made to cement Sybase’s lead in this
segment of the market, with an initiative to make the Android OS platform enterprise ready.
This involves porting Afaria, Sybase’s mobile device management and security solution, to
the Android platform (Neil, 2011). With the growth of Android now reaching 30% of the
smartphone market share in the United States (Warren, 2011), the future growth for Sybase in
the mobile enterprise market looks strong.
Finally, although big in the database market in the early 1990s (Greenbaum, 2010), Sybase
has been considered the fourth database vendor, behind Oracle, Microsoft and IBM, for the
past decade. The main market for Sybase’s OLTP offering, Sybase ASE, has been the
financial services sector, with little penetration into other enterprise sectors. It is expected
that SAP will make Sybase ASE more cost effective and make another push in this segment
of the market, perhaps at the expense of the big three (Yuhanna, 2010).
Teradata
Teradata is a database vendor specialising in data warehousing and analytical applications
(Prickett Morgan, 2010). During the last year, it was considered the best placed amongst its
peers as a market leader in Data Warehousing (Feinberg, Beyer, 2011). This will be a hard
position for competitors to dislodge as products in the DW market are considered difficult to
replace (Bylund, 2011). Amongst its clients are multinational corporations such as 3M and
PayPal (Teradata, 2011).
One of Teradata’s products, the Teradata parallel database, designed for DW and OLAP
functions, has an update and support revenue stream, as well as additional functions that
customers are willing to pay for (Prickett Morgan, 2010).
However, Teradata specialises in a single area of the database market – DW and analytics
(Prickett Morgan, 2010). As such it is exposed to any weakness that may occur within that
segment of the market. The company recently acquired Aprimo, an enterprise marketing
firm with a strong emphasis on Marketing Resource Management (MRM) and Campaign
Management (CM). CM is considered by some to be mission critical, as it allows marketers to
unlock the value of customer data to develop multi-channel communications. Such an
acquisition adds value to Teradata’s product portfolio, without competing with Teradata’s
current product range, allowing the company to diversify its offerings to clients and future
customers (Vittal, 2010).
EMC/Greenplum
Greenplum, a DW and Analytics firm acquired by EMC in 2010, is the foundation of EMC’s
Data Computing division. Greenplum specialises in DW in the ‘cloud’, through its Chorus
platform (Greenplum, 2011).
EMC’s strategy for gaining market share is to release a free community version of its
database for testing, with the intent that users eventually purchase a commercial licence. Its
recently released ‘free’ Community Edition database, a heavily customised version of
PostgreSQL, is targeted at companies and developers for whom Greenplum’s previous
offering was not suitable for creating parallel databases for DW and Analytics (Prickett
Morgan, 2011). The purpose of the release is to allow developers to build and test Massively
Parallel Processing (MPP) databases. If clients who develop these systems wish to use the
software in a commercial environment, they will be required to purchase a licence for the
Greenplum Grade 4.0 database, EMC’s commercial DW offering (Kanaracus, 2011).
It is hoped by EMC that customers wishing to have greater functionality with Greenplum’s
database will upgrade to the Greenplum Grade 4.0 database (Kanaracus, 2011).
2.4.3 Non-RDBMS Market
Open Source Databases
There are a number of open source, community-developed database solutions available on the
market today. However, because these offerings are generally ‘free’, they do not rank highly
when databases in use are listed by revenues earned, even though total deployments of open
source databases can rival the total number of deployments from traditional vendors (Von
Finck, 2009).
All RDBMS applications hold a consistency model that can be inflexible for certain
applications. The requirement for a record or table to be locked out from being viewed or
otherwise accessed while changes are being made slows down queries that are attempting to
generate results for end-users.
Additionally, due to atomicity and consistency, not all RDBMS applications are scalable to
the requirements of organisations that hold large quantities of data, such as Google and
Facebook.
With databases now employed that have tables in excess of 10 TB, querying all that data
requires speed and processing power that traditional RDBMS offerings cannot deliver to the
standard user companies require. Newer non-relational database offerings designed to meet
these new requirements usually come in two forms: MPP systems and column-store
databases (Henschen, 2010).
With the introduction of the Bigtable Distributed Storage System on top of the Google File
System (GFS) in 2006 (Chang, et al, 2006), Google has demonstrated that non-relational
databases can be scalable over multiple machines. Due to Bigtable’s proprietary nature
however, efforts have been made over the past five years to develop open source versions of
Google’s software, resulting in the arrival of the Apache Foundation’s Hadoop, initially
developed by Yahoo (Bryant and Kwan, 2008). A number of companies have now utilised
Hadoop and associated software to allow themselves to scale their database offerings to their
own requirements.
The growth of Hadoop can be inferred from unusual avenues. From 2007 through to early-mid
2009, demand for expertise in Hadoop or MapReduce within the London area accounted for
0.4% of the IT jobs market. By January 2011 the figure had grown to 1.2%, a threefold
increase in the requirement for this expertise within two years (IT Jobs Watch, 2011).
Additionally, there was a 49% increase in Hadoop job postings in the United States from
2008 to 2009, with most of the job offerings being in California (Lorica, 2009).
However, a lack of suitably qualified engineers for Hadoop and HBase within the industry at
present has affected development projects at a number of companies. Within Silicon Valley,
Google and Facebook are two companies that can afford to remunerate staff competitively
thanks to their large revenues. This has left Cloudera, the start-up cloud database company,
unable to offer top engineers remuneration at similar levels to its competitors. Cloudera has
instead had to be imaginative with its remuneration of staff, including setting up offices in
downtown San Francisco in the expectation that staff would prefer to work in that location
rather than in Palo Alto or Mountain View, both 30 miles from the centre of San Francisco
(Metz, 2011a).
Such constraints will result in a lack of projects for new NoSQL databases until an adequate
supply of qualified engineers becomes available, slowing the development and adoption of
this new technology for the foreseeable future.
Cassandra
Cassandra is a distributed, column family database, developed at Facebook to solve an Inbox
Search problem (Lakshman, 2008). It is now an open source project from the Apache
Foundation (Apache, 2011).
In addition to Facebook, users of the Cassandra database include the social news
website Digg (Higginbotham, 2010), which decided to switch from MySQL to Cassandra due
to scalability issues with MySQL. The rationale behind the move was the decentralised nature
of Cassandra and the fact that it has no single point of failure (Kerner, 2010). Unfortunately,
the changeover to Cassandra did not run smoothly, resulting in Digg having to revert to
MySQL to ensure data integrity and to keep its services available to its clients. The
episode highlighted the pitfalls of switching from one architecture framework to another
(Woods, 2010).
Taking advantage of Cassandra’s introduction to the market is DataStax, formerly Riptano
(DBMS2, 2011), a start-up founded by the Cassandra project’s chair, Jonathan Ellis. The
purpose of DataStax is to take commercial advantage of Cassandra by selling expertise and
technical support in Cassandra (Kerner, 2010), following the examples of Red Hat (Linux)
and Cloudera (cloud computing) (Subramanian, 2010).
HBase
HBase is a non-relational database built on top of the Hadoop framework, using the Hadoop
Distributed File System (HDFS). Originally developed out of a need to process large amounts
of data, HBase is now a top-level Apache Foundation project (Zawodny, 2007).
Due to HBase’s ability to scale to large sizes, the database has received attention within IT as
a platform that can meet various companies’ requirements. Recent corporate announcements
about deployments of HBase have increased its marketplace viability as a NoSQL database
option (Metz, 2011b). These announcements include both Facebook and Yahoo, two
companies with large repositories of data.
Facebook announced a new messaging platform in which email, text messages and Instant
Messages (IM), as well as Facebook’s own messaging system, would be integrated together
(Metz, 2010). Facebook experimented with a number of database offerings, including its own
Cassandra database, to see if they could handle the new system, and excluded MySQL due to
scalability issues. Eventually it chose HBase for its consistency, as well as its ability to scale
across multiple machines (Muthukkaruppan, 2010).
Yahoo deployed HBase to support its news aggregation algorithm. The purpose of the
system is to data-mine content in order to optimise what the viewer sees on Yahoo’s web
portal. To place on the front page the most relevant news stories at any given moment,
Yahoo required a database that could query in real time which items people are most
interested in, based on the number of clicks each story receives. Deployment of this new
system has resulted in an increase in traffic to the Yahoo web portal, and subsequently
in an increase in revenues (Metz, 2008).
2.5 Case Studies
2.5.1 Case Study 1- Utility companies and the data management challenge
Introduction
Utility companies are known to be one of the most conservative of enterprises when it comes
to investing in technology (Fink, 2010; Fehrenbacher, 2010). There are many reasons why
this might be so: security of supply, regulatory compliance and financial austerity, together
with a lack of business drivers, often leave the risk-averse utility treading water when it
comes to IT investment (Tony Giroti, CEO Bridge Energy, 2011). However, things have been
changing over the last few years. According to recent research by Lux, utilities (mainly
power and water) will invest up to $34 billion in technology by the year 2020 (St. John,
2011). The main reason is Smart Grid projects and the growing avalanche of
associated data which utilities will need to manage (St. John, 2011). For utilities, the business
drivers required to justify investment in the kind of technology which enables integration of
data across key business units have only recently emerged. Real-time applications just
weren’t necessary before now (Giroti, 2011).
Utilities
History has shown that utilities are by and large reactive when it comes to new ideas. For
example, a snapshot of energy-utility-related articles in the ProQuest database (available
through the TCD Library’s online resources) at various times over the last few decades shows
flurries of activity around key moments of change in the industry. Cyclical changes from
regulation to de-regulation of the energy sector in the early 1990s, begun in the US, kick-
started reactive strategy changes within the energy industry. Ireland followed the pattern
with the Electricity Regulation Act of 1999, a programme which is nearing completion.
Fifty-six articles on related subjects between 1992 and 1994, in contrast to just eighteen in
the following six years to the year 2000 (ProQuest database), would seem to support this
assertion.
In the last decade or so, innovation for utilities has centred on the technology enabling the
Smart Grid, and again an upsurge in articles on this subject stands out in a normally ‘steady
state’ sector. More recently, the pressures of diminishing supply and consequently higher
prices of raw materials for energy production have propagated a sustainability drive.
Compliance, however, has been a steady influence on energy utilities. What makes the Smart
Grid attractive is the way it forces efficiency throughout the energy supply chain, from
generation to distribution, resulting in lower CO2 emissions, a major deliverable of the Kyoto
agreement. Related to this has been the drive towards sustainable energy generation and
supply. Vice President of Technology at Cobb Energy, Bob Arnett sums it up:
“In today’s world, where utilities are focused on environmental concerns, resource
constraints, and intelligent grids, it is sometimes hard to remember that in the mid-
Nineties, the word of the day was ‘deregulation’.”
(Arnett, 2011)
This case study looks at utility companies in the context of these three key drivers:
Regulation/Deregulation, Smart Grid and Sustainability. The case is stated in general terms
initially but quickly moves to more specific Smart Grid applications in electricity supply
companies, focusing on one Irish energy company’s use of databases in its implementation of
Smart Grid applications. As the ESB’s (Electricity Supply Board) Tom Geraghty said of
Smart Metering in a recent interview with Silicon Republic:
“How you get data back from the electronic metre to a utility central point where it is
aggregated and the bill is sent out to simply allowing people to top up their metre at
home as if it were a mobile phone shows you the complexity that lies ahead. There are
many imaginative options emerging and the opportunities are endless,”
(in Kennedy, 2011)
One estimate from Lux Research puts the increase in data coming from the Smart Grid at
900% by 2020 (St. John, 2011). Tony Giroti puts this in more tangible terms: 1 million smart
meters passing data every 15 minutes equates to 30 TB of data per year to be handled, stored
and harvested (Giroti, 2011). This figure does not include the real-time data flowing through
the system as part of the self-healing attribute of Smart Grids.
The problem can be placed within the wider question asked in this dissertation, that is, what
is the future of the traditional RDBMS in the enterprise? To this end, this case study
posits that the general feeling towards newer database management solutions such as
open source and NoSQL is that, while they are attractive for certain non-core applications,
they are not yet up to the task of the more serious mission-critical functions of control
systems, financial transactions and customer management within enterprises. This study
investigates the problem in the context of traditionally risk averse utility companies and
questions if new business drivers (of which the Smart Grid is key) are forcing a rethink on
this issue.
A public utility company is an enterprise which provides key services to the public, most
typically electricity, gas, water, and transportation. Utilities may be state-owned or privately
owned, and may operate in a regulated, deregulated or even semi-regulated market (Legal
Dictionary). The energy sector in Ireland is currently undergoing dramatic change. The two
largest energy companies in Ireland, the Electricity Supply Board (ESB) and Bord Gáis, are
commercially run enterprises and are both majority-owned by the state. Both companies have
recently entered each other’s markets as a result of the state’s requirement (driven by the EU)
to open up the energy market in an attempt to improve the competitiveness of the sector for
the benefit of consumers (Irish Government White Paper, 2007).
One result of this restructuring of the sector is that the separate electricity and gas markets
have been combined and the sector is now generally referred to as the energy market. The
functions carried out by utility companies differ according to the services they provide.
Energy suppliers are similar in the functions they carry out such as generation, transmission
and distribution of energy. Water utilities in other countries have moved towards a revenue-
generating model for water supply, and Ireland, rightly or wrongly, may soon follow suit.
Each core function contains a number of supporting IT applications. Each of these in turn is
supported by a suitable data management system. Some of the major solutions used in energy
utilities include: Geographical Information System (GIS); Meter Data Management (MDM);
Customer Information System (CIS); Distribution Management System (DMS); Supervisory
Control and Data Acquisition (SCADA); and Outage Management System (OMS). Figure 2.3
shows where some of these systems fit into the overall network.
Each of these systems provides support for the specific needs of the different business
functions, such as supply, generation, distribution, trading, and operations. As such, they
may or may not be integrated. In relation to meter data management (MDM), Giroti again
states the problem succinctly in his paper entitled “You’ve Got the Meter Data – Now
What?” (2011), where he gives two options:
1. Have a proactive strategy for integrating and managing data coming from the Grid, or
2. Be reactive in response to problems as they appear, at the risk of being left behind by
competitors adopting the former strategy.
Smart Grid - The ESB case
The European Technology Platform definition of smart grids is -
“electricity networks that can intelligently integrate the behaviour and actions of all users
connected to it - generators, consumers and those that do both – in order to efficiently deliver
sustainable, economic and secure electricity supplies” (Smart Grids: European Technology
Platform, 2010)
Successful smart grid implementation depends on how enterprises utilise information systems
in managing the torrent of data heading their way. This issue puts data management systems
right back in the foreground of the IT game.
The ESB plans to invest up to €11 billion in sustainable projects including a Smart Grid
(Strategy Framework 2020). The ESB began a pilot project for advanced metering in 2007.
Advanced meters occupy what is termed the ‘head end’ of the smart grid. They reside on
customer premises or at the company’s own locations, typically at the edge of the distribution
network. The ESB has to date installed 6,500 smart meters; the estimated total required for
full implementation is over two million. The data consists of messages to and from a central
management system called a meter data management system (MDM). The messages can
contain meter data relating to load readings, voltage and temperature measurements,
outages, faults and other events.
The ESB’s existing data management platforms include solutions from Oracle, IBM and
Microsoft. Currently, no open source or NoSQL solutions exist in any official capacity in the
company. A preliminary evaluation of the open source database solution MySQL was carried
out by the IT department in 2010, but no decision on implementation has been made as yet.
MySQL is now under the roof of the Oracle house following its acquisition of Sun
Microsystems in 2010 (Lohr, 2009).
Image source: http://www.consumerenergyreport.com/wpcontent/uploads/2010/04/smartgrid.jpg
Figure 2.2 – Overview of a generic Smart Grid
(Image source: EPRI)
Figure 2.3 - ESB proposed implementation of Advanced Metering (Key area of interest is circled)
The Data Volume Problem
A traditional electricity grid is made up of electro-mechanical components that link electricity
generation, transmission and distribution to consumers. A smart grid builds on advanced
digital SCADA devices involving two-way communication of data of interest to utilities,
consumers and government (Financial Times, Nov 2010).
Figures for how much data will flow vary depending on the implementation of smart grid.
Estimates from the ESB’s trials involving 6,500 meters show a substantial increase in the
amount of data required to be stored and analysed at the back end.
Utilities, it seems, are not immune to ‘Big Data’. Tony Giroti is well qualified to comment on
the issue: he is one of only 13 elected members of the GridWise Architecture Council, formed
by the US Department of Energy for the purpose of articulating the way forward for
intelligent energy systems.
In his article for the e-magazine Electric Energy Online, “You’ve Got the Meter Data – Now
What?” (2011), Giroti states the data volume problem as follows:
Figure 2.4 – Smart Meters transaction rate
Giroti foresees the storage and processing concerns associated with this volume of data.
Figure 2.5 – Smart Meters data size
Processing this data also presents a challenge to system architects. Gathering data from
a million smart meters at 15-minute intervals, as per the example above, equates to 1,111
transactions per second, or roughly 90 million transactions per day. The problem is further
compounded by the critical requirement for the system to analyse network event transactions
in real time when responding to fluctuations in demand and to faults (Giroti, 2011).
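The arithmetic behind these figures is easy to check, as in the Python sketch below (the 1 KB payload per transaction is Giroti's assumption, not a measured value):

    meters = 1_000_000
    interval_s = 15 * 60                     # one read per meter every 15 minutes

    tx_per_sec = meters / interval_s
    print(f"{tx_per_sec:,.0f} transactions per second")                # ~1,111

    payload = 1024                           # 1 KB per transaction (Giroti's figure)
    print(f"{tx_per_sec * payload / 1e6:.1f} MB/s sustained ingest")   # ~1.1 MB/s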
One limitation of Giroti’s claim is that there is no indication in the article of how the one-
kilobyte-per-transaction figure is calculated. This is an important factor for vendors of back-
end processing running on relational databases: the lower this number, the better. Some
systems rely on filtering out less important data at the source, that is, at the meter itself,
rather than storing superfluous data at the back end. For example, meter location information
does not change and can be sent only once. Even at a conservative data size of 128 bytes per
transaction, however, the volumes involved remain substantial.
(Figure 2.4 content: 1 million smart meters, each read every 15 minutes, give 1 million meter
reads per 15 x 60 seconds, i.e. 1,111 transactions per second.)
(Figure 2.5 content: 1 million smart meters, with hourly collections of data at 1 KB per
transaction per meter (1.1 MB/s), give 3.6 gigabytes of data per day to be stored, analysed
and backed up.)
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick
CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick

Mais conteúdo relacionado

Destaque

Employment Benefits
Employment BenefitsEmployment Benefits
Employment BenefitsMizdean40
 
Olivia Presentation FINAL
Olivia Presentation FINALOlivia Presentation FINAL
Olivia Presentation FINALOlivia Yost
 
Soal dasar snmptn 2012 kode 724
Soal dasar snmptn 2012 kode 724Soal dasar snmptn 2012 kode 724
Soal dasar snmptn 2012 kode 724ulfa marzuqo
 
Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...Miraj Patel
 
Actpresentation 150527162058-lva1-app6892
Actpresentation 150527162058-lva1-app6892Actpresentation 150527162058-lva1-app6892
Actpresentation 150527162058-lva1-app6892sophie167
 
Differences between Q335 and Current Design
Differences between Q335 and Current DesignDifferences between Q335 and Current Design
Differences between Q335 and Current DesignHussein Mohamed, PE
 
Snmptn 2012 ipa(633)
Snmptn 2012  ipa(633)Snmptn 2012  ipa(633)
Snmptn 2012 ipa(633)ulfa marzuqo
 
AUTISM SPECTRUM DISORDER (ASD) Education Environment
AUTISM SPECTRUM DISORDER (ASD)  Education EnvironmentAUTISM SPECTRUM DISORDER (ASD)  Education Environment
AUTISM SPECTRUM DISORDER (ASD) Education EnvironmentHussein Mohamed, PE
 
Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...Miraj Patel
 
Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613
Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613
Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613ulfa marzuqo
 
Projekt ne Gjeografi
Projekt ne GjeografiProjekt ne Gjeografi
Projekt ne GjeografiEnxhu Ng
 
SAJID KN (Anaesthesia Technician cv)
SAJID KN (Anaesthesia Technician cv)SAJID KN (Anaesthesia Technician cv)
SAJID KN (Anaesthesia Technician cv)SAJID KN
 
Heat Treatment Defects and their Remedies
Heat Treatment Defects and their RemediesHeat Treatment Defects and their Remedies
Heat Treatment Defects and their RemediesMiraj Patel
 

Destaque (15)

Employment Benefits
Employment BenefitsEmployment Benefits
Employment Benefits
 
Olivia Presentation FINAL
Olivia Presentation FINALOlivia Presentation FINAL
Olivia Presentation FINAL
 
Soal dasar snmptn 2012 kode 724
Soal dasar snmptn 2012 kode 724Soal dasar snmptn 2012 kode 724
Soal dasar snmptn 2012 kode 724
 
OttoGraham
OttoGrahamOttoGraham
OttoGraham
 
Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...
 
Actpresentation 150527162058-lva1-app6892
Actpresentation 150527162058-lva1-app6892Actpresentation 150527162058-lva1-app6892
Actpresentation 150527162058-lva1-app6892
 
Differences between Q335 and Current Design
Differences between Q335 and Current DesignDifferences between Q335 and Current Design
Differences between Q335 and Current Design
 
Snmptn 2012 ipa(633)
Snmptn 2012  ipa(633)Snmptn 2012  ipa(633)
Snmptn 2012 ipa(633)
 
AUTISM SPECTRUM DISORDER (ASD) Education Environment
AUTISM SPECTRUM DISORDER (ASD)  Education EnvironmentAUTISM SPECTRUM DISORDER (ASD)  Education Environment
AUTISM SPECTRUM DISORDER (ASD) Education Environment
 
Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...Integration of Inventory and the Delivery Model for Operational Efficiency an...
Integration of Inventory and the Delivery Model for Operational Efficiency an...
 
Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613
Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613
Pembahasan soal snmptn 2012 tes potensi akademik (penalaran analitis) kode 613
 
Projekt ne Gjeografi
Projekt ne GjeografiProjekt ne Gjeografi
Projekt ne Gjeografi
 
Farmakologi
FarmakologiFarmakologi
Farmakologi
 
SAJID KN (Anaesthesia Technician cv)
SAJID KN (Anaesthesia Technician cv)SAJID KN (Anaesthesia Technician cv)
SAJID KN (Anaesthesia Technician cv)
 
Heat Treatment Defects and their Remedies
Heat Treatment Defects and their RemediesHeat Treatment Defects and their Remedies
Heat Treatment Defects and their Remedies
 

Semelhante a CS4105_The _Research_Project_Stuart_Clancy_and _Ed_Fitzpatrick

BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...Thomas Rones
 
Database Management System ( Dbms )
Database Management System ( Dbms )Database Management System ( Dbms )
Database Management System ( Dbms )Kimberly Brooks
 
A Global Web Enablement Framework for Small Charities and Voluntary Sector Or...
A Global Web Enablement Framework for Small Charities and Voluntary Sector Or...A Global Web Enablement Framework for Small Charities and Voluntary Sector Or...
A Global Web Enablement Framework for Small Charities and Voluntary Sector Or...Tom Robinson
 
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf6510.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65
10.0000@citeseerx.ist.psu.edu@generic 8 a6c4211cf65Med labbi
 
Dw hk-white paper
Dw hk-white paperDw hk-white paper
Dw hk-white paperjuly12jana
 
CUTTING THROUGH THE FOG: UNDERSTANDING THE COMPETITIVE DYNAMICS IN CLOUD COMP...
CUTTING THROUGH THE FOG: UNDERSTANDING THE COMPETITIVE DYNAMICS IN CLOUD COMP...CUTTING THROUGH THE FOG: UNDERSTANDING THE COMPETITIVE DYNAMICS IN CLOUD COMP...
CUTTING THROUGH THE FOG: UNDERSTANDING THE COMPETITIVE DYNAMICS IN CLOUD COMP...HarshitParkar6677
 
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...
Big data document (basic concepts,3vs,Bigdata vs Smalldata,importance,storage...Taniya Fansupkar
 
Intrusion Detection on Public IaaS - Kevin L. Jackson
Intrusion Detection on Public IaaS  - Kevin L. JacksonIntrusion Detection on Public IaaS  - Kevin L. Jackson
Intrusion Detection on Public IaaS - Kevin L. JacksonGovCloud Network
 
Review of big data analytics (bda) architecture trends and analysis
Review of big data analytics (bda) architecture   trends and analysis Review of big data analytics (bda) architecture   trends and analysis
Review of big data analytics (bda) architecture trends and analysis Conference Papers
 
Big Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesBig Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesKaran Deep Singh
 
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...
Whitepaper: Big Data 101 - Creating Real Value from the Data Lifecycle - Happ...Happiest Minds Technologies
 
Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds
 Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds Big Data 101 - Creating Real Value from the Data Lifecycle - Happiest Minds
List of Figures .......... X
List of Tables .......... X
List of Abbreviations .......... XI
Chapter One - Introduction .......... 1
  1.1 The Research Question .......... 1
  1.2 Document Roadmap .......... 2
Chapter Two - Literature review, findings and analysis .......... 4
  2.1 Introduction .......... 4
  2.2 RDBMS .......... 4
    2.2.1 History of the RDBMS .......... 10
    2.2.2 Main Features of 'true' RDBMS .......... 13
    2.2.3 IBM, Ellison and the University of California, Berkeley .......... 15
  2.3 New Databases .......... 19
    2.3.1 Features of NoSQL Databases .......... 20
    2.3.2 Hadoop .......... 23
      2.3.2.1 Components of Hadoop .......... 24
    2.3.3 Cassandra .......... 25
  2.4 The market for RDBMS' and Non-Relational DBMS' .......... 27
    2.4.1 Introduction .......... 27
    2.4.2 RDBMS Market .......... 27
      2.4.2.1 Vendor Offerings .......... 28
    2.4.4 Open Source Databases .......... 32
      2.4.4.1 Non-RDBMS Market .......... 32
  2.5 Case Studies .......... 36
    2.5.1 Case Study 1 - Utility Companies and the Data Management challenge .......... 36
      2.5.1.1 Introduction .......... 36
      2.5.1.2 Utilities .......... 36
      2.5.1.3 Smart Grid - The ESB case .......... 39
      2.5.1.4 The Data Volume Problem .......... 41
      2.5.1.5 How one utility company is meeting the data volume challenge .......... 44
      2.5.1.6 What is the ESB doing? .......... 45
      2.5.1.7 Conclusion .......... 46
    2.5.2 Case Study 2 - Social Networks - The migration to Non-SQL database models .......... 47
      2.5.2.1 Facebook Messages .......... 48
      2.5.2.2 Twitter - The use of NoSQL databases at Twitter .......... 49
Chapter Three - Research Methodology .......... 52
  3.1 Introduction .......... 52
  3.2 The strategy adopted for researching the question .......... 53
  3.3 A Theoretical Framework .......... 55
  3.4 Research Design .......... 57
  3.5 Methodology - A Qualitative Approach .......... 58
  3.6 Methods .......... 58
    3.6.1 Method - Analytic Induction .......... 59
    3.6.2 Method - Content Analysis .......... 59
    3.6.3 Method - Historical Research .......... 59
    3.6.4 Method - Case Study .......... 60
    3.6.5 Method - Grounded Theory .......... 60
  3.7 Ethics Approval .......... 61
  3.8 Audience .......... 61
  3.9 Significance of research .......... 61
  3.10 Limitations of the research methodology .......... 62
  3.11 Conclusion .......... 62
Chapter Four - Conclusions, Limitations of Research and Future Work .......... 63
  4.1 Introduction .......... 63
  4.2 Conclusions .......... 64
    4.2.1 RDBMS .......... 64
    4.2.2 New DB's .......... 64
    4.2.3 Market .......... 65
    4.2.4.1 Case Study 1 - Utility Companies .......... 66
    4.2.4.2 Case study 2 - Social Networks .......... 66
  4.3 Future Research .......... 67
    4.3.1 NoSQL .......... 67
    4.3.2 Case Studies .......... 68
    4.3.3 Business Intelligence .......... 68
    4.3.4 Research Methodology .......... 68
  4.4 Limitations of the Research .......... 69
  4.5 Final thoughts .......... 70
REFERENCES .......... 71
APPENDIX 1 .......... 85
List of Figures
Figure 2.1 - A simplified DBMS .......... 9
Figure 2.2 - Overview of a generic Smart Grid .......... 40
Figure 2.3 - ESB proposed implementation of Advanced Metering .......... 41
Figure 2.4 - Smart Meters transaction rate .......... 42
Figure 2.5 - Smart Meters data size .......... 42
Figure 2.6 - Sources of Smart Grid data with time dependencies .......... 43

List of Tables
Table 2.1 - Impact of unstructured data on productivity .......... 8
Table 2.2 - Example of redundant rows in a database .......... 14
Table 3.1 - Key concepts in Qualitative and Quantitative research methodologies .......... 54
Table A.1 - Edgar Codd's original relational model terms .......... 85
List of Abbreviations

ACID - Atomicity, Consistency, Isolation and Durability.
ACM - Association of Computing Machinery.
BA - Business Analytics.
BASE - Basically Available, Soft state, Eventual consistency.
BI - Business Intelligence.
BSD - Berkeley Software Distribution.
CA - Computer Associates.
CAP - Consistency, Availability and Partition tolerance.
CIS - Customer Information System.
CODASYL - Conference on Data Systems Languages.
CRM - Customer Relationship Management.
DBMS - Database Management System.
DMS - Distribution Management System.
DW - Data Warehousing.
ERM - Enterprise Relationship Management.
GB - Gigabyte.
GBT - Google Big Table.
GFS - Google File System.
GIS - Geographical Information System.
HA - High Availability.
HDFS - Hadoop Distributed File System.
IA - IBM's Information Architecture.
IBM - International Business Machines.
ISV - Independent Software Vendor.
IT - Information Technology.
KB - Kilobyte.
MB - Megabyte.
MDM - Meter Data Management.
MPL - Mozilla Public Licence.
MR - MapReduce.
NoSQL - 'No' SQL or, more often, 'Not Only' SQL.
OEM - Original Equipment Manufacturer.
OLAP - Online Analytical Processing.
OLTP - Online Transaction Processing.
OMS - Outage Management System.
OS - Operating System.
OSI - Open Source Initiative.
PB - Petabyte.
PDC - Phasor Data Concentrator.
PLM - Product Life-cycle Management.
RDBMS - Relational Database Management System.
SCADA - Supervisory Control and Data Acquisition.
SOA - Service Oriented Architecture.
SQL - Structured Query Language.
TB - Terabyte.
Chapter One - Introduction

Humans have been storing information outside of the brain probably since before the first consistent markings were made on a bone found in Bulgaria, dated to more than a million years ago. They have certainly done so since the later Neolithic clay calculi bearing symbols representing quantities and the cave paintings at Lascaux over 17,000 years ago, through to the invention of the moveable type printing press and eventually the first computers. Since the emergence of the information age over the last fifty years or so, the amount of data transferred and stored in computers has grown rapidly. Research from the International Data Corp (IDC) in 2008 puts that growth at 60% per annum (The Economist, 2010).

An added complexity is that executive strategies now have business intelligence for competitive edge as a key goal. Data management systems, which for many years have been the old reliable workhorse toiling away in the back end somewhere, are once again playing a key role in driving business growth. The question is, are they still capable of carrying out this new and challenging task? This dissertation asks that question and, more specifically, what is the future for the Relational Database Management System (RDBMS) in the Enterprise?

The data volume problem now has a name: 'Big Data'. Its nascence coincides with the growth of the Internet. Alternative solutions to the traditional RDBMS for dealing with 'Big Data' soon followed. Many of these solutions are based either on massively parallel processing (MPP, a.k.a. distributed computing) or on flipping the row store of the RDBMS into column store systems. More recently, MPP solutions are being positioned not as alternatives but as complements to RDBMS (Stonebraker et al., 2010). Add to this mix a dynamic data management market where vendors are acquiring new technology, merging with each other, adopting open source and creating hybrid stacks in an effort to gain advantage in a market forecast to grow to $32 billion by 2013 (Yuhanna, 2009).

1.1 The Research Question

Time was taken to carefully frame our research question so as to provide a clear path of exploration on the subject. The subject could have been framed as a hypothesis, such as: "The future for RDBMS in the Enterprise is looking bright", or a contrary statement, "The end is nigh for RDBMS". We chose to frame our research as an open-ended question to allow for a broad exploration of the subject with no preconception of the outcome.
The broadness of scope, however, is necessarily tempered by restricting our research to those organisations defined as enterprises. There is a difficulty here, as there is no overarching definition of an enterprise organisation; however, it is necessary to provide some clearly defined boundaries around the term. For this dissertation an enterprise is defined not by size or function alone. Enterprises, for us, are organisations where the scale of control is large. They include companies with a large number of customers and employees, as well as companies that control a large infrastructure or several functional units. Enterprises have one top-level strategy to which all other functional units are aligned. The last point is an important characteristic of an enterprise for our dissertation, as it applies to decision making for acquiring information management systems.

The presence of the word 'future' is central to locating the research in an exploratory and intuitive research domain. It prompts looking into the past in an attempt to explain the present and predict the future. It forces an open mind and a questioning approach. It enables the creation of new ideas which are either taken on or set aside for another time. The chapters and sections are set out below in an attempt to follow this map, in the view that the journey is the objective rather than the destination.

1.2 Document Roadmap

In writing this dissertation a balance was sought between addressing the issues raised by the initial question and the research methodology chosen. The bulk of this dissertation therefore centres on those two areas. In this chapter we introduce the concept of our research and why we find it interesting. The research question is explained and the objective is put in context.

Chapter two contains the literature review. The chapter begins with an outline of the RDBMS, its features and history of development. Particular attention is given to the role of IBM in the development of the RDBMS. The chapter moves on to discuss new databases and data management systems. A section on the DBMS market follows and presents an overview of the current vendor offerings. The market section does not attempt a comparison of available systems, as this work has been carried out in greater detail by others more expert than us. Throughout the dissertation we refer the reader to such work where it is not feasible for us to reproduce it.
Two case studies are included for the benefit of putting the research question in a practical context. The two areas chosen involve contrasting enterprises: on the one hand the relatively long established utilities sector, and on the other the new phenomenon of social networking and its associated companies. Even though they operate in widely different markets generating different types of data, both share similar problems when it comes to managing large amounts of data. Likewise, both are trying to get to grips with extracting value out of data for competitive edge.

Chapter three discusses the research methodology chosen by us. It deserves a chapter to itself in view of the objective of this dissertation. The chapter begins with an introduction to research theory. It then moves to a discussion of our research strategy. A research framework is introduced as a model of our strategy. The different methodologies available are outlined and our chosen option is explained. Next, a group of related research methods are outlined and the reason for their selection is stated. Short sections on ethics approval, audience and the significance of the research follow before a final section on the limitations of our chosen research methodology closes the chapter.

The final chapter attempts to pull together the conclusions and findings from all the previous sections. Relevant research threads and ideas not covered in sufficient detail in the dissertation are mentioned. The last sections present a summary of the limitations of the overall research and our final concluding thoughts.
Chapter Two - Literature review, findings and analysis

2.1 Introduction

In this section the focus is on the RDBMS. The intention is to provide an overview of its defining features. It is not an in-depth technical analysis of the RDBMS, and we would refer the reader to better papers on the subject, such as those published in the Communications of the Association of Computing Machinery (ACM), to which we refer several times. The section also sets out the background to the development of the RDBMS. Within that context an interesting discovery is made with respect to IBM's initial role in the development of database management systems. For the purpose of exploring the question on the future of the RDBMS, some associated concepts are discussed, such as data types, 'true' RDBMS, and whether or not the past can teach us something about the future.

2.2 RDBMS

Databases

It is unfortunate that in the realm of Information Technology (IT) acronyms are not always self-explanatory. Many such acronyms don't travel outside of their specific domain very well. Take for example DQDB, or Distributed Queue Dual Bus; outside of the world of high speed networks this may seem to be a very efficient urban transport vehicle. Luckily the term RDBMS contains within itself the individual components which define it: a system (S) composed of a database (DB) where information is stored by creating relationships (R) between data elements, and which can be managed (M) by users. It is helpful at this point to explain at least the hierarchy of these components.

Throughout this dissertation, data (singular: datum) and information are taken to be classifications of entities stored in a system. Data is lowest in the sense of the taxonomy data - information - knowledge - wisdom (the last sometimes called understanding), but not lower in real value; a single digit integer may be enough data to invoke the required wisdom to make an important decision.
For the purpose of simplicity, data here means a binary entry (such as yes or no, 1 or 0) or a nominal entry (such as dog, 470, Smith, XRA9000, etc.). An analogy from biology might see data as the molecules which make up a cell of information. The word 'molecules' is carefully suggested instead of 'atoms', given that 'atomicity' has particular significance for relational databases. Permitting an extension of the analogy would see a body of knowledge built from the cells of information. It would be unwise to stretch the analogy further to address wisdom.

Unhelpfully, the words 'data' and 'information' are often used interchangeably in the research literature. Examples of this are the concepts 'Big Data' and 'unstructured data', for what really ought to be called information. For this reason, and for the purpose of consistency, this dissertation will hold with the literature and consider the two terms as one, except where a distinction is required.

A database has been defined in a number of sources as a "collection of related data or information" (Bocij et al., 2006, p. 153; Elmasri and Navathe, 1989, p. 3). The Oxford English Dictionary defines a database as a "structured set of data held in a computer" (OED). However, the Cambridge Advanced Learner's online dictionary (2011) is perhaps closer to a contemporary definition: "A large amount of information stored in a computer system in such a way that it can be easily looked at or changed". It is noted that this later online edition of the Cambridge (2011) makes no explicit reference to relational, structured or organised data. This looser definition reflects the changing nature of data management as newer types and bigger volumes of data are being captured. Finally, a definition from the business world expands on the above, mentioning different types of data and hinting at the issues regarding scale: a database is "a systematically organized or structured repository of indexed information (usually as a group of linked data files) that allows easy retrieval, updating, analysis, and output of data. Stored usually in a computer, this data could be in the form of graphics, reports, scripts, tables, text, etc., representing almost every kind of information." (Business Dictionary, 2011).
Structured and unstructured data

The last definition above alludes to unstructured data. Unstructured data is data in the form of text (words, messages, symbols, emails, SMS texts, reports) or bitmaps (images, graphics). A good example of the growing relevance of unstructured information is a Facebook page containing images, short messages, links, and chunks of text that can be altered at any time. Structured data, by contrast, is any data "that has an enforced composition to the atomic data types" (Weglarz, 2004). Atomicity is a characteristic of a stored entity which is not divisible (Elmasri and Navathe, 1989, p. 41). Atomicity is a key necessity for defining structured data and is what relational databases rely on to make relationships. A database designer can decide on the exact rules for the structured data and the level of atomicity required. As an aside, it is often this small amount of flexibility in the design of the data model which is responsible for the creation of many 'bad' databases. Structured data is data that is consistent, unambiguous and conforms to a predefined standard. Structured data will be examined in more detail later, in the section discussing the RDBMS.

A third type is semi-structured data. This is data held in a standard format such as forms, spreadsheets and XML files. This type of data can be parsed by computer programs more easily than unstructured data, because the data is generally located in a fixed and known place, even if the data itself is not atomic.

The problem of structured versus unstructured data types can be stated using the example of two schools. One school grades students in the traditional way by giving a numerical grade following examination. The other school does not give numerical grades, preferring a method whereby students are furnished with a qualitative report on their overall performance. The former is structured data, as the meaning of a grade of 82% is consistent in the context of the school's grading system. It can be easily recorded, measured, and compared to other grades internally or from other schools using the same system. The report format, however, is unstructured, and comparison with a numerical grading system is not so easy. Gleaning relevant information from a text report is complex and involves semantic analysis, with or without the help of technology.
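The contrast can be made concrete with a small sketch. The following Python fragment is our own illustration (the student names, grades and report text are invented for the example): a structured grade is trivially comparable, a semi-structured record is parseable because its fields sit in known places, and an unstructured report resists anything better than crude keyword matching.

    import xml.etree.ElementTree as ET

    # Structured: an atomic numerical grade; comparison is well defined.
    grades = {"school_a_student_1": 82, "school_a_student_2": 71}
    best = max(grades, key=grades.get)  # unambiguous, machine-comparable

    # Semi-structured: an XML record; the fields are in known places even
    # though the values themselves need not be atomic.
    record = ET.fromstring("<student><name>Smith</name><grade>82</grade></student>")
    grade = int(record.findtext("grade"))

    # Unstructured: a free-text report; only crude keyword matching is
    # possible without semantic analysis.
    report = "Smith has performed well overall, though attendance was poor."
    looks_positive = "well" in report.lower()  # fragile and context-blind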
What does this mean for enterprises?

Eighty percent of information relevant to business is unstructured, and it is mostly in textual form (Langseth in Grimes, 2011). Seth Grimes, an analytics expert with the Alta Plana Corporation, has previously investigated this claim. He concludes that even if the origins of the 80% figure are elusive (Grimes tracks it back as far as the 1990's), experience supports the claim (Grimes, 2011). Patricia Selinger (IBM and ACM Fellow), who has worked on query optimisation for 27 years, puts unstructured data in companies at about 85% (Selinger, 2005). Even assuming a figure lower than 80% for unstructured data in larger enterprises, where much information is in structured forms held in traditional transaction based databases, there is still the problem of how to leverage competitive advantage out of the nuggets of information buried in the rich seams of unstructured data. Businesses are realising that the chances of extracting valuable wisdom from traditional data stores using stale analysis methods and tools are diminishing, and that new ideas are needed.

Unstructured data is also growing faster than structured data. According to the "IDC Enterprise Disk Storage Consumption Model" 2008 report, "while transactional data is projected to grow at a compound annual growth rate (CAGR) of 21.8%, it's far outpaced by a 61.7% CAGR prediction for unstructured data" (Pariseau, 2008). Kevin McIssac (2007) of Computer World magazine puts it into perspective: "Unfortunately business is drowning in unstructured data and does not yet have the applications to transform that data into information and knowledge. As a result staff productivity around unstructured data is still relatively low."

McIssac gives examples of the impact of unstructured data on productivity, citing research from various sources. Table 2.1 below summarises those impacts:
Time/Volume | Impact on productivity | Research Source
9.5 hours per week | Average time an office worker spends searching, gathering and analysing information (60% of that on the Internet) | Outsell
10% of working time | Time professionals in the creative industry spend on file management | GISTICS
600 e-mails per week | Sent and received by a typical business person | Ferris Research
49 minutes per day | Time an office worker spends managing e-mail; longer for middle and upper management | ePolicy Institute

Table 2.1 - Impact of unstructured data on productivity.

Where are the joins?

It seems that a reappraisal of what a database is, or needs to do, is well under way. If this is so, then the reappraisal logically extends to the database management system. Structured data can be joined to other structured data to form concatenations of information using a query language based on mathematical operations. Things get a little more 'fuzzy' with unstructured data. Stock market analysts might like to try querying online media sources for all posts where the word 'oil' is used, but only in the context of the recent crisis in Libya. How unstructured and unrelated data is to be stored in the system, and how meaningful information can be retrieved back out of that same system, are questions many organisations are now asking - but similar questions were asked before, and the past may hold some lessons for us.
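To illustrate why such a query is 'fuzzy', consider the sketch below. It is our own illustration (the posts are invented): a relational join can be expressed exactly, whereas the analysts' "oil in the context of Libya" request can only be approximated by keyword matching, with no guarantee of precision or recall.

    posts = [
        "Oil prices surged as the crisis in Libya deepened.",
        "She changed the oil in her car over the weekend.",
    ]

    # A naive context filter: keep posts mentioning 'oil' near 'libya'.
    # Unlike a relational join over common domains, this is heuristic.
    def oil_in_libya_context(post: str) -> bool:
        text = post.lower()
        return "oil" in text and "libya" in text

    relevant = [p for p in posts if oil_in_libya_context(p)]
    # Only the first post survives; a post reading "Tripoli crude exports
    # halted" would be missed entirely, despite being exactly on topic.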
A DBMS

In its simplest definition a DBMS is a set of computer programs that allows users to create and maintain a database (Elmasri and Navathe, 1989, p. 4). Bocij et al. (2006, p. 154) expand on this definition a little: "One or more computer programs that allow users to enter, store, organise, manipulate and retrieve data from a database."

Figure 2.1 - A simplified DBMS (Source: Elmasri and Navathe, 1989, p. 5)

Figure 2.1 above shows the key components of a data management system. A detailed description of each of the components of the system is not necessary for our purpose, but briefly they are:

• Application programs with which users can interact with the stored data.
• Software programs for processing and accessing the stored data.
• A high-level declarative language interface for executing commands (commonly known as a query language).
• A repository for storing data.
• A store of information related to the data for classifying or indexing purposes (meta-data).
• Hardware suitable for each of the above functions.
• Users (including database administrators and designers).
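Several of these components can be seen working together in even the smallest database program. The sketch below uses Python's built-in sqlite3 module purely as an illustration (SQLite is our choice of example engine, not one discussed in this dissertation): the Python script plays the role of the application program, the SQL strings are the declarative query language, and the database file is the storage repository.

    import sqlite3

    # The database file is the storage repository; the sqlite3 library
    # supplies the software that processes and accesses the stored data.
    conn = sqlite3.connect("school.db")

    # SQL is the high-level declarative language interface: we state what
    # we want, not how the engine should retrieve it.
    conn.execute("CREATE TABLE IF NOT EXISTS student "
                 "(id INTEGER PRIMARY KEY, name TEXT, grade INTEGER)")
    conn.execute("INSERT INTO student (name, grade) VALUES (?, ?)", ("Smith", 82))
    conn.commit()

    for row in conn.execute("SELECT name, grade FROM student WHERE grade > 80"):
        print(row)  # the application program presenting stored data to the user

    conn.close()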
2.2.1 History of the RDBMS

To understand why newer types of databases and data management systems are emerging and taking hold, it seems reasonable to explore why RDBMS' came into existence, as well as their usefulness and relative longevity.

The 1960's BC (Before Codd)

Data management systems existed before Edgar Codd, while at IBM, wrote his seminal paper published in 1970 called "A Relational Model of Data for Large Shared Data Banks". Codd's paper presented a new database model and hence introduced the world of database management to relational theory (Codd, 1970). In his paper Codd discusses the limitations of the existing hierarchical and network data systems and introduces a query language based on relational algebra and predicate calculus. In a later important paper he described 12 rules for a relational database management system (Codd, 1985). Systems that satisfy all 12 rules are rare. In fact, it is argued that no truly relational database systems existed in wide commercial production even a decade after Codd's vision (Don Heitzmann in Thiel, 1982), and even up to more recently (Anthes, 2010). A brief description of the two data management systems whose limitations Codd addressed is a useful precursor to a broader description of relational DBMS'.

Hierarchical Data Models

Hierarchical data models are similar to tree-structured file systems in that the data is stored as parent-child relationships. Codd asserted that hierarchical and network based DBMS' were not data models in comparison to his more formalised relational model (Codd, 1991). For simplicity the word 'model' is maintained for the data structure of all systems under discussion here. The model made sense to organisations that were naturally hierarchical in nature - a legacy of Henri Fayol and his 14 management principles, popular in the 1960's and still used in organisations today (Stoner and Freeman, 1989; Tiernan et al., 2006).

A hierarchical data model can be presented as a tree-structure of parent-child relationships or as an adjacency list. For example: a root entity with no parent might be SCHOOL; STUDENT is a child of SCHOOL; GRADE is a child of STUDENT. STUDENT is also a child of COURSE. In this type of structure data can be replicated many times in different branches of the tree, a relationship of 'one to many' or 1:N. A 'modified preorder tree traversal' algorithm is used to number each entity on the way down through the tree-structure (left value) and again on the way back up to the root (right value), thus making query operations more efficient in navigating around the data (Van Tulder, 2003), as the short sketch below illustrates.
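The numbering scheme can be sketched in a few lines of Python (our own illustration, reusing the SCHOOL example above): a depth-first walk assigns each node a left value on the way down and a right value on the way back up, after which "find all descendants of X" becomes a simple range test rather than a recursive walk.

    # A toy tree: SCHOOL is the root; STUDENT has a GRADE child.
    tree = {"SCHOOL": ["COURSE", "STUDENT"], "COURSE": [], "STUDENT": ["GRADE"], "GRADE": []}
    numbering = {}

    def number(node, counter=[1]):
        left = counter[0]; counter[0] += 1              # left value, on the way down
        for child in tree[node]:
            number(child, counter)
        numbering[node] = (left, counter[0]); counter[0] += 1  # right value, on the way up

    number("SCHOOL")
    # A node Y is a descendant of X iff left(X) < left(Y) and right(Y) < right(X),
    # so subtree queries need no recursion at query time.
    l, r = numbering["SCHOOL"]
    descendants = [n for n, (nl, nr) in numbering.items() if l < nl and nr < r]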
The first hierarchical DBMS was developed by IBM and North American Aviation in the late 1960's (Elmasri and Navathe, 1989, p. 278). IBM imaginatively called it the Information Management System (IMS), and Frank Hayes dates its roll out to 1968 (Hayes, 2002). Elmasri and Navathe (1989, p. 278) cite McGee (1977) for a good overview of IMS.

Network Data Models

As can be seen in the hierarchical data model above, a child may logically belong to many parents. A STUDENT, for instance, can take more than one MODULE in any COURSE YEAR; in a hierarchical structure the same STUDENT would appear under each of the MODULE trees. In other words, many students can take many modules. The network data model was a further development of the hierarchical model to address the issue of managing 'many to many' (M:N) relationships. The Conference on Data Systems Languages (CODASYL) defined the network model in 1971 (Elmasri and Navathe, 1989). Where the underlying principle of the hierarchical model was parent-child tree structures, in the network model it is set theory. Records are classified into record types and given names. These records are sets of related data. Record types are akin to tables in a relational database model. The intricacies of set theory are beyond the scope of this dissertation; however, it suffices to say that complex data combinations can be achieved by nesting record types within other record types - data sets as members of other data sets. If this were possible in a relational database it would be like having tables within tables within tables.
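For contrast, the relational model expresses the same STUDENT-MODULE M:N relationship without nesting or duplication, using a junction table of foreign keys. A minimal sketch, our own illustration and again using SQLite as the example engine:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE module  (id INTEGER PRIMARY KEY, title TEXT);
        -- The junction table carries the M:N relationship as pairs of
        -- foreign keys; no record is ever nested inside another.
        CREATE TABLE enrolment (
            student_id INTEGER REFERENCES student(id),
            module_id  INTEGER REFERENCES module(id)
        );
    """)
    conn.execute("INSERT INTO student VALUES (1, 'Smith')")
    conn.executemany("INSERT INTO module VALUES (?, ?)", [(1, "Databases"), (2, "Networks")])
    conn.executemany("INSERT INTO enrolment VALUES (?, ?)", [(1, 1), (1, 2)])

    # One student, many modules - recovered with a join, not duplication.
    rows = conn.execute("""
        SELECT s.name, m.title FROM student s
        JOIN enrolment e ON e.student_id = s.id
        JOIN module m ON m.id = e.module_id
    """).fetchall()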
The earliest work on a network data model was carried out by Charles Bachman in 1961 while working for General Electric. His work resulted in the first commercial DBMS, called the Integrated Data Store (IDS), which ran on IBM mainframes. The system was cumbersome and was eventually redeveloped by an IDS customer, the BF Goodrich Chemical Company, into what was called IDMS (Hayes, 2002). With Bachman on board as a consultant, IDMS was eventually commercialised by Cullinane/Cullinet Software in the 1980's. Cullinet was bought by Computer Associates (CA) in 1989, and IDMS remains a current CA offering for mainframe database management today. Charles Bachman received the Turing Award in 1973 for his pioneering work in developing the first commercially available data management system, for being one of the founders of CODASYL, and for his work on representation methods for data structures (Canning in Bachman, 1973).

The 1970's

The Adabas DBMS was developed in the 1970's by Software AG. It has an interesting feature of relevance to this dissertation. Adabas was designed to run on mainframes for enterprises with large data sets requiring fast response times for multiple users. One of its main features is that it indexes data using inverted-list type indexing. Adabas also features a data storage address convertor which avoids data fragmentation. Data fragmentation can occur when a record is updated with additional data: the record becomes too large to be stored in its original location. The data can be moved to a new location, but the indexes still expect the data to be in the same place, so they also have to be updated; the address convertor does this. The alternative, as used by other systems, is data fragmentation, whereby part of the data is stored in the original location with a pointer to where the remainder is stored. Fragmentation and pointer methods, however, require additional processing and hence give slower response times. The problem of systems predating the RDBMS using pointers instead of storing data directly (in tuples, as is done in an RDBMS) is referred to by IBM's Irv Traiger (in McJones, 1997, pp. 16-17). According to Curt Monash, Adabas' inverted-list indexing is the favoured method for searching textual content. New ideas regarding the management of text (unstructured data) have, according to Monash, "at least the potential of being retrofitted to ADABAS, should the payoff be sufficiently high" (Monash, Dec 8 2007).
Edgar Codd and the birth of the Relational Model

Codd's text 'The Relational Model for Database Management' of 1990 (version 2, 1991) brings together the ideas set out in his previous papers regarding a relational data model for managing databases. In it he places his model as solidly based on two areas of mathematics: predicate logic and relational theory. In order for the maths to work effectively, there are four essential concepts associated with the relational model: domains, primary keys, foreign keys and no duplicate rows. In particular, the importance of domains has not been fully understood or adopted by later commercial versions of his RDBMS (Codd, 1991, p. 18). Also, two early prototypes, IBM's System R and the University of California, Berkeley's INGRES (under Michael Stonebraker), were not concerned about the need to address the issue of duplicated rows. The designers of both systems felt that the additional processing required to eliminate duplicate rows was unnecessary, given the relatively benign presence of duplicate rows (Codd, 1991, p. 18). Codd's purer model based on mathematical principles gave way to the more pragmatic needs of the commercial world.
2.2.2 Main Features of 'true' RDBMS

The main features of a relational DBMS as proposed by Codd distinguish a 'true' relational DBMS from other DBMS'. Based on his earlier paper setting out his 12 rules (1985), they are summarised as follows:

• Database information is values only, and ordering is not essential (meta-data, while required, should not be of concern to the everyday user; pointers are not used).
• Data management is not dependent on position within the structure (contrast with the hierarchical and network models).
• Duplicate rows are not allowed.
• Information should be capable of being moved without impact on the user.
• Three level architecture of the RDBMS - base relations, storage, views (derived tables).
• Declarations of domains as extended data types.
• Column descriptions should be akin to the domain they belong to (i.e. a good naming convention).
• Each base relation (R-Table) should have one and only one primary key column, in which null value entries are not allowed.
• The RDBMS must allow one or more columns to be assigned as foreign keys.
• Relationships are based on comparing values from common domains.

This last point is crucial to understanding Codd's intention. Only values from common domains can be properly compared - currency with currency, euro with euro, date with date, integer with integer, and so on. The basis for this lies in the nature of the mathematical operators used in the system. Consistency of data types and strict rules are therefore vital for the effective operation of the system. Herein lies one of the difficulties presented to designers of commercial versions of Codd's RDBMS: users of data management systems are presented with real world scenarios where consistency is not always practical. It would be ridiculous to ask members of a social networking site to use standard forms for communicating so that the DBMS could store the relevant information appropriately. Even closer to the relational database world, transaction records could be created for a person called William Thomas as follows:

Instance | Surname | Forename | Address | DOB | ID | Order No
1 | Thomas | William | 22, Greenview Street | 12/06/1945 | 1234 | 104
2 | Thomas | Bill | 22 Greenview St. | 12/06/1945 | 1365 | 104
3 | Thomas | William H. | 22, Greenview Street | 12/06/1945 | 3456 | 104

Table 2.2 - Example of redundant rows in a database

As can be seen in this simple example, the database treats these as three distinct and unique records, even though the intention is that only one record for this person should exist. The result impacts on the size, processing speed and integrity of the system. Techniques to address such problems (primarily data normalisation) were developed almost from the beginning, in the early 1970's, by Codd and later by Raymond Boyce and Codd (Elmasri and Navathe, 1989, p. 371). Database normalisation is beyond the scope of this dissertation; however, the salient point (and the reason for our initial hypothesis) is that the nature and amount of unstructured data flowing in the electronic ether has pushed the RDBMS and its associated control and optimisation processes to the limits of their capabilities.
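The William Thomas example above can be reproduced in a few lines (our own illustration, once more using SQLite as the example engine): because the three rows differ in at least one attribute value, no key or uniqueness constraint on the data as entered will catch them, which is exactly why normalisation and data-quality rules matter.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE orders (
        instance INTEGER PRIMARY KEY, surname TEXT, forename TEXT,
        address TEXT, dob TEXT, id INTEGER, order_no INTEGER)""")
    conn.executemany("INSERT INTO orders VALUES (?,?,?,?,?,?,?)", [
        (1, "Thomas", "William",    "22, Greenview Street", "12/06/1945", 1234, 104),
        (2, "Thomas", "Bill",       "22 Greenview St.",     "12/06/1945", 1365, 104),
        (3, "Thomas", "William H.", "22, Greenview Street", "12/06/1945", 3456, 104),
    ])

    # To the engine these are three distinct rows: the values differ, so
    # even a UNIQUE(surname, forename, address) constraint would pass.
    count = conn.execute("SELECT COUNT(*) FROM orders WHERE surname='Thomas'").fetchone()[0]
    print(count)  # 3, though the real-world intention was one customer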
Debashish Ghosh of Anshin Software, while advocating the merits of non-relational models, nevertheless puts it fairly: "A relational data management system (RDBMS) engine is the right tool for handling relational data used in transactions requiring atomicity, consistency, isolation, and durability (ACID). However, an RDBMS isn't an ideal platform for modelling complicated social data networks that involve huge volumes, network partitioning, and replication" (Ghosh, 2010).

The above discussion is intended to draw an important distinction between Edgar Codd's original theory of a relational data management system and the subsequent versions developed for the commercial enterprise market (the mainframe computer market at that time). The importance of the mathematical principles (relational algebra and calculus) behind Codd's ideas is not underestimated, nor are the associated operations based upon those principles; in fact they are key to understanding why Codd at the time persisted in pushing for a full and true implementation of his model, and they may also explain why he stepped back from the first experiments in commercialising his ideas (Chamberlin and Blasgen in McJones, 1997, p. 13). Brevity here forces us to move on to look at two of the earliest commercial versions of the RDBMS which, by no accident, are also the two market leaders today. As an aside, Appendix 1 presents a useful comparison of the key terms from Codd's original intended meaning and their relationship to other systems.

2.2.3 IBM, Ellison and the University of California, Berkeley

IBM

One artefact cited several times in this section on the history of data management systems is a transcript from a reunion meeting in 1995 of some of the original IBM research employees who, during the 1970's and 1980's, were at the coal face of data management development. The article, edited by Paul McJones, is entitled "The 1995 SQL Reunion: People, Projects, and Politics" (McJones, 1997).
At first what seems like the convivial reminiscences of middle-aged ex-IBM colleagues in fact turns out to be a rather more interesting illumination of the context around the timelines for the development of some of the most important ideas to emerge from the realm of database management, as well as its historically important players and products. Some of the key people attending the reunion and contributing to the discussion are: Donald Chamberlin, Jim Gray, Raymond Lorie, Gianfranco Putzolu, Patricia Selinger, and Irving Traiger. All are IBM and ACM Fellows and award winners for their work. Jim Gray, a fellow Berkeley graduate and mentor to Michael Stonebraker, was given the ACM Turing Award in 1998 for his work on transaction processing (ACID) (Stonebraker, 2008). Patricia Selinger was awarded the ACM Edgar Codd Innovation Award for her work on query optimisation. Their contributions were vital to the features of commercial RDBMS' which have ensured their longevity thus far, and possibly for many years yet.

IBM and System R

Midway through the 1970's, IBM's San Jose based research lab began working on a project called System R. Like many IBM research projects at the time, it came out of different task groups working on related areas such as data language, data storage, optimisation, concurrent users, and system recovery. System R was relational based and combined work from various groups. System R as a commercial RDBMS was installed in the Pratt & Whitney Aircraft Company in Hartford, Connecticut in 1977, where it was used for inventory control. However, IBM was not yet interested in releasing it as a fully featured product. At that time the big IBM cash cow was IMS (its mainframe hierarchical DBMS mentioned earlier), and the research focus was on a project called Eagle - a replacement for IMS with all the new features of recent discoveries. With the pressure off, the System R developers plugged away, aiming it towards the lower midrange product line (Jolls in McJones, 1997, p. 31). Two things then happened which brought the focus back on System R and getting it ready for market (McJones, 1997, pp. 33-34). Firstly, IBM was starting to lose ground to the new mini computers (Gray in McJones, 1997, p. 20), and secondly the Eagle project was hitting a wall. System R, unlike Eagle, was relational and already pitched towards the smaller computer range. The System R star did not shine for long, and it was replaced by DB2, with Release 1 in 1980. IBM fully embraced the relational DBMS with Release 2 around 1985 (Miller in McJones, 1997, p. 43). DB2 is IBM's current offering and is mentioned again in the section on the RDBMS market.

The Birth of SQL

At around the same time that System R was being developed, the language research team at IBM, Relational Data Systems (RDS), took on Codd's two mathematically based languages for data management, relational algebra and relational calculus. By their own admission they found these mathematical notations too abstract and complex for general use. They developed a notation which they called SQUARE (Specifying Queries as Relational Expressions) (Chamberlin in McJones, 1997, p. 11). SQUARE had some odd subscripts, so a regular keyboard could not be used. RDS further developed it to be closer to common English words, calling the new version the Structured English Query Language, or SEQUEL. The intention was to make interaction with databases easier for non-programmers. However, its biggest impact came later, when Larry Ellison (co-founder and CEO of Oracle) read the IBM published papers on SEQUEL and realised that this query language could act as an intermediary between different systems (Chamberlin in McJones, 1997, p. 15). It was the RDS team at IBM who renamed it SQL, following a trademark challenge to the term SEQUEL from an aircraft company (McJones, 1997, p. 20).
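To see what RDS were reacting against, compare a query written in Codd's notation with its English-keyword successor. The example is our own (the STUDENT relation and the grade threshold are invented for illustration); in the conventional relational algebra, selection (sigma) and projection (pi) give:

    \pi_{\mathit{name}}\bigl(\sigma_{\mathit{grade} > 80}(\mathit{STUDENT})\bigr)

In SEQUEL, and in the SQL that descended from it, the same request reads simply: SELECT name FROM student WHERE grade > 80.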
INGRES

In parallel with the work going on at IBM, the University of California at Berkeley had a project developing a system called INGRES (short for Interactive Graphics Retrieval System). Michael Stonebraker, who was at Berkeley in 1972, was developing a query language called QUEL. Stonebraker knew fellow Berkeley graduates at IBM San Jose and, more importantly, knew of their work. INGRES used QUEL, whereas IBM and Larry Ellison's project at Software Development Laboratories (later Oracle) used SQL. Subsequent offspring of the INGRES family are Sybase and Postgres (post-INGRES). Incidentally, Microsoft struck a deal with Sybase to use their code for their new extended operating system. Recalling that the Sybase people were brought up in the QUEL tradition under Stonebraker, Microsoft preferred SQL. The two eventually fell out, and Microsoft, who now owned the Sybase code, ended up developing Microsoft SQL Server (Gray in McJones, 1997, p. 56).
Oracle

In 1977 Larry Ellison, Bob Miner and Ed Oates founded Software Development Laboratories (SDL), the precursor to the Oracle Corporation. SDL based its system on a technical paper in an IBM journal (Oracle History, 2011) - Edgar Codd's 1970 seminal paper setting out his model for an RDBMS (Traiger in McJones, 1997). SDL's first contract was to develop a database management system for the Central Intelligence Agency (CIA); the project was called 'Oracle'. SDL finished that project a year early and used the time to develop a commercial RDBMS, putting together the work done by IBM research on relational databases and, as mentioned above, the work on the query language SEQUEL. While Ellison and SDL benefited from the work done at IBM, they still had to do all the coding. The resulting product was faster and a lot smaller than IBM's System R. The first officially released version of Oracle was version 2, in 1979. Brad Wade jokes about Edgar Codd's influence on Oracle - on Codd being made an IBM Fellow in 1976: "It's the first time that I recall of someone being made an IBM Fellow for someone else's product" (Wade in McJones, 1997, p. 49).

It appears that many new enterprises sprang from the well of knowledge existing at IBM during the 1970's and 1980's. Had the IBM research units not had so much talent, or not allowed publication of key papers at the time, the database world might look very different today. IBM prohibited patents on software, as in effect did US Supreme Court case law until 1980 (Bocchino, 1995). According to Franco Putzolu, IBM Research at that time, and up until 1979, was "publishing everything that would come to mind" (in McJones, 1997, p. 16). Mike Blasgen argues that the outside interest in the published research was one reason why the corporate machine of IBM began to notice some of the lesser research projects (in McJones, 1997, p. 16).

It is hoped that the above overview gives the reader some understanding of the related threads that developed out of Charles Bachman's initial work on data management systems, through IBM via Edgar Codd, and out into the wide world via the IBM research department's open attitude to sharing knowledge, from which Larry Ellison's Oracle benefited greatly. Berkeley also played its role, providing a common alma mater for young, enthusiastic developers to discuss ideas. It is an interesting irony that when we think of 'open source' we envision a recent phenomenon; however, IBM during the 1970's would appear to have been a little more open, for whatever reasons, than is usually credited to them.
2.3 New Databases

This section explores the development of the new DB's that have emerged on the database market over the past decade, and what impact these DB's will have on the database market as a whole.

What are 'New DB's'?

Traditional databases rely on a relational model in order to function: they follow a set of rigid rules to ensure the integrity of the data in the database. Most RDBMS models follow the set of rules originally outlined by Edgar Codd (1970). The new NoSQL database models do not follow all of the rules set down by Codd. While RDBMS models follow the set of properties called ACID, as previously stated, NoSQL database models do not; they instead follow alternative sets of properties, including BASE (Basically Available, Soft state, Eventual consistency) (Cattell, 2011), framed by the trade-offs of CAP (Consistency, Availability and Partition tolerance).

Why the development of NoSQL model databases?

The development of NoSQL databases was a result of the evolution of the World Wide Web and the desire of individuals and companies/organizations to generate data - large amounts of it (White, 2010, p. 2). Having collected data, organizations then had to extract value from that data in order to be successful in whatever field they participated in. The problems organizations faced in extracting value from that data were twofold:

1. As storage capacities increased, the means of transferring the data to and from the drive(s) did not keep up. Twenty years ago, a hard drive could store 1.3 GB of data, while the speed at which the entirety of the data could be accessed was 4.4 MB per second; about five minutes to access it all. Today, 1 TB hard drives are the norm, but access speeds are only about 100 MB per second, so relative to capacity, reading a full drive takes roughly 30 times longer (White, 2010, p. 3). A means of getting around this bottleneck was the introduction of disk arrays, whereby data could be written to and read from multiple disks in parallel. The drawback to this was the possibility of hardware failure, whereby a disk or machine would fail and the data be lost (White, 2010, p. 3). Redundancy (the various options of RAID being the most famous examples) solved some of these problems, but not all (Patterson, 1988).

2. The second problem is that with multiple disks, relational database models, with their inbuilt consistency requirements, are unable to access data quickly enough when the data is spread across multiple disk drives. An RDBMS may not allow a query to access certain data if that data is already in use by another program or user (Chamberlin, 1976).
Today 1 TB hard drives are the norm, but transfer speeds are only about 100 MB per second, so reading an entire drive takes more than two and a half hours; relative to capacity, access has become roughly thirty times slower (White, 2010, p. 3). A means of getting around this bottleneck was the introduction of disk arrays, whereby data could be written to and read from multiple disks in parallel. The drawback was the increased possibility of hardware failure, whereby a disk or machine would fail and the data on it be lost (White, 2010, p. 3). Redundancy (the various levels of RAID being the most famous examples) solved some of these problems, but not all (Patterson, 1988).

2. The second problem is that with multiple disks, relational database models, with their inbuilt consistency requirements, are unable to access data quickly enough when the data is spread across multiple disk drives. An RDBMS may not allow a query to access certain data if that data is already in use by another program or user (Chamberlin, 1976).

2.3.1 Features of NoSQL Databases

To be considered a NoSQL database, a system must first not comply with the full set of ACID properties. The features that define NoSQL databases include scalability, eventual consistency and low latency (Dimitrov, 2010). A key feature of NoSQL databases is a "shared-nothing" architecture. This means the database can replicate and partition data across multiple servers, which in turn allows it to support a large number of simple read/write operations per second (Cattell, 2011).

Scalability

With traditional RDBMSs, a database was usually required to scale up, that is, switch over to a newer, larger-capacity machine, if it was to expand capacity (Cattell, 2011). One of the features designed into some NoSQL databases is their ability to scale to large data volumes without losing the integrity of the data. With NoSQL, as systems are required to expand with an influx of additional data, they scale out by adding more machines to the data cluster.
With this scaling, NoSQL systems can process data at a faster rate than an RDBMS, as they are capable of spreading the processing workload over numerous machines (Cattell, 2011).

Eventual Consistency

Eventual consistency was pioneered by Amazon in its Dynamo database. The purpose of its introduction was to ensure high availability (HA) and scalability of the data. Data fetched for a query is not guaranteed to be up to date, but all updates to the data are guaranteed to be propagated to all copies of the data, on all nodes of the cluster, eventually (Cattell, 2011). This ensures that the database remains accessible to programs and individuals who wish to read or modify data, without the constraint of being locked out of a database or data field while the data is being updated or read, as is the case with RDBMS models.

Low Latency

Latency is an element of the speed of a network. It refers to any number of delays that typically occur in the processing of data (Mitchell, no date). In the case of NoSQL databases, queries can access the data and return answers more quickly than an RDBMS because the data is distributed across multiple nodes of a cluster instead of residing on one machine, resulting in faster response times. Causes of high latency in traditional RDBMS model databases include the seek time of hard disks (Mitchell, no date), the speed of the network connecting the machines, and poorly written queries (Stevens, 2004; Souders, 2009).

NoSQL database models

Unlike the relational model, NoSQL systems do not share a single, uniform data model. For storage purposes, NoSQL databases fall into a number of data model categories, which are listed below:

Key-value Stores

Databases that follow this model use a single key-value index for all the data. These systems provide persistence mechanisms as well as additional functions such as replication, locking, transactions and sorting.
NoSQL databases such as Voldemort and Riak use Multi-Version Concurrency Control (MVCC) for updates. They update data asynchronously, so they cannot guarantee consistent data (Cattell, 2011). Key-value store databases can nevertheless support basic operations familiar from SQL, such as delete, insert and lookup (Cattell, 2011).

Document Stores

This model supports more complex data than key-value stores. Document stores can support secondary indexes and multiple types of documents per database. Databases using this model include Amazon's SimpleDB and CouchDB. Document store databases provide a querying mechanism for the data they contain, using multiple attribute values and constraints (Cattell, 2011).

Extensible Record Stores

Influenced by Google's Bigtable, extensible record store databases consist of rows and columns, which are scaled across multiple nodes. Rows are split across nodes by 'sharding' on the primary key, which means that querying a range of values does not have to go to every node. Columns are distributed over multiple nodes by using 'column groups'. This allows the database customer to specify which columns are best stored together, with the added advantage that such data can be queried faster, as the most relevant data for a query is likely to be close at hand: e.g., name and address (Cattell, 2011). The best-known examples of extensible record store databases, apart from Google's proprietary Bigtable itself, are HBase and Cassandra. Additional databases using the model are Hypertable, sponsored by Baidu (Hypertable, 2011), and PNUTS (Yahoo Research, 2011).
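To make the sharding idea concrete before moving on, the short Python sketch below shows how rows can be assigned to nodes using their primary key. It is purely illustrative and not the code of any particular product; the node names and the shard_for function are our own inventions.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster members

def shard_for(primary_key: str) -> str:
    """Map a primary key to one node by hashing it.

    A stable hash (not Python's built-in hash(), which is randomised
    per process) keeps the mapping consistent across machines.
    """
    digest = hashlib.md5(primary_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# A single-key lookup needs to contact only one node...
print(shard_for("customer:42"))

# ...whereas a range query over hashed keys may touch every node, which is
# why extensible record stores shard rows by *ranges* of the primary key
# rather than by hash alone.
keys = [f"customer:{i}" for i in range(8)]
print({k: shard_for(k) for k in keys})
```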
Graph Databases

A graph database maintains one single structure: a graph (Rodriguez, 2010). A graph is a flexible data structure that allows for a more agile and rapid style of development (Neo4J, 2011). A graph database has three main attributes:

1. Node – the record that holds a data item.
2. Relationship – a link connecting a data item to related data in the same or another node.
3. Property – an attribute of the data.

(Neubauer, 2010)

The purpose of graph databases is to quickly determine the relationships between different items of data. Examples of graph databases include the Neo4j database and Twitter's FlockDB, which is used to connect tweets with those who post them and all of their followers (Weil, 2010).

2.3.2 Hadoop

Hadoop/MapReduce

Hadoop is a distributed data processing framework originally developed by Doug Cutting at Yahoo (White, 2010, p. 9), modelled on infrastructure designs published by Google (Apache, 2011). Throughout its short history, developers have added components that allow Hadoop to process the data it collects more efficiently. Hadoop contains a number of components that allow the system to scale to large clusters of machines without impacting the overall integrity of the data stored on those machines.

The main component of Hadoop is MapReduce, a framework for processing large datasets that are distributed across multiple nodes/servers. The 'map' part of the framework takes the original input data and partitions it, distributing the original input to different nodes. The individual nodes can then, if necessary, redistribute the data again to other sub-nodes.
MapReduce then applies the map function in parallel to every item in the dataset, producing a list of key-value pairs (White, 2010, p. 19). The 'reduce' part of the framework then collects all of the values that share a common key, combines them, and returns a single output value (or list of values) per key (White, 2010, p. 19). The reduce function, in effect, removes duplication within the system, allowing queries to return results more speedily.

Hadoop is designed for distributed data, with a dataset split between multiple nodes if necessary. If MapReduce must query data located on multiple nodes, the map function will process all the relevant data on a single node and return a result, doing the same on every node holding relevant data. The reduce function will then take all of those map results and reduce them down to single values, to return the query result(s) (White, 2010, p. 31). Both functions are oblivious to the size of the dataset they are working on. As such, they can remain the same irrespective of whether the dataset is large or small. Additionally, if you double the input data, a job will take twice as long; but if you also double the size of the cluster, the job will run as fast as the original one (White, 2010, p. 6).

HDFS

HDFS is the file system that allows Hadoop to distribute data across multiple nodes/machines. HDFS stores data in blocks, in a similar fashion to other file systems. However, while other file systems use small blocks, HDFS by default uses large blocks. This reduces the number of seeks Hadoop must make in order to answer a query, speeding up the process (White, 2010, p. 43).
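Before moving on to Hadoop's components, the map and reduce steps described above can be illustrated with a minimal word count in plain Python. This is a sketch of the programming model only; real Hadoop jobs are typically written in Java against Hadoop's own API, and the function names here are our own.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document: str):
    """'Map': emit a (key, value) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """'Reduce': combine all values sharing a key into one result."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: in Hadoop this runs in parallel, one task per input split.
pairs = chain.from_iterable(map_fn(d) for d in documents)

# Shuffle: group intermediate values by key (done by the framework).
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase: one call per distinct key, also parallelisable.
counts = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Because neither function cares where its input lives, the same two functions run unchanged whether the dataset occupies one machine or a thousand, which is precisely the property described above.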
2.3.2.1 Components of Hadoop

HBase

Based on Google's Bigtable, HBase was developed by Chad Walters and Jim Kellerman at Powerset. The purpose of HBase was to give Hadoop a means of storing large quantities of fault-tolerant data. It can also sit on top of Amazon's Simple Storage Service (S3) (Wilson, 2009). HBase was developed from the ground up to allow databases to scale simply by adding more nodes (machines) to the cluster on which HBase/Hadoop is installed. As it does not support SQL, it can do what an RDBMS cannot: host data in sparsely populated tables located on clusters made from commodity hardware (White, 2010, p. 411). HBase is structured around a 'master node', which has control of any number of 'slave nodes' called region servers. The master node is responsible for assigning regions of the data to the region servers, as well as for the recovery of data in the event of a region server failing (White, 2010, p. 413). In addition, HBase is designed with fault tolerance built in: thanks to HDFS, it keeps three copies of the data spread across different data nodes (Dimitrov, 2010).

Hive

Hive is a scalable data processing platform developed by Jeff Hammerbacher at Facebook (White, 2010, p. 365). The purpose of Hive is to allow individuals who have strong SQL skills to run queries on data stored in HDFS. When querying the dataset, Hive first converts SQL-style queries into MapReduce jobs, together with custom commands that allow it to target different partitions within the HDFS dataset, allowing users to query specific data within the Hadoop cluster (White, 2010, p. 514). Hive thus provides users with the traditional query model of older RDBMS environments within the newer distributed NoSQL database environment.

2.3.3 Cassandra

Cassandra is a fault-tolerant, decentralised database that can be scaled and distributed across multiple nodes (Apache, 2011; Lakshman, 2008). Developed by Avinash Lakshman at Facebook (Lakshman, 2008), Cassandra is now an open source project run by the Apache Foundation (Apache, 2011). Initially designed to solve a search indexing problem, Cassandra was designed to scale to very large sizes across multiple commodity servers. Additionally, it was built to have no single point of failure (Lakshman, 2008). Since Cassandra was designed to scale across multiple servers, it had to overcome the possibility of failure at any given location within each server, such as a drive failure.
To guard against such a possibility, Cassandra was developed with the following functions:

Replication

Cassandra replicates data across different nodes as it is written. When data is requested, the system accesses the closest node containing that data. This ensures that data stored in Cassandra maintains high availability (HA), one of the core attributes of a NoSQL database. Once data is written to one server, a duplicate copy is then written to another node within the database (Lakshman, 2008).

Eventual Consistency

Cassandra uses BASE to determine the consistency of the database. In order for data to be accessible, an individual reading the data accesses it on one node; at the same time, another individual can be making changes to another copy of the data on another node. As the data is replicated, newer versions of the data sit on some nodes while older versions are still active on others (Apache wiki, 2011). Users of Cassandra can also tune the level of consistency, allowing a write to be acknowledged once it reaches a single copy of the data on one node or, at the other extreme, only once it reaches all copies of the data across all nodes (Apache wiki, 2011).

Scalability

Data stored in Cassandra is scalable across multiple machines. Such elasticity is possible because Cassandra allows additional machines to be added to the cluster when required (Apache, 2011).
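The tunable consistency described above can be sketched as a simple quorum scheme in Python. The sketch illustrates the general idea rather than Cassandra's actual implementation or client API: a write is acknowledged once W of N replicas have accepted it, a read consults R replicas and returns the newest version seen, and choosing R + W > N ensures that every read overlaps the most recent acknowledged write.

```python
import time

N, W, R = 3, 2, 2          # replicas, write quorum, read quorum (R + W > N)
replicas = [dict() for _ in range(N)]   # each replica: key -> (timestamp, value)

def write(key, value):
    """Acknowledge the write once W replicas have accepted it."""
    ts = time.time()
    acks = 0
    for replica in replicas:       # a real system contacts these in parallel
        replica[key] = (ts, value)  # assume this replica is reachable
        acks += 1
        if acks >= W:
            return True  # remaining replicas converge later ("eventually")
    return False

def read(key):
    """Consult R replicas and return the newest version seen."""
    versions = [replica[key] for replica in replicas[:R] if key in replica]
    return max(versions)[1] if versions else None

write("user:1", "alice")
print(read("user:1"))   # 'alice': with R + W > N the read overlaps the write
```

Lowering W towards 1 makes writes faster but increases the window in which a read may see stale data; this trade-off is exactly the choice Cassandra exposes to its users.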
2.4 The Market for RDBMSs and Non-Relational DBMSs

2.4.1 Introduction

This section gives an overview of the current market for both relational databases and the newer non-relational databases. It investigates traditional vendor database offerings as well as the proliferation over the past few years of community-developed open source database offerings. The literature review for determining the current market for both traditional relational databases and 'future' non-relational databases utilised a variety of sources, including Internet search queries to find relevant research material, as well as the University of Dublin (DU) library facilities to access the academic and commercial research to which DU has access.

2.4.2 RDBMS Market

Today, many executives want their businesses to grow based on data-driven decisions. As such, data analytics has become a valuable tool in Business Intelligence (BI). Many of the top-performing companies use analytics to formulate future strategies and to guide day-to-day operations (LaValle et al, 2010). However, organisations are accumulating more and more data without the means of extracting value from it (LaValle et al, 2010). This has created a requirement for companies to adopt enterprise solutions that can give an overview of the data being generated, using Online Analytical Processing (OLAP) databases.

The database management systems market is split into two segments: OnLine Transaction Processing (OLTP) and OLAP / Data Warehousing (DW). The RDBMS options available from vendors will generally target one of these two segments.

The OLTP market targets clients that require fast query processing and the maintenance of data integrity in multi-access environments, with a business model in which performance is measured by the number of transactions per second the database can handle.
In an OLTP model database there is an emphasis on detailed, current data, with the schema used to store the data typically normalised to the entity model in Boyce-Codd Normal Form (BCNF) (Datawarehouse4u, 2009).

OLAP databases are characterised by a low volume of transactions and are primarily designed for data warehousing. As such, they are particularly useful for data mining, whereby applications access the data to give an overview of current trends, business performance and informational advantage. OLAP databases are therefore increasingly seen as important for making Business Intelligence (BI) decisions (Feinberg and Beyer, 2010).

2.4.2.1 Vendor Offerings

Within the enterprise database market, the industry is dominated by a few big corporations, including Oracle, IBM, Microsoft, Sybase and Teradata. Many of the database offerings from these firms operate in the data warehousing sector, which contains most of the market for enterprise database management systems. While the big players have comprehensive database offerings for their clients, the market is currently being disrupted by new entrants who are targeting niche areas, focusing either on the performance of their offerings or on single-point offerings (Feinberg and Beyer, 2010).

Oracle

According to Gartner, Oracle is currently the No. 1 vendor of RDBMSs worldwide (Gartner in Graham et al, 2010), with a 50% share of the market for 2010 (Trefis, 2011). It is forecast to improve this figure to 60% by 2016, driven by sales of the Exadata hardware platform. Leveraging the high-end Exadata servers in conjunction with Oracle's database software is estimated to result in more efficient and faster online transaction processing (Graham et al, 2010). Currently, Oracle generates 86% of revenues from its database software portfolio, with 8% from its hardware portfolio. The future strategy of the company is to have clients purchase complete systems, hardware and software together, thus leveraging the power of the Exadata system to get the most out of Oracle's database technology. The result will be an increase in Oracle's revenues and its market share (Crane et al, 2011).
IBM

IBM is one of the main vendors in the market, and is the only vendor offering its clients an Information Architecture (IA) that spans all systems, including OLTP, DW and the retirement of data (Optim) (Henschen, 2011a). IBM's main offering in the RDBMS market is the DB2 database. DB2 runs on a number of platforms, including Unix, Linux and Windows. DB2 can also run on the z/OS platform, where it is used to deploy applications for SOA, CRM, DW and operational BI. IBM's RDBMS solutions are ranked No. 2 behind Oracle worldwide (Finkle, 2008); however, the company is slowly losing market share to Microsoft and Oracle due to uncompetitive pricing for its database and the greater functionality found in rival offerings.

Recently, IBM acquired Netezza (Evans, 2011), a company that provides a DW appliance called TwinFin to clients. TwinFin is a purpose-built appliance that integrates servers, storage and database into a single managed system (Netezza, 2011a). The reasons IBM acquired Netezza are the expected increase in revenues that Netezza will generate from its portfolio (Dignan, 2010) and the lack of overlap between IBM's current client list and that of Netezza (Henschen, 2011b). Additionally, the acquisition fits IBM's overall business analytics strategy, as IBM has marked BI as the key driver for IT infrastructure needs (Gartner, 2010).

Microsoft

SQL Server from Microsoft is a complete database platform designed for applications of various sizes. It can be deployed on conventional servers as well as in the 'cloud', allowing clients to scale SQL Server to their respective needs. Purely a software player, Microsoft requires hardware partners to deploy its database offerings (Mackie, 2011). Microsoft, however, finds itself under greater threat from low-cost or 'free' open source alternatives such as MySQL and PostgreSQL, as it operates primarily in the low-end and mid-market segments (Finkle, 2008). As such, when its clients look at alternative options, SQL Server may not be priced competitively enough for Microsoft to compete with open source RDBMSs.
SAP/Sybase

Sybase, recently acquired by SAP, has three main business areas: OLTP using the Sybase ASE database, analytic technology using Sybase IQ, and, interestingly, mobile technology (Monash, 2010). The deal was necessary for SAP as it was coming under increasing pressure following Oracle's acquisition of Sun Microsystems, which gave Oracle a stronger focus on integrated products based around databases, middleware and applications (Yuhanna, 2010). The deal between SAP and Sybase gives both companies significant synergies: SAP finally acquires an enterprise-class database in Sybase IQ, a database with a columnar store and advanced compression capabilities that it can now offer to its hundreds of client companies (Yuhanna, 2010).

The acquisition of Sybase also gives SAP a differentiator from its peers in the form of a mobile offering. Sybase has a number of mobile products for enterprises, including the Sybase Unwired Platform and the iAnywhere Mobile Office suite. These technologies allow companies to connect mobile devices to a number of back-end data sources (Sybase, 2011). SAP now has the ability to offer its applications embedded in Sybase mobile platforms, using the synergy between the two to improve its competitive advantage and expand into other markets (Yuhanna, 2010). Indeed, efforts are now being made to cement Sybase's lead in this segment of the market, with an initiative to make the Android OS platform enterprise-ready. This involves porting Afaria, Sybase's mobile device management and security solution, to the Android platform (Neil, 2011). With Android now reaching a 30% share of the smartphone market in the United States (Warren, 2011), the future growth for Sybase in the mobile enterprise market looks strong.

Finally, although big in the database market in the early 1990s (Greenbaum, 2010), Sybase has been considered the fourth database vendor behind Oracle, Microsoft and IBM for the past decade. The main market for Sybase's OLTP offering, Sybase ASE, has been the financial services sector, with little penetration of other enterprise sectors. It is expected that SAP will make Sybase ASE more cost-effective and make another push in this segment of the market, perhaps at the expense of the big three (Yuhanna, 2010).
Teradata

Teradata is a database vendor specialising in data warehousing and analytical applications (Prickett Morgan, 2010). Over the last year it was considered the best placed amongst its peers as a market leader in data warehousing (Feinberg and Beyer, 2011). This will be a hard position for competitors to dislodge, as products in the DW market are considered difficult to replace (Bylund, 2011). Amongst its clients are multinational corporations such as 3M and PayPal (Teradata, 2011). One of Teradata's products, the Teradata parallel database, designed for DW and OLAP functions, has an update and support revenue stream, as well as additional functions that customers are willing to pay for (Prickett Morgan, 2010). However, Teradata specialises in a single area of the database market, DW and analytics (Prickett Morgan, 2010), and as such it is exposed to any weakness that may occur within that segment.

The company recently acquired Aprimo, an enterprise marketing firm with a strong emphasis on Marketing Resource Management (MRM) and Campaign Management (CM). CM is considered by some to be mission critical, as it allows marketers to unlock the value of customer data to develop multi-channel communications. The acquisition adds value to Teradata's product portfolio without competing with its current product range, allowing the company to diversify its offerings to existing clients and future customers (Vittal, 2010).

EMC/Greenplum

Greenplum, a DW and analytics firm acquired by EMC in 2010, is the foundation of EMC's Data Computing division. Greenplum specialises in DW in the 'cloud' through its Chorus platform (Greenplum, 2011). EMC's strategy for gaining market share is to release a free community version of its database for testing, with the intent that users will eventually purchase a commercial licence. The recently released 'free' Community Edition database, a heavily customised version of PostgreSQL, is targeted at companies and developers for whom Greenplum's previous offering was not useful for creating parallel databases for DW and analytics (Prickett Morgan, 2011).
The purpose of the release is to allow developers to build and test Massively Parallel Processing (MPP) databases. If clients who develop these systems wish to use the software in a commercial environment, they will be required to purchase a licence for Greenplum Database 4.0, EMC's commercial DW offering (Kanaracus, 2011). EMC hopes that customers wishing to have greater functionality from Greenplum's database will upgrade to this commercial edition (Kanaracus, 2011).

2.4.3 Non-RDBMS Market

Open Source Databases

There are a number of open source, community-developed database solutions available on the market today. However, because these offerings are generally 'free', they do not rank highly on lists of databases in use by revenue earned; yet the total number of deployments of open source databases can rival the total deployments of traditional vendors (Von Finck, 2009).

All RDBMSs enforce a consistency model that can be inflexible for certain applications. The requirement for a record or table to be locked against viewing or other access while changes are being made slows down queries attempting to generate results for end users. Additionally, because of atomicity and consistency, not all RDBMSs can scale to the requirements of organisations that hold large quantities of data, such as Google and Facebook. With tables now in excess of 10 TB in size, querying all of that data demands a speed and processing power that traditional RDBMS offerings cannot deliver to the requirements of user companies. Newer non-relational database offerings designed to meet these requirements usually come in two forms: MPP systems and column-store databases (Henschen, 2010).
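The distinction between row and column storage can be illustrated with a small Python sketch. This is a simplification (real column stores such as Sybase IQ add compression, indexing and much else), but it shows why an analytic query touching one column reads far less data from a columnar layout than from a row layout.

```python
# The same three-column table in two physical layouts.
rows = [                       # row store: each record stored contiguously
    {"id": 1, "region": "EU", "revenue": 120.0},
    {"id": 2, "region": "US", "revenue": 340.0},
    {"id": 3, "region": "EU", "revenue": 75.0},
]

columns = {                    # column store: each column stored contiguously
    "id":      [1, 2, 3],
    "region":  ["EU", "US", "EU"],
    "revenue": [120.0, 340.0, 75.0],
}

# Analytic query: total revenue.
# Row store: every whole row is fetched just to extract one field.
total_row_store = sum(row["revenue"] for row in rows)

# Column store: only the 'revenue' column is scanned; the other columns
# are never touched, and a column of uniform type compresses well.
total_column_store = sum(columns["revenue"])

assert total_row_store == total_column_store == 535.0
```

On a 10 TB table with dozens of columns, the difference between scanning one column and scanning every row is precisely the performance gap the DW vendors above compete on.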
With the introduction of the Bigtable distributed storage system on top of the Google File System (GFS) in 2006 (Chang et al, 2006), Google demonstrated that non-relational databases can scale over multiple machines. Due to Bigtable's proprietary nature, however, efforts have been made over the past five years to develop open source equivalents of Google's software, resulting in the arrival of the Apache Foundation's Hadoop, initially developed at Yahoo (Bryant and Kwan, 2008). A number of companies have since used Hadoop and associated software to scale their database deployments to their own requirements.

The growth of Hadoop can be inferred from some unusual sources. From 2007 to early 2009, demand for expertise in Hadoop or MapReduce accounted for 0.4% of the IT jobs market in the London area. By January 2011 the figure had grown to 1.2%, a threefold increase within two years (IT Jobs Watch, 2011). Additionally, there was a 49% increase in Hadoop job postings in the United States from 2008 to 2009, with most of the openings being in California (Lorica, 2009).

However, a shortage of suitably qualified Hadoop and HBase engineers has affected development projects at a number of companies. Within Silicon Valley, Google and Facebook are two companies that can afford to remunerate staff competitively thanks to their large revenues. This has left Cloudera, the start-up cloud database company, unable to offer top engineers remuneration at similar levels to its competitors, so it has had to be imaginative: for example, it has set up offices in downtown San Francisco, on the assumption that staff would prefer to work there rather than in Palo Alto or Mountain View, both 30 miles from the centre of San Francisco (Metz, 2011a). Such constraints will limit new NoSQL database projects until an adequate supply of qualified engineers becomes available, slowing the development and adoption of this new technology for the foreseeable future.
Cassandra

Cassandra is a distributed, column-family database developed at Facebook to solve an inbox search problem (Lakshman, 2008). It is now an open source project of the Apache Foundation (Apache, 2011). In addition to Facebook, users of the Cassandra database have included the social news website Digg (Higginbotham, 2010), which decided to switch from MySQL to Cassandra due to scalability issues with MySQL. The rationale behind the move was the decentralised nature of Cassandra and the fact that it has no single point of failure (Kerner, 2010). Unfortunately, the changeover to Cassandra did not run smoothly, and Digg had to revert to MySQL to ensure data integrity and to keep its services available to clients. The episode highlighted the pitfalls of switching from one architectural framework to another (Woods, 2010).

Taking advantage of Cassandra's introduction to the market is DataStax, formerly Riptano (DBMS2, 2011), a start-up founded by the Cassandra project's chair, Jonathan Ellis. The purpose of DataStax is to take commercial advantage of Cassandra by selling expertise and technical support for it (Kerner, 2010), following the examples of Red Hat (Linux) and Cloudera (Hadoop) (Subramanian, 2010).

HBase

HBase is a non-relational database built on top of the Hadoop framework, using the Hadoop Distributed File System (HDFS). Originally developed out of a need to process large amounts of data, HBase is now a top-level Apache Foundation project (Zawodny, 2007). Due to HBase's ability to scale to large sizes, the database has received attention within IT as a platform that can meet various companies' requirements. Recent corporate announcements of HBase deployments have increased the marketplace viability of HBase as a NoSQL database option (Metz, 2011b). These include announcements from both Facebook and Yahoo, two companies with large repositories of data.

Facebook announced a new messaging platform in which email, text messages and instant messages (IM), as well as Facebook's own messaging system, would be integrated (Metz, 2010). Facebook experimented with a number of database offerings, including its own Cassandra database, to see if they could handle the new system, and excluded MySQL due to scalability issues.
Eventually, Facebook chose HBase, due to its consistency as well as its ability to scale across multiple machines (Muthukkaruppan, 2010).

HBase was deployed by Yahoo to support its news aggregation algorithm. The purpose of the system is to data-mine content in order to optimise what the viewer sees on Yahoo's web portal. To place on the front page the most relevant news stories at any given moment, Yahoo required a database that could query, in real time, the items people are most interested in, based on the number of clicks each story receives. Deployment of this system has resulted in an increase in traffic to the Yahoo web portal and, subsequently, an increase in revenues (Metz, 2008).
2.5 Case Studies

2.5.1 Case Study 1 - Utility companies and the data management challenge

Introduction

Utility companies are known to be amongst the most conservative of enterprises when it comes to investing in technology (Fink, 2010; Fehrenbacher, 2010). There are many possible reasons for this: security of supply, regulatory compliance and financial austerity, together with a lack of business drivers, often leave the risk-averse utility treading water when it comes to IT investment (Tony Giroti, CEO Bridge Energy, 2011). However, things have been changing over the last few years. According to recent research by Lux, utilities (mainly power and water) will invest up to $34 billion in technology by the year 2020 (St. John, 2011). The reason lies mainly in Smart Grid projects and the growing avalanche of associated data which utilities will need to manage (St. John, 2011). For utilities, the business drivers required to justify investment in the kind of technology which enables integration of data across key business units have only recently emerged; real-time applications simply were not necessary before now (Giroti, 2011).

Utilities

History has shown that utilities are by and large reactionary when it comes to new ideas. For example, a snapshot of energy utility-related articles in the ProQuest database (available through the TCD Library's online resources) at various times over the last few decades shows flurries of activity around key moments of change in the industry. Cyclical changes from regulation to deregulation of the energy sector in the early 1990s, begun in the US, kick-started reactionary strategy changes within the energy industry. Ireland followed the pattern with the Electricity Regulation Act of 1999, a programme which is nearing completion. Fifty-six articles on related subjects between 1992 and 1994, in contrast to just eighteen in the following six years to 2000 (ProQuest database), would seem to support this assertion. In the last decade or so, innovation for utilities has centred on the technology enabling the Smart Grid, and again an upsurge in articles on this subject stands out in a normally 'steady state' sector.
More recently, the pressures of diminishing supply, and consequently higher prices, of the raw materials for energy production have propagated a drive towards sustainability. Compliance, however, has been a steady influence on energy utilities. What makes the Smart Grid attractive is the way it forces efficiency throughout the energy supply chain, from generation to distribution, resulting in lower CO2 emissions, a major deliverable of the Kyoto agreement. Related to this has been the drive towards sustainable energy generation and supply. Bob Arnett, Vice President of Technology at Cobb Energy, sums it up:

"In today's world, where utilities are focused on environmental concerns, resource constraints, and intelligent grids, it is sometimes hard to remember that in the mid-Nineties, the word of the day was 'deregulation'." (Arnett, 2011)

This case study looks at utility companies in the context of these three key drivers: regulation/deregulation, Smart Grid and sustainability. The case is stated in general terms initially but quickly moves to more specific Smart Grid applications in electricity supply companies, focusing on one Irish energy company's use of databases in its implementation of Smart Grid applications. As the ESB's (Electricity Supply Board) Tom Geraghty said of smart metering in a recent interview with Silicon Republic:

"How you get data back from the electronic metre to a utility central point where it is aggregated and the bill is sent out to simply allowing people to top up their metre at home as if it were a mobile phone shows you the complexity that lies ahead. There are many imaginative options emerging and the opportunities are endless," (in Kennedy, 2011)

One estimate from Lux Research puts the increase in data coming from the Smart Grid at 900% by 2020 (St. John, 2011). Tony Giroti puts this in more tangible terms: one million smart meters passing data every 15 minutes equates to some 30 TB of data per year to be handled, stored and harvested (Giroti, 2011). This figure does not include the real-time data flowing through the system as part of the self-healing attribute of Smart Grids.
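Giroti's figures are easy to check with back-of-the-envelope arithmetic, as the Python sketch below does. The reading interval and the roughly 1 KB message size are assumptions taken from his example; the article does not state how the per-transaction size is derived.

```python
METERS = 1_000_000          # smart meters in the field
INTERVAL_S = 15 * 60        # one reading per meter every 15 minutes
BYTES_PER_READ = 1024       # assumed ~1 KB per transaction (Giroti, 2011)

tx_per_second = METERS / INTERVAL_S
reads_per_day = METERS * (24 * 3600 // INTERVAL_S)
bytes_per_year = reads_per_day * BYTES_PER_READ * 365

print(f"{tx_per_second:,.0f} transactions/sec")             # ~1,111
print(f"{reads_per_day:,} meter reads/day")                 # 96,000,000
print(f"{bytes_per_year / 1024**4:.1f} TiB/year raw data")  # ~32.6
```

The straight arithmetic, roughly 33 TiB of raw readings per year and some 96 million reads per day, sits comfortably alongside the 30 TB per year figure quoted above, and gives a sense of the sustained transaction rate the back-end systems must absorb.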
The problem can be placed within the wider question asked in this dissertation, that is, what is the future of the traditional RDBMS in the enterprise? To this end, this case study proceeds from the premise that the general feeling towards newer database management solutions such as open source and NoSQL is that, while they are attractive for certain non-core applications, they are not yet up to the task of the more serious mission-critical functions of control systems, financial transactions and customer management within enterprises. The study investigates the problem in the context of traditionally risk-averse utility companies and asks whether new business drivers (of which the Smart Grid is key) are forcing a rethink on this issue.

A public utility company is an enterprise which provides key services to the public, most typically electricity, gas, water and transportation. It may be state or privately owned, and may operate in a regulated, deregulated or semi-regulated market (Legal Dictionary). The energy sector in Ireland is currently undergoing dramatic change. The two largest energy companies in Ireland, the Electricity Supply Board (ESB) and Bord Gáis, are commercially run enterprises, both majority owned by the state. Both companies have recently entered each other's markets as a result of the state's requirement (driven by the EU) to open up the energy market in an attempt to improve the competitiveness of the sector for the benefit of consumers (Irish Government White Paper, 2007). One result of this restructuring is that the previously separate electricity and gas markets have been combined, and the sector is now generally referred to as the energy market.

The functions carried out by utility companies differ according to the services they provide. Energy suppliers are similar in the functions they carry out, such as the generation, transmission and distribution of energy. Water utilities in other countries have moved towards a revenue-generating model for water supply, and Ireland, rightly or wrongly, may soon follow suit. Each core function contains a number of supporting IT applications, each of which is in turn supported by a suitable data management system. Some of the major solutions used in energy utilities include: Geographical Information System (GIS); Meter Data Management (MDM); Customer Information System (CIS); Distribution Management System (DMS); Supervisory Control and Data Acquisition (SCADA); and Outage Management System (OMS). Figure 2.3 shows where some of these systems fit into the overall network.
Each of these systems supports the specific needs of the different business functions, such as supply, generation, distribution, trading and operations. As such, they may or may not be integrated. In relation to meter data management (MDM), Giroti again states the problem succinctly in his paper "You've Got the Meter Data – Now What?" (2011), where he gives two options:

1. Have a proactive strategy for integrating and managing the data coming from the grid, or...
2. Be reactive in response to problems as they appear, at the risk of being left behind by competitors adopting the former strategy.

Smart Grid - The ESB case

The European Technology Platform definition of smart grids is: "electricity networks that can intelligently integrate the behaviour and actions of all users connected to it - generators, consumers and those that do both – in order to efficiently deliver sustainable, economic and secure electricity supplies" (Smart Grids: European Technology Platform, 2010).

Successful smart grid implementation depends on how enterprises use information systems to manage the torrent of data heading their way. This issue puts data management systems right back in the foreground of the IT game. The ESB plans to invest up to €11 billion in sustainable projects, including a Smart Grid (Strategy Framework 2020). The ESB began a pilot project for advanced metering in 2007. Advanced meters occupy what is termed the head end of the smart grid. They reside on customer premises or at the company's own locations, typically at the edge of the distribution network. The ESB has to date installed 6,500 smart meters; the estimated total required for full implementation is over two million. The data consists of messages to and from a central management system called a meter data management (MDM) system. The messages can carry meter data relating to load readings, voltage and temperature measurements, outages, faults and other events.

The ESB's existing data management platforms include solutions from Oracle, IBM and Microsoft. Currently no open source or NoSQL solutions exist in any official way in the company. A preliminary evaluation of the open source database solution MySQL was carried out by the IT department in 2010, but no decision on implementation has yet been made.
MySQL is now under the roof of the Oracle house following Oracle's acquisition of Sun Microsystems in 2010 (Lohr, 2009).

Figure 2.2 – Overview of a generic Smart Grid
(Image source: http://www.consumerenergyreport.com/wpcontent/uploads/2010/04/smartgrid.jpg)
Figure 2.3 - ESB proposed implementation of Advanced Metering (key area of interest is circled)
(Image source: EPRI)

The Data Volume Problem

A traditional electricity grid is made up of electro-mechanical components that link electricity generation, transmission and distribution to consumers. A smart grid builds on advanced digital SCADA devices involving two-way communication of data of interest to utilities, consumers and government (Financial Times, Nov 2010). Figures for how much data will flow vary depending on the implementation of the smart grid. Estimates from the ESB's trials involving 6,500 meters show a substantial increase in the amount of data that must be stored and analysed at the back end. Utilities, it seems, are not immune to 'Big Data'.

Tony Giroti is well qualified to comment on the issue: he is one of only 13 elected members of the GridWise Architecture Council, formed by the US Department of Energy for the purpose of articulating the way forward for intelligent energy systems. In his article for the e-magazine Electric Energy Online, "You've Got the Meter Data – Now What?" (2011), Giroti states the data volume problem as follows:
Figure 2.4 – Smart Meters transaction rate: 1 million smart meters at 1 read every 15 minutes = 1,000,000 meter reads / (15 mins x 60 secs) = 1,111 transactions per second; at 1 KB per transaction per meter, a data rate of roughly 1.1 MB/s.

Giroti foresees the storage and processing concerns associated with this volume of data.

Figure 2.5 – Smart Meters data size: 1 million smart meters with hourly collections of data => 3.6 GB of data per day to be stored, analysed and backed up.

Processing of this data also presents a challenge to system architects. Gathering data from a million smart meters at 15-minute intervals, as in the example above, equates to 1,111 transactions per second, or some 96 million transactions per day. The problem is further compounded by the critical requirement that the system analyse network event transactions in real time when responding to fluctuations in demand and to faults (Giroti, 2011).

One limitation of Giroti's claim is that there is no indication in the article of how the one kilobyte per transaction figure is calculated. This is an important factor for vendors of back-end processing running on relational databases: the lower this number, the better. Some systems rely on filtering out less important data at the source, that is, at the meter itself, rather than storing superfluous data at the back end. For example, meter location information does not change and can be sent only once. Even at a conservative data size of 128 bytes per