School of Computer Science and Statistics
TRINITY COLLEGE
What is the future of the RDBMS in the Enterprise?
Stuart Clancy
Edward Fitzpatrick
Degree Year
BSc (Hons) Information Systems
11th April 2011
A Dissertation submitted to the University of Dublin in partial fulfilment of the
requirements for the degree of BSc (Hons) Information Systems
Date of Submission: 11th April 2011
Declaration
We declare that the work described in this dissertation is, except where otherwise stated,
entirely our own work, and has not been submitted as an exercise for a degree at this or any
other university.
Signed:___________________
Stuart Clancy
Date of Submission:
Signed:___________________
Edward Fitzpatrick
Date of Submission:
Permission to lend and/or copy
We agree that the School of Computer Science and Statistics, Trinity College may lend or
copy this dissertation upon request.
Signed:___________________
Stuart Clancy
Date of Submission:
Signed:___________________
Edward Fitzpatrick
Date of Submission:
Acknowledgements
We would like to acknowledge and thank Ronan Donagher, our project supervisor and Diana
Wilson, the acting course director for their support, guidance and understanding throughout
our research project.
We would also like to acknowledge the unfailing support of our families, who have
encouraged us throughout the years of our study; our employers and work colleagues, who
have been patient and flexible with working arrangements in order to allow us to complete
our studies; and close friends who on occasion are called upon to provide a welcome
distraction and perspective.
Signed:___________________
Stuart Clancy
11th April 2011
Signed:___________________
Edward Fitzpatrick
11th April 2011
Abstract
Managing data and information has been a feature of human activity since the first
acknowledged symbols were etched onto stones by Neolithic humans. Since the emergence
of the Internet, the volume of data available to people and machines has been growing rapidly.
This dissertation looks at what this means for the traditional relational database management
system (RDBMS). It asks if there is a future for the RDBMS in enterprise information system
architecture. It also examines the early developmental years of the RDBMS in order to gain an
insight into why it has enjoyed relative longevity within a rapidly changing technology
environment. New types of database and data management systems are discussed, such as
NoSQL and other open source non-relational DBMS’ including Hadoop and Cassandra. The
data volume and data type problem is absorbed into various sections under the umbrella term
‘Big Data’. Utility companies and social networking sites, two sectors where the
management of large data volumes is a growing concern, are examined in the two case
studies. A separate chapter on our chosen research methodology is included. It provides
the necessary balance between subject matter and method as set out in the initial
requirements.
Keywords:
Relational Theory, DBMS, RDBMS History, NoSQL, Hadoop, Cassandra, Database Market,
Big Data, Research Methodology.
Table of Contents
Abstract....................................................................................................................................VI
List of Figures...........................................................................................................................X
List of Tables.............................................................................................................................X
List of Abbreviations...............................................................................................................XI
Chapter One - Introduction.................................................................................................... 1
1.1 The Research Question ........................................................................................... 1
1.2 Document Roadmap ................................................................................................ 2
Chapter Two - Literature review, findings and analysis ......................................................... 4
2.1 Introduction............................................................................................................. 4
2.2 RDBMS................................................................................................................... 4
2.2.1 History of the RDBMS ....................................................................................... 10
2.2.2 Main Features of ‘true’ RDBMS......................................................................... 13
2.2.3 IBM, Ellison and the University of California, Berkeley..................... 15
2.3 New Databases ...................................................................................................... 19
2.3.1 Features of NoSQL Databases ............................................................................ 20
2.3.2 Hadoop............................................................................................................... 23
2.3.2.1 Components of Hadoop ................................................................................... 24
2.3.3 Cassandra ........................................................................................................... 25
2.4 The market for RDBMS’ and Non-Relational DBMS’........................................... 27
2.4.1 Introduction........................................................................................................ 27
2.4.2. RDBMS Market................................................................................................. 27
2.4.2.1 Vendor Offerings............................................................................................. 28
2.4.4 Open Source Databases....................................................................................... 32
2.4.4.1 Non-RDBMS Market....................................................................................... 32
2.5 Case Studies .......................................................................................................... 36
2.5.1 Case Study 1- Utility Companies and the Data Management challenge ............... 36
2.5.1.1 Introduction ..................................................................................................... 36
2.5.1.2 Utilities............................................................................................................ 36
2.5.1.3 Smart Grid - The ESB case .............................................................................. 39
2.5.1.4 The Data Volume Problem............................................................................... 41
2.5.1.5 How one utility company is meeting the data volume challenge....................... 44
2.5.1.6 What is the ESB doing? ................................................................................... 45
2.5.1.7 Conclusion....................................................................................................... 46
2.5.2 Case Study 2 - Social Networks – The migration to Non-SQL database models.. 47
2.5.2.1 Facebook Messages ......................................................................................... 48
2.5.2.2 Twitter - The use of NoSQL databases at Twitter............................................. 49
Chapter Three - Research Methodology .............................................................................. 52
3.1 Introduction........................................................................................................... 52
3.2 The strategy adopted for researching the question.................................................. 53
3.3 A Theoretical Framework...................................................................................... 55
3.4 Research Design.................................................................................................... 57
3.5 Methodology - A Qualitative Approach................................................................. 58
3.6 Methods................................................................................................................. 58
3.6.1 Method - Analytic Induction............................................................................... 59
3.6.2 Method - Content Analysis ................................................................................. 59
3.6.3 Method - Historical Research.............................................................................. 59
3.6.4 Method - Case Study........................................................................................... 60
3.6.5 Method - Grounded Theory ................................................................................ 60
3.7 Ethics Approval.................................................................................................... 61
3.8 Audience .............................................................................................................. 61
3.9 Significance of research......................................................................................... 61
3.10 Limitations of the research methodology ............................................................. 62
3.11 Conclusion....................................................................................................... 62
Chapter Four - Conclusions, Limitations of Research and Future Work............................... 63
4.1 Introduction........................................................................................................... 63
4.2 Conclusions........................................................................................................... 64
4.2.1 RDBMS.............................................................................................................. 64
4.2.2 New DB’s........................................................................................................... 64
4.2.3 Market................................................................................................................ 65
4.2.4.1 Case Study 1 - Utility Companies .................................................................... 66
4.2.4.2 Case study 2 - Social Networks........................................................................ 66
4.3 Future Research..................................................................................................... 67
4.3.1 NoSQL ............................................................................................................... 67
4.3.2 Case Studies ....................................................................................................... 68
4.3.3 Business Intelligence .......................................................................................... 68
4.3.4 Research Methodology ....................................................................................... 68
4.4 Limitations of the Research ................................................................................... 69
4.5 Final thoughts........................................................................................................ 70
REFERENCES............................................................................................................ 71
APPENDIX 1.............................................................................................................. 85
List of Figures
Figure 2.1 - A simplified DBMS .......................................................................................... 9
Figure 2.2 – Overview of a generic Smart Grid ................................................................... 40
Figure 2.3 - ESB proposed implementation of Advanced Metering ..................................... 41
Figure 2.4 – Smart Meters transaction rate ........................................................... 42
Figure 2.5 – Smart Meters data size ...................................................................... 42
Figure 2.6 - Sources of Smart Grid data with time dependencies .......................... 43
List of Tables
Table 2.1 - Impact of unstructured data on productivity............................................................8
Table 2.2 – Example of redundant rows in a database.............................................................14
Table 3.1 - Key concepts in Qualitative and Quantitative research methodologies.................54
Table A.1 - Edgar Codd’s original relational model terms ..................................... 85
List of Abbreviations
ACID – Atomicity, Consistency, Isolation and Durability.
ACM – Association for Computing Machinery.
BA - Business Analytics.
BASE - Basically Available, Soft state, Eventual consistency.
BI - Business Intelligence.
BSD - Berkeley Software Distribution.
CA - Computer Associates.
CAP - Consistency, Availability and Partition tolerance.
CIS – Customer Information System.
CODASYL – Conference on Data Systems Language.
CRM - Customer Relationship Management.
DBMS – Database Management System.
DMS – Distribution Management System.
DW - Data Warehousing.
ERM - Enterprise Relationship Management.
GB - Gigabyte
GBT - Google Big Table.
GFS - Google File System.
GIS - Geographical Information System.
HA - High Availability.
HDFS - Hadoop Distributed File System.
IA – IBM’s Information Architecture.
IBM - International Business Machines.
ISV - Independent Software Vendor.
IT – Information Technology.
KB - Kilobyte
MB - Megabyte
MDM - Meter Data Management.
MPL - Mozilla Public Licence.
MR - MapReduce.
NoSQL – ‘No’ SQL or more often ‘Not Only’ SQL.
OEM - Original Equipment Manufacturer.
OLAP - Online Analytical Processing.
OLTP - Online Transaction Processing.
OMS - Outage Management System.
OS - Operating System.
OSI - Open Source Initiative.
PB - Petabyte
PDC - Phasor Data Concentrators
PLM - Product Life-cycle Management.
RDBMS - Relational Database Management System.
SCADA - Supervisory Control and Data Acquisition.
SOA - Service Oriented Architecture.
SQL - Structured Query Language.
TB - Terabyte
Chapter One - Introduction
Humans have been storing information outside of the brain since before the first consistent
markings were made on bone, such as those found in Bulgaria dating from more than a
million years ago. The practice continued through the later Neolithic clay calculi bearing
symbols representing quantities and the cave paintings at Lascaux of over 17,000 years ago,
through to the invention of the moveable type printing press and eventually to the first
computers. Since the emergence of the information age over the last fifty years or so, the
amount of data transferred and stored in computers has grown rapidly. Research from the
International Data Corporation (IDC) in 2008 puts that growth at 60% per annum (The
Economist, 2010).
An added complexity is that executive strategies now treat business intelligence for
competitive edge as a key goal. Data management systems that for many years have been the
old reliable work horse toiling away in the back end somewhere are once again playing a key
role in driving business growth. The question is, are they still capable of carrying out this new
and challenging task? This dissertation asks that question and more specifically what is the
future for the Relational Database Management System (RDBMS) in the Enterprise?
The data volume problem now has a name: ‘Big Data’. Its nascence coincides with the growth
of the Internet. Alternative solutions to the traditional RDBMS for dealing with ‘Big Data’
soon followed. Many of these solutions are based either on massively parallel processing
(MPP, a.k.a. distributed computing) or on flipping the row store of the RDBMS into a column
store system. More recently, MPP solutions are being positioned not as alternatives but as
complements to RDBMS (Stonebraker et al., 2010). Add to this mix a dynamic data
management market in which vendors are acquiring new technology, merging with each
other, adopting open source and creating hybrid stacks in an effort to gain advantage in a
market forecast to grow to $32 billion by 2013 (Yuhanna, 2009).
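The ‘flipping’ of a row store into a column store mentioned above can be sketched in a few
lines. This is an illustrative toy example with invented data, not the implementation of any
particular product:

```python
# Illustrative sketch: the same small table held first as a row store, then
# as a column store. Row stores favour whole-record access; column stores
# favour analytical scans over a single attribute.

rows = [
    {"id": 1, "city": "Dublin", "kwh": 412},
    {"id": 2, "city": "Cork",   "kwh": 388},
    {"id": 3, "city": "Galway", "kwh": 455},
]

# Row store: one complete record is retrieved in a single read.
record = rows[1]

# Column store: each attribute is kept as its own array (the "flipped" layout).
columns = {k: [r[k] for r in rows] for k in rows[0]}

# An aggregate over one attribute touches only that column's data.
total_kwh = sum(columns["kwh"])
```

A transaction-processing workload (fetch one customer's record) suits the row layout, while
an analytical query (sum consumption across all customers) suits the column layout.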
1.1 The Research Question
Time was taken to carefully frame our research question so as to provide a clear path of
exploration of the subject. The question could have been framed as a hypothesis with a
predetermined stance, such as: “The future for RDBMS in the Enterprise is looking bright”,
or the contrary statement, “The end is nigh for RDBMS”. We chose instead to frame our
research as an open-ended question to
allow for a broad exploration of the subject with no preconception of the outcome. The
broadness of scope however is necessarily tempered by restricting our research to those
organisations defined as enterprises. There is difficulty here as there is no overarching
definition for an enterprise organisation. However, it is necessary to provide some clearly
defined boundaries around the term. For this dissertation an enterprise is defined not by size
or function alone.
Enterprises for us are organisations where the scale of control is large. They include
companies with a large number of customers and employees, as well as companies that
control a large infrastructure or several functional units. Enterprises have one top-level
strategy to which all other functional units are aligned. The last point is an important
characteristic of an enterprise for our dissertation as it applies to decision making for
acquiring information management systems.
The presence of the word 'future' is central to locating the research in an exploratory and
intuitive research domain. It prompts looking into the past in an attempt to explain the present
and predict the future. It forces an open mind and questioning approach. It enables the
creation of new ideas which are either taken up or set aside for another time. The chapters
and sections are set out below in an attempt to follow this map, in the belief that the journey
is the objective rather than the destination.
1.2 Document Roadmap
In writing this dissertation a balance was sought between addressing the issues raised by the
initial question and the research methodology chosen. The bulk of this dissertation therefore
centres on those two areas. In this chapter we introduce the concept of our research and
explain why we find it interesting. The research question is explained and the objective is put in
context. Chapter two contains the literature review. The chapter begins with an outline of
RDBMS, its features and history of development. Particular attention is given to the role of
IBM in the development of RDBMS. The chapter moves on to discuss new databases and
data management systems. A section on the DBMS market follows and presents an overview
of the current vendor offerings. The market section does not attempt a comparison of
available systems as this work was carried out in greater detail by others more expert than us.
Throughout the dissertation we refer the reader to such work where it is not feasible for us to
reproduce it.
Two case studies are included for the benefit of putting the research question in a practical
context. The two areas chosen involve contrasting enterprises. On one hand there is the
relatively long established utilities sector and on the other the new phenomenon of social
networking and its associated companies. Even though they operate in widely different
markets generating different types of data, they both share similar problems when it comes to
managing large amounts of data. Likewise, both are trying to get to grips with extracting
value out of data for competitive edge.
Chapter three discusses our chosen research methodology. It deserves a chapter to itself in
view of the objective of this dissertation. The chapter begins with an introduction on research
theory. It then moves to a discussion on our research strategy. A research framework is
introduced as a model of our strategy. The different methodologies available are outlined and
our chosen option is explained. Next, a group of related research methods are outlined and
the reason for their selection is stated. Short sections on ethics approval, audience and the
significance of the research follow before a final section on the limitations of our chosen
research methodology closes the chapter.
The final chapter attempts to pull together the conclusions and findings from all the
previous sections. Relevant research threads and ideas not covered in sufficient detail in the
dissertation are mentioned. The last sections present a summary of the limitations of the
overall research and our final concluding thoughts.
Chapter Two - Literature review, findings and analysis
2.1 Introduction
In this section the focus is on RDBMS. The intention is to provide an overview of its defining
features. It is not an in-depth technical analysis of RDBMS; we would refer the reader to
more detailed papers on the subject, such as those published in the Communications of the
Association for Computing Machinery (ACM), to which we refer several times. It also sets
out the background to the development of RDBMS. Within that context an interesting
discovery is made with respect to IBM’s initial role in the development of database
management systems. For the purpose of exploring the question on the future of RDBMS
some associated concepts are discussed such as data types, ‘true’ RDBMS, and whether or
not the past can teach us something about the future.
2.2 RDBMS
Databases
It is unfortunate that in the realm of Information Technology (IT) acronyms are not always
self-explanatory. Many such acronyms don’t travel well outside of their specific domain.
Take for example DQDB or Distributed Queue Dual Bus; outside of the world of high speed
networks this may seem to be a very efficient urban transport vehicle. Luckily the term
RDBMS contains within itself the individual components which define it: a system (S)
composed of a database (DB) where information is stored by creating relationships (R)
between data elements, and which can be managed (M) by users. It is helpful at this point to
explain at least the hierarchy of these components.
Throughout this dissertation data (singular: datum) and information are taken to be
classifications of entities stored in a system. Data is lowest in the sense of the taxonomy
data – information – knowledge – wisdom (sometimes called understanding), but not lower in
real value; a single digit integer may be enough data to invoke the required wisdom to make
an important decision. For the purpose of simplicity, data here means a binary entry (such as
yes or no, 1 or 0), or a nominal entry (such as dog, 470, Smith, XRA9000 etc.). An analogy
from biology might see data as the molecules which make up a cell of information. The word
‘molecules’ is carefully suggested instead of ‘atoms’ given that ‘atomicity’ has particular
significance for relational databases. Permitting an extension of the analogy would see a body
of knowledge built from the cells of information. It would be unwise to stretch the analogy
further to address wisdom. Unhelpfully, the words ‘data’ and ‘information’ are often used
interchangeably in research literature. Some examples of this are the concepts ‘Big
Data’ and ‘unstructured data’ for what really ought to be called information. For this reason
and for the purpose of consistency this dissertation will hold with the literature and consider
the two terms as one except where a distinction is required.
A database has been defined in a number of sources as a “collection of related data or
information” (Bocij et al. 2006, p. 153; Elmasri and Navathe, 1989, p. 3).
The Oxford English dictionary defines a database as a “structured set of data held in a
computer” (OED). However, the Cambridge Advanced Learner’s online dictionary (2011) is
perhaps closer to a contemporary definition:
“A large amount of information stored in a computer system in such a way that it can
be easily looked at or changed”.
It is noted that the definition in the later online edition of the Cambridge (2011) does not have
any explicit reference to relational, structured or organised data. This looser definition
reflects the changing nature of data management as newer types and bigger volumes of data
are being captured.
Finally, a definition from the business world expands on the above, mentioning different
types of data and hinting at the issues regarding scale:
A database is “a systematically organized or structured repository of indexed information
(usually as a group of linked data files) that allows easy retrieval, updating, analysis, and
output of data. Stored usually in a computer, this data could be in the form of graphics,
reports, scripts, tables, text, etc., representing almost every kind of information.” (Business
Dictionary, 2011).
Structured and unstructured data.
The last definition above alludes to unstructured data. Unstructured data is data in the form of
text (words, messages, symbols, emails, SMS texts, reports) or bitmaps (images, graphics). A
good example of the growing relevance of unstructured information is a Facebook page
containing images, short messages, links, and chunks of text that can be altered at any time.
Structured data by contrast is any data “that has an enforced composition to the atomic data
types” (Weglarz, 2004). Atomicity is a characteristic of a stored entity that is not divisible
(Elmasri and Navathe, 1989, p. 41). Atomicity is a key necessity for defining structured data
and is what relational databases rely on to make relationships. A database designer can decide
on the exact rules for the structured data and the level of atomicity required. As an aside, it is
often this small amount of flexibility in the design of the data model which is responsible for
the creation of many ‘bad’ databases. Structured data is data that is consistent, unambiguous
and conforms to a predefined standard. Structured data will be examined in more detail later
under the section discussing RDBMS. A third type is semi-structured data. This is data held
in a standard format such as forms, spreadsheets and XML files. This type of data can be
parsed by computer programs more easily than unstructured data due to the data generally
being located in a fixed and known place, even if the data itself is not atomic.
The problem of structured versus unstructured data types can be stated using the example of
two schools. One school grades students in the traditional way by giving a numerical grade
following examination. Another school does not give numerical grades to students, preferring
a method whereby students are furnished with a qualitative report on their overall
performance. The former is structured data as the meaning of a grade of 82% is consistent in
the context of the school’s grading system. It can be easily recorded, measured, and compared
to other grades internally or from other schools using the same system. The report format
however is unstructured and comparison with a numerical grading system is not so easy.
Gleaning relevant information from a text report is complex and involves semantic analysis
with or without the help of technology.
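The two-schools contrast can be made concrete with a short sketch. The names, grades and
report text below are invented for illustration:

```python
# Illustrative sketch of the two-schools example: structured numerical grades
# are directly comparable; a free-text report needs interpretation first.

structured_grades = {"Murphy": 82, "Byrne": 74, "O'Neill": 91}

# Structured data: aggregation and comparison are trivial and unambiguous.
top_student = max(structured_grades, key=structured_grades.get)
mean_grade = sum(structured_grades.values()) / len(structured_grades)

# Unstructured data: a qualitative report with no enforced composition.
report = "Murphy shows strong progress in algebra but struggles with essays."

# A naive keyword scan is the best a simple program can do here; genuine
# comparison with a numerical grade would require semantic analysis.
is_positive = any(word in report.lower() for word in ("strong", "excellent"))
```

The keyword scan already illustrates the weakness: it flags the report as positive while
ignoring the qualification about essays, which a human reader would weigh immediately.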
What does this mean for enterprises?
Eighty percent of information relevant to business is unstructured and mostly in textual form
(Langseth in Grimes, 2011). Seth Grimes, an analytics expert with the Alta Plana Corporation,
has previously investigated this claim. He concludes that even if the origins of the 80% are
elusive (Grimes tracks it back as far as the 1990s), experience supports the claim (Grimes,
2011). Patricia Selinger (IBM and ACM Fellow), who has worked on query optimisation for
27 years, puts unstructured data in companies at about 85% (Selinger, 2005). Even assuming a
lower figure than 80% for unstructured data in larger enterprises, where much information is
in structured forms held in traditional transaction based databases, there is still the problem of
how to leverage competitive advantage out of the nuggets of information buried in the rich
seams of unstructured data. Businesses are realising that the chances of extracting valuable
wisdom from traditional data stores using stale analysis methods and tools are diminishing
and that new ideas are needed.
Unstructured data is growing faster than structured data, according to the "IDC Enterprise
Disk Storage Consumption Model" 2008 report, “while transactional data is projected to
grow at a compound annual growth rate (CAGR) of 21.8%, it's far outpaced by a 61.7%
CAGR prediction for unstructured data” (Pariseau, 2008).
Kevin McIssac (2007) of Computer World magazine puts it into perspective:
“Unfortunately business is drowning in unstructured data and does not yet have the
applications to transform that data into information and knowledge. As a result staff
productivity around unstructured data is still relatively low.”
McIssac gives examples of the impact of unstructured data on productivity citing research
from various sources. Table 2.1 below summarises those impacts:
Time/Volume          | Impact on productivity                                                                | Research Source
---------------------|---------------------------------------------------------------------------------------|-----------------
9.5 hours per week   | Average time an office worker spends searching, gathering and analysing information (60% of that on the Internet). | Outsell
10% of working time  | Time professionals in the creative industry spend on file management.                 | GISTICS
600 e-mails per week | Sent and received by a typical business person.                                       | Ferris Research
49 minutes per day   | Time an office worker spends managing e-mail. Longer for middle and upper management. | ePolicy Institute

Table 2.1 - Impact of unstructured data on productivity.
Where are the joins?
It seems that a reappraisal of what a database is or needs to do is well under way. If this is so,
then this reappraisal logically extends to the database management system. Structured data
can be joined to other structured data to form concatenations of information using a query
language based on mathematical operations. Things get a little more ‘fuzzy’ with
unstructured data. Stock market analysts might like to try querying online media sources
for all posts where the word ‘oil’ is used, but only in the context of the recent crisis in Libya.
How unstructured and unrelated data is to be stored in the system and how meaningful
information can be retrieved back out of that same system are questions many organisations
are now asking – but, similar questions were asked before and the past may hold some
lessons for us.
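A minimal sketch of this contrast, using invented customer and post data: joining structured
records on a shared key is a mechanical operation, while a keyword query over free text
cannot distinguish context on its own.

```python
# Illustrative data only. A relational join concatenates structured records on
# a shared key; unstructured text permits only approximate keyword matching.

customers = [(1, "Acme Oil"), (2, "Beta Gas")]
orders = [(101, 1, 5000), (102, 2, 3200), (103, 1, 750)]

# Equivalent of: SELECT c.name, o.amount FROM customers c
#                JOIN orders o ON o.customer_id = c.id
joined = [(name, amount)
          for (cid, name) in customers
          for (_, ocid, amount) in orders
          if ocid == cid]

# Unstructured query: every post mentioning 'oil' matches, with no way to
# restrict the hits to one context (e.g. the Libyan crisis) by keyword alone.
posts = ["Oil futures spike on Libya unrest.", "New olive oil recipe!"]
hits = [p for p in posts if "oil" in p.lower()]
```

The join yields exactly the rows the key relationship defines, whereas the keyword query
returns both posts, including the cooking post the analyst never wanted.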
A DBMS
In its simplest definition a DBMS is a set of computer programs that allows users to create
and maintain a database (Elmasri & Navathe, 1989, p. 4). Bocij et al. (2006, p. 154) expands
on this definition a little: “One or more computer programs that allow users to enter, store,
organise, manipulate and retrieve data from a database.”
(Source: Elmasri and Navathe, 1989 p. 5)
Figure 2.1 - A simplified DBMS
Figure 2.1 above shows the key components of a data management system. A detailed
description of each of the components of the system is not necessary for our purpose but
briefly they are:
• Application programs with which users can interact with the stored data.
• Software programs for processing and accessing the stored data.
• A high-level declarative language interface for executing commands (commonly
known as a query language).
• A repository for storing data.
• A store of information related to the data for classifying or indexing purposes (meta-data)
• Hardware suitable for each of the above functions
• Users (includes database administrators and designers)
2.2.1 History of the RDBMS
To understand why newer types of databases and data management systems are emerging and
taking hold it seems reasonable to explore why RDBMS’ came into existence, as well as their
usefulness and relative longevity.
The 1960’s BC (Before Codd)
Data management systems existed before Edgar Codd, while at IBM, wrote his seminal paper
published in 1970 called “A Relational Model of Data for Large Shared Data Banks”. Codd’s
paper presented a new database model and hence introduced the world of database
management to relational theory (Codd, 1970). In his paper Codd discusses the limitations of
the existing hierarchical and network data systems and introduces a query language based on
relational algebra and predicate calculus.
In a later important paper he described 12 rules for a relational database management system
(Codd, 1985). Systems that satisfy all 12 rules are rare. In fact, it is argued that no truly
relational database systems existed in wide commercial production even a decade after
Codd’s vision (Don Heitzmann in Thiel, 1982), and even up to more recently (Anthes, 2010).
A brief description of the two data management systems whose limitations Codd
addressed is a useful precursor to a broader description of relational DBMS’.
Hierarchical Data Models
Hierarchical data models are similar to tree-structured file systems in that data is stored in
parent-child relationships. Codd asserted that hierarchical and network-based DBMS’ were not
data models in the formal sense of his Relational model (Codd, 1991). For
simplicity the word ‘model’ is maintained for the data structure of all systems under
discussion here. The model made sense to organisations that were naturally hierarchical in
nature - a legacy of Henri Fayol and his 14 management principles, popular in the 1960’s and
still used in organisations today (Stoner and Freeman, 1989; Tiernan et al., 2006). A
hierarchical data model can be presented as a tree-structure of parent-child relationships or as
an adjacency list. For example: a root entity with no parent might be SCHOOL; STUDENT is a
child of SCHOOL; GRADE is a child of STUDENT. STUDENT is also a child of COURSE.
In this type of structure data can be replicated many times in different branches of the tree, a
relationship of ‘one to many’ or 1:N. A ‘modified preorder tree traversal’ algorithm is used to
number each entity on the way down through the tree-structure (left value) and again on the
way back up to the root (right value), thus making query operations more efficient when
navigating around the data (Van Tulder, 2003).
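The numbering scheme can be sketched in a few lines. This is our own minimal illustration of the nested-set idea described above, not Van Tulder’s code; the SCHOOL/STUDENT/GRADE names echo the earlier example.

```python
# Minimal sketch of 'modified preorder tree traversal' numbering:
# each entity gets a left value on the way down the tree and a
# right value on the way back up.

def number_tree(node, counter=1):
    """Assign left/right values to every node via a preorder walk.
    `node` is a dict: {"name": str, "children": [node, ...]}.
    Returns the next unused counter value."""
    node["left"] = counter
    counter += 1
    for child in node.get("children", []):
        counter = number_tree(child, counter)
    node["right"] = counter
    return counter + 1

def descendants(root, parent):
    """All nodes whose left/right interval falls inside the parent's."""
    found = []
    def walk(n):
        if parent["left"] < n["left"] and n["right"] < parent["right"]:
            found.append(n["name"])
        for c in n.get("children", []):
            walk(c)
    walk(root)
    return found

school = {"name": "SCHOOL", "children": [
    {"name": "STUDENT", "children": [{"name": "GRADE", "children": []}]},
]}
number_tree(school)
# SCHOOL gets left=1, right=6; STUDENT left=2, right=5; GRADE left=3, right=4.
print(descendants(school, school))  # ['STUDENT', 'GRADE']
```

Because a subtree query is now a simple comparison of left/right values, no recursive navigation of the stored records is needed at query time.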
The first hierarchical DBMS was developed by IBM and North American Aviation in the late
1960’s (Elmasri and Navathe, 1989 p. 278). IBM imaginatively called it Information
Management System (IMS) and Frank Hayes dates its roll out to 1968 (Hayes, 2002). Elmasri
and Navathe cite McGee (1977) for a good overview of IMS (1989, p. 278).
Network Data Models
As can be seen in the hierarchical data model above, a child could have many parents. A
STUDENT, for instance, can take more than one MODULE in any COURSE YEAR. In a
hierarchical structure the same STUDENT would appear under each of the MODULE trees. In
other words, many students can take many modules. The Network data model was a further
development of the hierarchical model to address the issue of managing ‘many to many’ (M:N)
relationships. The Conference on Data Systems Languages (CODASYL) defined the network
model in 1971 (Elmasri and Navathe, 1989).
Where the underlying principle of the hierarchical model was the parent-child tree structure, in a
network model it is set theory. Records are classified into record types and given names.
These records are sets of related data. Record types are akin to tables in a relational database
model. The intricacies of set theory are beyond the scope of this dissertation; however, it
suffices to say that complex data combinations can be achieved by nesting record types
within other record types – data sets as members of other data sets. If this were possible in a
relational database it would be like having tables within tables within tables.
The earliest work on a network data model was carried out by Charles Bachman in 1961
while working for General Electric. His work resulted in the first commercial DBMS called
Integrated Data Store (IDS), which ran on General Electric mainframes. The system was cumbersome and
was eventually redeveloped by an IDS customer, BF Goodrich Chemical Company into what
was called IDMS (Hayes, 2002). With Bachman on board as a consultant, IDMS was
eventually commercialised by Cullinane/Cullinet Software in the 1980’s. Cullinet was bought
by Computer Associates (CA) in 1989. IDMS is a current offering by CA for mainframe
database management systems today. Charles Bachman received the Turing Award in 1973
for his pioneering work in developing the first commercially available data management
system, for being one of the founders of CODASYL and for his work on representation
methods for data structures (Canning in Bachman, 1973).
The 1970’s
The Adabas DBMS was developed in the early 1970s by Software AG. It has an interesting feature of
relevance to this dissertation. Adabas was designed to run on mainframes for enterprises with
large data sets and requiring fast response times for multiple users. One of its main features is
that it indexes data using inverted-list type indexing.
Adabas also features a data storage address convertor which avoids data fragmentation. Data
fragmentation can occur when a record is updated with additional data. The record is now too
large to be stored in the original location. The data can be moved to a new location but the
indexes still expect the data to be in the same place so they also have to be updated. The
address convertor does this. The alternative as used by other systems is data fragmentation;
part of the data is stored in the original location with a pointer to where the remainder is
stored. Fragmentation and pointer methods, however, require additional processing and hence
slower response times. The problem of using pointers in systems predating the RDBMS, instead
of storing data directly (in tuples as is done in RDBMS) is referred to by IBM’s Irv Traiger
(in McJones, 1997 pp. 16-17).
According to Curt Monash, Adabas’ inverted-list indexing is the favoured method for
searching textual content. New ideas regarding the management of text (unstructured data)
have, according to Monash, “at least the potential of being retrofitted to ADABAS, should the
payoff be sufficiently high” (Monash, Dec 8 2007).
Edgar Codd and the birth of the Relational Model
Codd’s text ‘The Relational Model for Database Management’ of 1990 (version 2, 1991)
brings together the ideas set out in his previous papers regarding the relational model for
managing databases. In it he places his model as solidly based on two areas of mathematics:
Predicate Logic and Relational Theory. In order for the maths to work effectively, there are
four essential concepts associated with the relational model: domains, primary keys, foreign
keys and no duplicate rows. In particular, the importance of domains has not been
understood fully or adopted by later commercial versions of his RDBMS (Codd, 1991, p. 18).
Also, two early prototypes (IBM’s System R, and Michael Stonebraker’s INGRES at the
University of California, Berkeley) were not concerned with the need to address the issue of
duplicated rows. The designers of both systems felt that the additional processing required to
eliminate duplicate rows was unnecessary given their relatively benign presence
(Codd, 1991, p. 18). Codd’s purer model based on mathematical principles gave way to
the more pragmatic needs of the commercial world.
2.2.2 Main Features of ‘true’ RDBMS
The main features of a Relational DBMS as proposed by Codd distinguish a ‘true’
Relational DBMS from other DBMS’. Based on his earlier paper setting out his 12 rules
(1985), they are summarised as follows:
• Database information is values only and ordering is not essential (meta data, while
required, should not be of concern to the everyday user; pointers are not used).
• Data management is not dependent on position within the structure (contrast with the
Hierarchical and Network models).
• Duplicate rows are not allowed.
• Information should be capable of being moved without impact on the user.
• Three level architecture of the RDBMS – base relations, storage, views (derived
tables).
• Declarations of domains as extended data types.
• Column descriptions should reflect the domain they belong to (i.e. a good naming
convention).
• Each base relation (R-Table) should have one and only one primary key column,
where null value entries are not allowed.
• RDBMS must allow one or more columns to be assigned as foreign keys.
• Relationships are based on comparing values from common domains.
This last point is crucial to understanding Codd’s intention. Only values from common
domains can be properly compared – currency with currency, euro with euro, date with date,
integer with integer etc. The basis for this lies with the nature of the mathematical operators
used in the system. Consistency of data types and strict rules are therefore vital for the
effective operation of the system. Herein lies one of the difficulties presented to designers of
commercial versions of Codd’s RDBMS. Users of data management systems are presented
with real world scenarios where consistency is not always practical. It would be ridiculous to
ask members of a social networking site to use standard forms for communicating so that the
DBMS could store the relevant information appropriately. Even closer to the relational
database world a transaction record could be created for a person called William Thomas as
follows:
Instance Surname Forename Address DOB ID Order No
1 Thomas William 22, Greenview Street 12/06/1945 1234 104
2 Thomas Bill 22 Greenview St. 12/06/1945 1365 104
3 Thomas William H. 22, Greenview Street 12/06/1945 3456 104
Table 2.2 – Example of redundant rows in a database
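The effect in Table 2.2 can be reproduced in a short sketch; this is our own illustration, and the ad hoc normalisation rules at the end are deliberately naive, not a proposed technique.

```python
# Why the database treats the three Table 2.2 rows as distinct: row
# equality is value equality, and the values differ even though the
# person is the same.

rows = [
    {"surname": "Thomas", "forename": "William",
     "address": "22, Greenview Street", "dob": "12/06/1945", "id": 1234},
    {"surname": "Thomas", "forename": "Bill",
     "address": "22 Greenview St.", "dob": "12/06/1945", "id": 1365},
    {"surname": "Thomas", "forename": "William H.",
     "address": "22, Greenview Street", "dob": "12/06/1945", "id": 3456},
]

# Value-by-value comparison: no two rows are equal, so all three survive.
distinct = {tuple(sorted(r.items())) for r in rows}
print(len(distinct))  # 3

# Only by normalising the values first (a human judgement the relational
# model cannot make for us) do the rows collapse to one person.
def normalise(r):
    address = r["address"].lower().replace(",", "").replace("st.", "street")
    return (r["surname"].lower(), r["dob"], address)

print(len({normalise(r) for r in rows}))  # 1
```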
As can be seen in this simple example above, the database treats these as three distinct and
unique records, even though the intention is that only one record for this person should exist.
The result impacts on the size, processing speed and integrity of the system. Techniques to
address such problems (primarily data normalisation) were developed almost from the
beginning, in the early 1970’s by Codd and later by Raymond Boyce and Codd (Elmasri and
Navathe, 1989, p. 371). Database normalisation is beyond the scope of this dissertation;
however, the salient point (and the reason for our initial hypothesis) is that the nature and
amount of unstructured data flowing in the electronic ether has pushed the RDBMS and its
associated control and optimisation processes to the limits of their capabilities.
Debasish Ghosh of Anshin Software, while advocating the merits of non-relational models,
nevertheless puts it fairly:
“A relational data management system (RDBMS) engine is the right tool for handling
relational data used in transactions requiring atomicity, consistency, isolation, and
durability (ACID). However, an RDBMS isn’t an ideal platform for modelling
complicated social data networks that involve huge volumes, network partitioning,
and replication”. (Ghosh, 2010)
The above discussion is intended to provide an important distinction between Edgar Codd’s
original theory of a relational data management system and subsequent versions developed
for the commercial enterprise market (mainframe computer market at that time). The
importance of the mathematical principles (Relational Algebra and Calculus) behind Codd’s
ideas is not underestimated, nor are the associated operations based upon those principles; in
fact, they are key to understanding why Codd at the time persisted in pushing for a full and
true implementation of his model, and it may also explain why he stepped back from the
first experiments in commercialising his ideas (Chamberlin and Blasgen in McJones, 1997 p.
13). Brevity here forces us to move on to look at two of the earliest commercial versions of the
RDBMS that, by no accident, are also the two market leaders today.
As an aside, Appendix 1 presents a useful comparison of the key terms from Codd’s original
intended meaning and their relationship to other systems.
2.2.3 IBM, Ellison and the University of California, Berkeley
IBM
One artefact cited several times in this section on the history of data management systems is a
transcript from a reunion meeting in 1995 of some of the original IBM research employees,
who during the 1970s and 1980s were at the coal face of data management development. The
article edited by Paul McJones is entitled “The 1995 SQL Reunion: People, Projects, and
Politics” (McJones, 1997). At first, what seems like the convivial reminiscences of middle-
aged ex-IBM colleagues in fact turns out to be a rather more interesting illumination of the
context around the timelines for the development of some of the most important ideas to
emerge, as well as the historically important players and products from the realm of database
management. Some of the key people attending the reunion and contributing to the discussion
are: Donald Chamberlin, Jim Gray, Raymond Lorie, Gianfranco Putzolu, Patricia Selinger,
and Irving Traiger. All are IBM and ACM Fellows and award winners for their work. Jim
Gray, fellow Berkeley graduate and mentor to Michael Stonebraker, was given the ACM
Turing Award in 1998 for his work on transaction processing (ACID) (Stonebraker, 2008).
Patricia Selinger was awarded the ACM Edgar Codd Innovation Award for her work in query
optimisation. Their contributions were vital to the features of the commercial RDBMS which
have ensured its longevity thus far, and possibly for many years yet.
IBM and System R
Midway through the 1970s IBM’s San Jose based research lab began working on a project
called System R. Like many IBM research projects at the time it came out of different task
groups working on related areas such as data language, data storage, optimisation, concurrent
users, and system recovery. System R was relational based and combined work from various
groups. System R, as a commercial RDBMS, was installed at the Pratt & Whitney Aircraft
Company in Hartford, Connecticut in 1977, where it was used for inventory control. However,
IBM was not yet interested in releasing it as fully featured product. At that time the big IBM
cash cow was IMS (its mainframe hierarchical model DBMS mentioned earlier), and the
research focus was on a project called Eagle, a replacement for IMS with all the new
features of recent discoveries. With the pressure off, the System R developers plugged away,
aiming it towards the lower midrange product line (Jolls in McJones, 1997, p. 31). Two
things happened at the time which resulted in the focus coming back on System R and getting
it ready for market (McJones, 1997, pp. 33-34). Firstly, IBM was starting to lose ground to
new mini computers (Gray in McJones, 1997, p. 20) and secondly the Eagle project was
hitting a wall. System R unlike Eagle was relational and already pitched towards the smaller
computer range. The System R star did not shine for long and it was replaced by DB2 with
Release 1 in 1980. IBM fully embraced relational DBMS with Release 2 around 1985 (Miller
in McJones 1997, p. 43). DB2 is IBM’s current offering and is mentioned again under the
section on the RDBMS market.
The Birth of SQL
In and around the same time that System R was being developed, the language research team
at IBM, Relational Data Systems (RDS), took on Codd’s two mathematically based languages
for data management: relational algebra and relational calculus. By their own admission they
found these mathematical notations too abstract and complex for general use. They developed
a notation which they called SQUARE (Specifying Queries as Relational Expressions)
(Chamberlin in McJones, 1997, p. 11).
SQUARE had some odd subscripts so a regular keyboard could not be used. RDS further
developed it to be closer to common English words. They called the new version Structured
English Query Language or SEQUEL. The intention was to make interaction with databases
easier for non-programmers. However its biggest impact came later when Larry Ellison (co-
founder and CEO of Oracle) read the IBM published papers on SEQUEL and realised that
this query language could act as an intermediary between different systems (Chamberlin in
McJones, 1997, p. 15). It was the RDS team at IBM who renamed it SQL following a
trademark challenge to the term SEQUEL from an aircraft company (McJones, 1997, p.
20).
INGRES
In parallel with the work going on at IBM, the University of California at Berkeley had a
project developing a system called INGRES (short for Interactive Graphics Retrieval
System). Michael Stonebraker, who was at Berkeley in 1972, was developing a query language
called QUEL. Stonebraker knew fellow Berkeley graduates at IBM San Jose and, more
importantly, knew of their work. INGRES used QUEL, whereas IBM and Larry Ellison’s
project at Software Development Laboratories (later Oracle) used SQL. Subsequent
offspring of the INGRES family are Sybase and Postgres (post-Ingres). Incidentally, Microsoft
struck a deal with Sybase to use their code for their new extended operating system.
Although the Sybase people were brought up in the QUEL tradition under Stonebraker,
Microsoft preferred SQL. They eventually fell out, and Microsoft, who now owned the Sybase
code, ended up developing Microsoft SQL Server (Gray in McJones, 1997 p. 56).
Oracle
In 1977 Larry Ellison, Bob Miner and Ed Oates founded Software Development Laboratories
(SDL), the precursor to Oracle Corporation. SDL based its system on a technical paper in an
IBM journal (Oracle History, 2011). That was Edgar Codd’s 1970 seminal paper setting out
his model for a RDBMS (Traiger in McJones, 1997). SDL’s first contract was to
develop a database management system for the Central Intelligence Agency (CIA); the
project was called ‘Oracle’. SDL finished that project a year early and used the time to
develop a commercial RDBMS, bringing together the work done by IBM research on relational
databases and, as mentioned above, on the query language SEQUEL.
While Ellison and SDL benefited from the work done at IBM, they still had to do
all the coding. The resulting product was faster and a lot smaller than IBM’s System R. The
first officially released version of Oracle was version 2 in 1979.
Brad Wade jokes about Edgar Codd’s influence on Oracle - on Codd being made an IBM
Fellow in 1976, “It’s the first time that I recall of someone being made an IBM Fellow for
someone else’s product” (Wade in McJones, 1997, p. 49).
It appears that many new enterprises sprang from the well of knowledge existing at IBM
during the 1970’s and 1980’s. Had the IBM research units not had so much talent, or not
allowed publication of key papers at the time, the database world might look very different
today. IBM did not patent its software, and indeed software patents were not upheld by the
US Supreme Court until 1980 (Bocchino, 1995). According to Franco Putzolu, IBM Research,
up until 1979, was “publishing everything that would come to mind” (in McJones, 1997, p. 16).
Mike Blasgen argues that the outside interest in the published research was one reason why
the corporate machine of IBM began to notice some of the lesser research projects (in
McJones, 1997 p. 16).
It is hoped that the above overview gives the reader some understanding of the related threads
that developed out of Charles Bachman’s initial work on data management systems, through
IBM via Edgar Codd and out into the wide world via IBM research department’s open
attitude to sharing knowledge, from which Larry Ellison’s Oracle benefited greatly. Berkeley
also played its role in providing a common alma mater for young, enthusiastic developers
to discuss ideas. It is an interesting irony that when we think of ‘open source’ we envision a
recent phenomenon; however, IBM during the 1970’s would appear to have been a little
more open, for whatever reasons, than is usually credited to them.
2.3 New Databases
This section will explore the development of new DB’s that have emerged on the database
market over the past decade, and what impact these DB’s will have on the general database
market as a whole.
What are ‘New DB’s’?
Traditional databases rely on a relational model in order to function. That is, they follow a set
of rigid rules to ensure the integrity of the data in the database. Most RDBMS models follow
the set of rules, originally outlined by Edgar Codd (1970).
New NoSQL database models do not follow all of the rules set down by Codd. While
RDBMS models follow the set of properties called ACID, as previously stated, NoSQL
database models do not. Instead they follow alternative sets of properties, such as BASE
(Basically Available, Soft state, Eventual consistency) (Cattell, 2011), shaped by the
trade-offs of CAP (Consistency, Availability and Partition tolerance).
Why the development of NoSQL model databases?
The development of NoSQL databases was a result of the evolution of the World Wide Web,
and the desire of individuals and organizations to generate data, and large amounts of
it (White, 2010, p. 2). Having collected that data, organizations then had to extract value
from it in order to be successful in whatever field they participated.
The problem organizations faced in extracting value from that data was twofold:
1. As storage capacities increased, the means of transferring the data to the drive(s) did
not keep up. Twenty years ago, a hard drive could store 1.3 GB of data and transfer it at
about 4.4 MB per second, so reading the entire drive took around five minutes. Today,
1 TB hard drives are the norm, but transfer
speeds are only about 100 MB per second, so reading a full drive takes over two and a half
hours: relative to capacity, a slowdown by a factor of roughly 30 (White, 2010, p. 3).
A means of getting around this bottleneck was the introduction of disk arrays,
whereby data could be written to and read from multiple disks in parallel. The drawback to
this was the possibility of hardware failure, whereby a disk or machine would fail and
the data lost (White, 2010, p. 3). Redundancy (various options of RAID being the
most famous examples) solved some of these problems but not all (Patterson, 1988).
2. The second problem is that with multiple disks, relational database models, with their
inbuilt consistency requirements, are unable to access data quickly enough when the
data is spread across multiple disk drives. RDBMS systems may not be able to allow a
query to access certain data if that data is already in use by another program or user
(Chamberlin, 1976).
2.3.1 Features of NoSQL Databases
In order for a database to be considered a NoSQL database, it first must not comply with the
entirety of the ACID properties. The features that define NoSQL databases include
Scalability, Eventual Consistency and Low Latency (Dimitrov, 2010). A key feature of
NoSQL databases is a “shared-nothing” architecture. This means databases can replicate and
partition data across multiple servers. In turn, this allows the databases to support a large
number of simple read/write operations per second (Cattell, 2011).
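The ‘shared-nothing’ partitioning and replication just described can be sketched in a few lines; the server names, hash function and replication factor below are our own illustrative assumptions, not any particular product’s design.

```python
# Shared-nothing sketch: keys are hashed to a home node, and each
# write is copied to REPLICAS nodes; no node shares disk or memory.
import hashlib

SERVERS = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2

def owners(key):
    """The servers responsible for a key: the hashed home node plus
    the next REPLICAS-1 nodes around the ring."""
    home = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(SERVERS)
    return [SERVERS[(home + i) % len(SERVERS)] for i in range(REPLICAS)]

stores = {s: {} for s in SERVERS}  # each node's private storage

def put(key, value):
    for server in owners(key):       # write lands on every replica;
        stores[server][key] = value  # nothing is shared between nodes

def get(key):
    return stores[owners(key)[0]].get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # {'name': 'Ada'}
```

Because each key touches only its own small set of nodes, many simple reads and writes can proceed in parallel across the cluster.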
Scalability
With traditional RDBMS systems, a database was usually required to scale up, that is, switch
over to a newer, larger-capacity machine, if the database was to expand capacity (Cattell, 2011).
One of the features designed into some NoSQL databases is their ability to scale to large data
volumes without losing the integrity of the data. With NoSQL, as systems are required to
expand with an influx of additional data, they scale out by adding more machines to the data
cluster. With this scaling, NoSQL systems can process data at a faster speed than RDBMS, as
they are capable of spreading the workload of the processing over numerous machines
(Cattell, 2011).
Eventual Consistency
Eventual Consistency was pioneered by Amazon using the Dynamo database. The purpose of
its introduction was to ensure High Availability (HA) and scalability of the data. Ultimately,
data that is fetched for a query is not guaranteed to be up-to-date, but all updates to the data
are guaranteed to be propagated to all copies of the data on all nodes of the cluster eventually
(Cattell, 2011).
This ensures that databases are accessible to programs and individuals who wish to read or
modify data, without the constraint of being locked out of a database or data field while the
data is being updated or read, as is the case with RDBMS database models.
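A toy illustration of eventual consistency (our own sketch, not Amazon’s actual Dynamo protocol): a write is acknowledged once it reaches one replica and is propagated to the others later, so a read in between may see stale data.

```python
# Eventual consistency sketch: one replica is updated immediately,
# the rest 'eventually' via a propagation queue.
from collections import deque

replicas = [{}, {}, {}]
pending = deque()  # stands in for asynchronous replication traffic

def write(key, value):
    replicas[0][key] = value      # acknowledged immediately
    pending.append((key, value))  # other copies updated later

def propagate():
    """Deliver queued updates to the remaining replicas."""
    while pending:
        key, value = pending.popleft()
        for r in replicas[1:]:
            r[key] = value

write("stock", 99)
print(replicas[2].get("stock"))  # None: replica 2 is still stale
propagate()
print(replicas[2].get("stock"))  # 99: all copies converge eventually
```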
Low Latency
Latency is an element of the speed of a network. It refers to any number of delays that
typically occur in the processing of data (Mitchell, no date). In the case of NoSQL databases,
it means that queries can access the data and return answers more quickly than RDBMS
because the data is distributed across multiple nodes of a cluster, instead of one machine.
This results in a faster response time. Causes of high latency in traditional RDBMS model
databases include the seek time of hard disks (Mitchell, no date), the speed of the network
cables connecting the machines, and poorly written queries (Stevens, 2004;
Souders, 2009).
NoSQL database models
Unlike RDBMS models, NoSQL data models vary considerably from system to system. For
storage purposes, NoSQL databases fall into a number of data model categories, which are
listed below:
Key-value Stores
Databases that have this model use a single key-value index for all the data. These systems
provide persistence mechanisms as well as additional functions such as replication, locking,
transactions and sorting. NoSQL databases such as Voldemort and Riak use Multi-Version
Concurrency Control (MVCC) for updates. They update data asynchronously, so they cannot
guarantee consistent data (Cattell, 2011).
Key-value store databases can support traditional SQL-style functionality, such as delete,
insert and lookup operations (Cattell, 2011).
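These operations can be sketched with a toy key-value store (our own illustration, not any product’s API), showing the single key-value index and the three operations mentioned:

```python
# A minimal key-value store: one index, three operations.
class KVStore:
    def __init__(self):
        self._index = {}  # the single key-value index for all data

    def insert(self, key, value):
        self._index[key] = value

    def lookup(self, key):
        return self._index.get(key)  # None if the key is absent

    def delete(self, key):
        self._index.pop(key, None)

store = KVStore()
store.insert("session:9f2", {"user": "efitzpatrick"})
print(store.lookup("session:9f2"))  # {'user': 'efitzpatrick'}
store.delete("session:9f2")
print(store.lookup("session:9f2"))  # None
```

Real systems layer persistence, replication, locking and sorting on top of this basic index.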
Document Stores
This model supports more complex data than key-value stores. Document stores can support
secondary indexes and multiple types of documents per database. Databases using
this model include Amazon’s SimpleDB and CouchDB.
Document Store databases provide a querying mechanism for the data they contain using
multiple attribute values and constraints (Cattell, 2011).
Extensible Record Stores
Influenced by Google’s Bigtable, Extensible Record Store databases consist of rows and
columns, which are scaled across multiple nodes. Rows are split across nodes by ‘sharding’
the primary key. This means that querying a range of values does not have to go to every
node. Columns are distributed over multiple nodes by using ‘column groups’. These allow the
database customer to specify which columns are best stored together, with the added
advantage that queries run faster, as the most relevant data for a query is
likely to be close at hand: e.g., name and address (Cattell, 2011).
The most famous examples of an Extensible Record Store database available, save Google’s
proprietary Bigtable, are HBase and Cassandra. Additional databases that use the model are
Hypertable, sponsored by Baidu (Hypertable, 2011), and Yahoo’s PNUTS (Yahoo Research, 2011).
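The ‘sharding’ of a primary key by range can be sketched as follows; the shard boundaries and node names are our own illustrative assumptions. The point is that a range query only visits the nodes whose key ranges overlap it, not every node.

```python
# Range-sharding sketch: each node owns a contiguous slice of the
# primary-key space.
SHARDS = [  # (low key inclusive, high key exclusive, node)
    ("a", "h", "node-1"),
    ("h", "p", "node-2"),
    ("p", "{", "node-3"),  # '{' is the character just after 'z'
]

def nodes_for_range(lo, hi):
    """Nodes holding any primary key in [lo, hi)."""
    return [node for low, high, node in SHARDS if low < hi and lo < high]

# A query over surnames 'c'..'f' touches a single node:
print(nodes_for_range("c", "f"))  # ['node-1']
# A wider range spans two nodes, but still not all three:
print(nodes_for_range("f", "k"))  # ['node-1', 'node-2']
```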
Graph Databases
A graph database maintains one single structure – a graph (Rodriguez, 2010). A graph is a
flexible data structure that allows for a more agile and rapid style of development (Neo4J,
2011).
A graph database has three main attributes:
1. Node – an entity or item of data (a vertex in the graph)
2. Relationship – a labelled connection which determines which data, in the same or
another node, the original data is related to.
3. Property – an attribute of a node or relationship. (Neubauer, 2010)
The purpose of graph databases is to quickly determine the relationships between different
items of data. Examples of graph databases include the Neo4j database and Twitter’s
FlockDB, which is used to join up the tweets between those who post them and all of their
followers (Weil, 2010).
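A minimal property-graph sketch (our own illustration, not Neo4j’s or FlockDB’s API) shows the three attributes above and answers a FlockDB-style question: who follows whom. The user names are hypothetical.

```python
# Property-graph sketch: nodes carry properties; relationships are
# labelled (from, label, to) edges between nodes.
nodes = {
    "alice": {"kind": "user"},  # properties live on the node
    "bob":   {"kind": "user"},
    "carol": {"kind": "user"},
}
relationships = [  # (from, label, to)
    ("bob",   "follows", "alice"),
    ("carol", "follows", "alice"),
    ("alice", "follows", "carol"),
]

def followers(user):
    """Everyone with a 'follows' relationship pointing at `user`."""
    return sorted(src for src, label, dst in relationships
                  if label == "follows" and dst == user)

print(followers("alice"))  # ['bob', 'carol']
```

Traversing relationships directly, rather than joining tables, is what makes this kind of query cheap in a graph database.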
2.3.2 Hadoop
Hadoop/MapReduce
Hadoop is a distributed data storage and processing framework originally developed by Doug
Cutting, latterly at Yahoo! (White, 2010, p. 9), modelled on infrastructure that Google had
described in published papers (Apache, 2011).
Throughout its short history, developers have added components that allow Hadoop to
process the data that it collects more efficiently.
Hadoop contains a number of components that allow the system to scale to large clusters of
machines, without impacting the overall integrity of the data stored on those machines. The
main component of Hadoop is MapReduce.
MapReduce is a framework for processing large datasets that are distributed across multiple
nodes/servers. The ‘map’ part of the framework takes the original input data, partitions
it, and distributes it to different nodes. The individual nodes can then, if
necessary, redistribute the data again to other sub-nodes. MapReduce then applies the map
function in parallel to every item in the dataset, producing a list of key-value pairs
(White, 2010, p. 19). The ‘reduce’ part of the framework then collects all of the common
keys, aggregates their values, and returns a single output per key. The reduce
function, in effect, removes duplication within the system, allowing queries to return results
more speedily (White, 2010, p. 19).
Hadoop is designed for distributed data, with a dataset split between multiple nodes, if
necessary. If MapReduce must query data that is located on multiple nodes, then the map
function will map all the data for the query that is located on a single node, and return the
result. It will do the same query on all nodes that the relevant data is located on. The reduce
function will then take all those map results and reduce them down to single values, again to
return the query result(s) (White, 2010, p. 31).
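The map and reduce phases above can be sketched as a word count; this is an in-process illustration of the framework’s flow (ours, not Hadoop’s actual Java API), with each ‘split’ standing in for a node’s share of the data.

```python
# MapReduce flow as a word count: map emits (key, value) pairs,
# reduce merges common keys into single outputs.
from collections import defaultdict

def map_phase(document):
    """Emit a (key, value) pair for every word: the 'list of pairs'."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Collect common keys and aggregate their values into one output."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Each 'node' maps its own split of the data...
splits = ["the oil price rose", "the price fell"]
pairs = [pair for split in splits for pair in map_phase(split)]
# ...and reduce merges the per-node results into single values per key.
print(reduce_phase(pairs))
# {'the': 2, 'oil': 1, 'price': 2, 'rose': 1, 'fell': 1}
```

Neither function needs to know how large the dataset is or how many nodes it spans, which is what lets the same job scale from one machine to a cluster.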
Both functions are oblivious to the size of the dataset that they are working on. As such, they
can remain the same irrespective of the size of the dataset, large or small. Additionally, if you
double the input data, a job will take twice as long; however, if you also double the size of the
cluster, the job will run as fast as the original one (White, 2010, p. 6).
HDFS
HDFS is the file system that allows Hadoop to distribute data across multiple
nodes/machines. HDFS stores data in blocks, in a similar fashion to other file systems.
However, while other file systems use small blocks, HDFS by default uses large blocks
(64 MB). This reduces the number of seeks that Hadoop must make in order to answer a
query, speeding up the process (White, 2010, p. 43).
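The block-size argument can be made concrete with some back-of-the-envelope arithmetic
(the 10 ms seek time and 100 MB/s transfer rate below are illustrative assumptions, not
measurements of any particular system):

```python
SEEK_TIME_S = 0.010        # assumed disk seek time: 10 ms
TRANSFER_RATE_BPS = 100e6  # assumed sustained transfer rate: 100 MB/s

def seek_overhead(block_size_bytes):
    """Fraction of total read time spent seeking rather than transferring."""
    transfer_time = block_size_bytes / TRANSFER_RATE_BPS
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time)

# A small 4 KB block spends almost all of its read time seeking...
small_block_overhead = seek_overhead(4 * 1024)
# ...while a 64 MB HDFS-style block makes the seek cost negligible.
large_block_overhead = seek_overhead(64 * 1024 * 1024)
```

Under these assumptions the 4 KB block wastes over 99% of its read time on the seek,
while the 64 MB block wastes under 2%, which is the efficiency gain the large default
block size is designed to capture.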
2.3.2.1 Components of Hadoop
HBase
Based on Google’s Bigtable, HBase was developed by Chad Walters and Jim Kellerman at
Powerset. The purpose of the development of HBase was to give Hadoop a means of storing
large quantities of fault-tolerant data. It can also sit on top of Amazon’s Simple Storage
Service (S3) (Wilson, 2009). HBase was developed from the ground up to allow databases to
scale just by adding more nodes – machines – to the cluster that HBase/Hadoop is installed
on. As it does not support SQL, it can do what an RDBMS cannot: host data in sparsely
populated tables located on clusters built from commodity hardware (White, 2010,
p. 411). The structure of HBase is designed around a ‘master node’, which controls any
number of ‘slave nodes’, called Region Servers. The master node is responsible for assigning
regions of the data to the region servers, as well as being responsible for the recovery of data
in the event of a region server failing (White, 2010, p. 413). In addition to this setup, HBase
is designed with fault tolerance built in – HBase, thanks to HDFS, creates three different
copies of the data spread across different data nodes (Dimitrov, 2010).
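The sparsely populated, versioned tables described above can be pictured as a map from
(row, column) to a list of timestamped values. The following minimal Python model is
purely illustrative; it mimics HBase's data model, not its actual implementation:

```python
class SparseTable:
    """Toy model of HBase's data model: a row holds only the cells that
    actually exist, so missing columns consume no storage at all, and
    each cell keeps multiple timestamped versions."""

    def __init__(self):
        self.cells = {}  # (row_key, column) -> list of (timestamp, value)

    def put(self, row_key, column, value, ts):
        versions = self.cells.setdefault((row_key, column), [])
        versions.append((ts, value))

    def get(self, row_key, column):
        versions = self.cells.get((row_key, column))
        if not versions:
            return None  # absent cell: nothing was ever stored for it
        # The newest timestamp wins, mirroring HBase's versioned reads.
        return max(versions, key=lambda tv: tv[0])[1]

table = SparseTable()
table.put("row1", "info:name", "Ada", ts=1)
table.put("row1", "info:name", "Ada Lovelace", ts=2)  # newer version
table.put("row2", "info:email", "ada@example.com", ts=1)  # row2 has no name
```

Note that `row2` never stores an `info:name` cell at all; in an RDBMS the same data would
occupy a NULL column in every row, which is the sparseness advantage described above.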
Hive
Hive is a scalable data processing platform developed by Jeff Hammerbacher at Facebook
(White, 2010, p. 365). The purpose of Hive is to allow individuals who have strong SQL
skills to run queries on data that is stored in HDFS.
When querying the dataset, Hive first tries to convert SQL queries into MapReduce jobs, as
well as custom commands that allow it to target different partitions within the HDFS dataset,
allowing users to query specific data within the Hadoop cluster (White, 2010, p. 514). This
allows Hive to provide users with a traditional query model from older RDBMS
environments within the newer distributed NoSQL database environments.
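The kind of translation Hive performs can be illustrated by hand-compiling a simple
aggregate query into map and reduce functions (a conceptual sketch only; Hive's real query
planner is far more sophisticated, and the table and column names here are invented):

```python
from collections import defaultdict

# Query to 'compile': SELECT country, COUNT(*) FROM users GROUP BY country
rows = [
    {"name": "a", "country": "IE"},
    {"name": "b", "country": "IE"},
    {"name": "c", "country": "US"},
]

def map_fn(row):
    # The GROUP BY column becomes the key; COUNT(*) contributes 1 per row.
    yield (row["country"], 1)

def reduce_fn(key, values):
    return (key, sum(values))

# Shuffle: gather map output by key, then reduce each group.
grouped = defaultdict(list)
for row in rows:
    for key, value in map_fn(row):
        grouped[key].append(value)
query_result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
```

Hive automates exactly this kind of rewrite, which is what lets SQL-literate analysts query
HDFS data without writing MapReduce jobs by hand.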
2.3.3 Cassandra
Cassandra is a fault-tolerant, decentralised database that can be scaled and distributed across
multiple nodes (Apache, 2011; Lakshman, 2008). Developed by Avinash Lakshman at
Facebook (Lakshman, 2008), Cassandra is now an open source project run by the Apache
Foundation (Apache, 2011).
Initially designed to solve a search indexing problem, Cassandra was designed to scale to
very large sizes across multiple commodity servers. Additionally, the ability to have no single
point of failure was built into the system (Lakshman, 2008). Since Cassandra was designed to
scale across multiple servers, it had to overcome the possibility of failure at any given
location within each server, such as the possibility of a drive failure.
To guard against such a possibility, Cassandra was developed with the following functions:
Replication
Cassandra replicates data across different nodes when it is written to. When data is
requested, the system accesses the closest node that contains the data. This ensures
that data stored using Cassandra maintains High-Availability (HA), one of the core
attributes of a NoSQL database. Once data is written to a server, a duplicate copy of
the data is then written to another node within the database (Lakshman, 2008).
Eventual Consistency
Cassandra uses BASE to determine the consistency of the database. In order for data
to be accessible to users, an individual who is reading the data accesses it on one
node. At the same time, another individual can be making changes to another copy of
the data on another node. As the data is replicated, newer versions of the data are
sitting on one node, while older versions are still active on other nodes (Apache wiki,
2011).
Users of Cassandra can also determine the level of consistency, allowing writes to add
or edit data to a single copy of the data in a node, or, if possible, to write to all copies
of the data across all nodes (Apache wiki, 2011).
Scalability
Data that is stored on Cassandra is scalable across multiple machines. Such elasticity
is possible because Cassandra allows the adding of additional machines to the cluster
when required (Apache, 2011).
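The replication and tunable-consistency behaviour described above can be sketched as a toy
simulation (an illustrative Python model only; real Cassandra adds partitioning, hinted
handoff and read repair, and the node-selection scheme below is invented for the sketch):

```python
class ToyCluster:
    """Toy model of Cassandra-style replication with tunable consistency."""

    def __init__(self, num_nodes, replication_factor):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.rf = replication_factor

    def replicas_for(self, key):
        # Invented placement: hash the key, then take rf consecutive nodes.
        first = hash(key) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.rf)]

    def write(self, key, value, consistency="ALL"):
        replicas = self.replicas_for(key)
        # Consistency ONE acknowledges after a single replica is written;
        # the remaining replicas would catch up asynchronously (not
        # modelled here, so they simply stay stale).
        targets = replicas[:1] if consistency == "ONE" else replicas
        for n in targets:
            self.nodes[n][key] = value

    def read(self, key):
        # Read from the first replica that holds the key.
        for n in self.replicas_for(key):
            if key in self.nodes[n]:
                return self.nodes[n][key]
        return None

cluster = ToyCluster(num_nodes=5, replication_factor=3)
cluster.write("user:1", "v1", consistency="ONE")  # fast, replicas lag
cluster.write("user:2", "v2", consistency="ALL")  # every replica updated
```

Writing at consistency ONE returns quickly but leaves the other replicas stale until they
catch up, which is exactly the eventual-consistency trade-off described above.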
2.4 The market for RDBMSs and Non-Relational DBMSs
2.4.1 Introduction
This section gives an overview of the current market for both relational databases and
newer non-relational databases. It investigates traditional vendor database offerings as well
as the proliferation over the past few years of a number of community-developed open
source database offerings.
The literature review for determining the current market for both traditional relational
databases and ‘future’ non-relational databases utilised a variety of sources, including
Internet search queries to find relevant research material, as well as the University of
Dublin (DU) library facilities to access academic and commercial research to which DU
has access.
2.4.2 RDBMS Market
Today, many executives want business to grow based on data-driven decisions. As such,
analytics of data has become a valuable tool in Business Intelligence (BI). Many of the top
performing companies use analytics to formulate future strategies and guide them on the
implementation of day-to-day operations (LaValle et al, 2010). However, organisations are
gaining more and more data without the means of extracting value from that data (LaValle,
Hopkins, et al). This has resulted in a requirement for the adoption by companies of
enterprise solutions that can give an overview of the data being generated, using Online
Analytical Processing (OLAP) databases.
The Database Management Systems market is split into two segments: OnLine Transaction
Processing (OLTP) and OLAP / Data Warehousing (DW). The RDBMS options available
from vendors in the market will generally target one of these two segments.
The OLTP market targets clients that require fast query processing, maintaining of data
integrity in multi-access environments and a business model that has data measured by the
number of transactions per second that the database can handle. In an OLTP model database,
there is an emphasis on detailed, current data, with the schema normalised to Boyce-Codd
Normal Form (BCNF) (Datawarehouse4u, 2009).
OLAP databases are characterised by a low volume of transactions, and are primarily
designed for data warehousing databases. As such, they are particularly useful for data
mining; whereby applications access the data to give an overview of current trends, business
performance and informational advantage. As such, OLAP databases are increasingly seen as
important for making Business Intelligence (BI) decisions (Feinberg, Beyer, 2010).
2.4.2.1 Vendor Offerings
Within the enterprise database market, the industry is dominated by a few big corporations
which include Oracle, IBM, Microsoft, Sybase and Teradata. Many of the database offerings
from these firms operate in the Data Warehousing sector, which contains most of the market
for enterprise database management systems. While the big players will have comprehensive
database offerings for their clients, the market is currently being disrupted by new entrants
who are targeting niche areas, either focusing on performance issues related to their
offerings, or on single-point offerings (Feinberg, Beyer, 2010).
Oracle
According to Gartner, Oracle is currently the No. 1 vendor of RDBMSs worldwide (Gartner
in Graham et al, 2010), with a 50% share of the market for the year 2010 (Trefis, 2011). They
are forecast to improve this figure to 60% by 2016, driven by sales of the Exadata
hardware platform. Leveraging the high-end Exadata servers in conjunction with
Oracle’s database software is estimated to result in more efficient and faster Online
Transaction Processing (Graham et al, 2010).
Currently, Oracle generates 86% of revenues from its database software portfolio, with 8%
from its hardware portfolio. The future strategy of the company is to have clients purchase
complete systems – hardware and software – thus leveraging the power of the Exadata system
to get the most out of Oracle’s database technology. The result will be an increase in Oracle’s
revenues and its market share (Crane et al, 2011).
IBM
IBM is one of the main vendors in the market, and is the only vendor that offers to its clients
an Information Architecture (IA) that spans all systems, which includes OLTP, DW, and
retirement of data (Optim tapes) (Henschen, 2011a). IBM’s main offering in the RDBMS
market is the DB2 database. DB2 runs on a number of platforms, including Unix, Linux and
Windows OS. DB2 can also run on the z/OS platform, where it is used to deploy applications
for SOA, CRM, DW and operational BI.
IBM’s RDBMS solutions are ranked No. 2 behind Oracle worldwide (Finkle, 2008);
however, it is slowly losing market share to Microsoft and Oracle due to uncompetitive
pricing for its database, as well as the greater functionality that can be found in rival
offerings.
Recently, IBM acquired Netezza (Evans, 2011), a company which provides a DW appliance
called TwinFin to clients. TwinFin is a purpose-built appliance that integrates servers, storage
and database into a single managed system (Netezza, 2011a). The reason IBM acquired
Netezza is the expected increase in revenues that Netezza will generate from its portfolio
(Dignan, 2010), as well as a lack of overlap in the customer base between IBM’s current
client list and that of Netezza (Henschen, 2011b). Additionally, the acquisition fits in with
IBM’s overall business analytics strategy, as IBM has marked BI as the key driver for IT
infrastructure needs (Gartner, 2010).
Microsoft
SQL Server from Microsoft is a complete database platform designed for applications of
various sizes. It can be deployed on normal servers as well as in the ‘cloud’, allowing
clients to scale SQL Server to their respective needs. Purely a software player, Microsoft
requires hardware partners to deploy its database offerings (Mackie, 2011).
Microsoft, however, finds itself more under threat from low-cost or ‘free’ open source
alternatives such as MySQL and PostgreSQL, due to operating primarily in the low-end
mid-market segment (Finkle, 2008). As such, if its clients look at alternative options, SQL
Server may not be priced competitively enough for Microsoft to compete with open source
RDBMSs.
SAP/Sybase
Sybase, recently acquired by SAP, has three main business areas: OLTP using the Sybase
ASE database, Analytic Technology using Sybase IQ, and, interestingly, Mobile Technology
(Monash, 2010). This deal was required by SAP as it was coming under increasing pressure
due to Oracle’s recent acquisition of Sun Microsystems, which gave Oracle a stronger focus
on integrated products based around databases, middleware and applications (Yuhanna,
2010).
The deal between SAP and Sybase gives both companies significant synergies. SAP finally
acquires an enterprise-class database in the form of Sybase IQ, which it can now offer to its
hundreds of client companies: a database with columnar storage and advanced compression
capabilities (Yuhanna, 2010).
The acquisition of Sybase also gives SAP a differentiator from its peers in the form of a
mobile offering. Sybase has a number of mobile products for enterprises, including the
Sybase Unwired Platform and iAnywhere Mobile Office suite. These technologies allow
companies to connect mobile devices to a number of back-end data sources (Sybase, 2011).
SAP now has the ability to offer its applications embedded in Sybase mobile platforms, using
the synergy between the two to improve its competitive advantage and expand to other
markets (Yuhanna, 2010). Indeed, efforts are now being made to cement Sybase’s lead in this
segment of the market, with an initiative to make the Android OS platform enterprise ready.
This involves porting Afaria, Sybase’s mobile device management and security solution, to
the Android platform (Neil, 2011). With the growth of Android now reaching 30% of the
smartphone market share in the United States (Warren, 2011), the future growth for Sybase in
the mobile enterprise market looks strong.
Finally, although a big player in the database market in the early 1990s (Greenbaum, 2010),
Sybase has been considered the fourth database vendor behind Oracle, Microsoft and IBM
for the past decade. The main market for its OLTP offering, Sybase ASE, has been the
financial services sector, with little penetration in other enterprise sectors. It is expected that
SAP will make Sybase ASE more cost effective and make another push in this segment of
the market, perhaps at the expense of the big three (Yuhanna, 2010).
Teradata
Teradata is a database vendor specialising in data warehousing and analytical applications
(Prickett Morgan, 2010). During the last year, it was considered the best placed amongst its
peers as a market leader in Data Warehousing (Feinberg, Beyer, 2011). This will be a hard
position for competitors to dislodge as products in the DW market are considered difficult to
replace (Bylund, 2011). Amongst its clients are multinational corporations such as 3M and
PayPal (Teradata, 2011).
One of Teradata’s products, the Teradata parallel database, designed for DW and OLAP
functions, has an update and support revenue stream, as well as additional functions that
customers are willing to pay for (Prickett Morgan, 2010).
However, Teradata specialises in a single area of the database market – DW and analytics
(Prickett Morgan, 2010). As such it is exposed to any weakness that may occur within that
segment of the market. The company recently acquired Aprimo, an enterprise marketing
firm with a strong emphasis on Marketing Research Management (MRM) and Campaign
Management (CM). CM is considered by some to be mission critical, as it allows marketers to
unlock the value of customer data to develop multi-channel communications. Such an
acquisition adds value to Teradata’s product portfolio, without competing with Teradata’s
current product range, allowing the company to diversify its offerings to clients and future
customers (Vittal, 2010).
EMC/Greenplum
Greenplum, a DW and Analytics firm acquired by EMC in 2010, is the foundation of EMC’s
Data Computing division. Greenplum specialises in DW in the ‘cloud’, through its Chorus
platform (Greenplum, 2011).
EMC’s strategy for gaining market share is to release a free community version of its
database for testing, with the intent that users eventually purchase a commercial licence. Its
recently released ‘free’ Community Edition database, a heavily customised version of
PostgreSQL, is targeted at companies and developers for whom Greenplum’s previous
offering was not useful for creating parallel databases for DW and Analytics (Prickett
Morgan, 2011). The purpose of the release is to allow developers to build and test Massive
Parallel Processing (MPP) databases. If clients who develop these systems wish to use the
software in a commercial environment, they will be required to purchase a licence for the
Greenplum Grade 4.0 database, EMC’s commercial DW offering (Kanaracus, 2011).
It is hoped by EMC that customers wishing to have greater functionality with Greenplum’s
database will upgrade to the Greenplum Grade 4.0 database (Kanaracus, 2011).
2.4.3 Non-RDBMS Market
Open Source Databases
There are a number of open source, community-developed database solutions available on
the market today. However, because these offerings are generally ‘free’, they do not rank
highly when databases in use are measured by revenue earned, even though the total number
of open source database deployments can rival that of the traditional vendors (Von Finck,
2009).
All RDBMS applications hold a consistency model that can be inflexible for certain
applications. The requirement for a record or table to be locked out from being viewed or
otherwise accessed while changes are being made slows down queries that are attempting to
generate results for end-users.
Additionally, due to atomicity and consistency, not all RDBMS applications are scalable to
the requirements of organisations that hold large quantities of data, such as Google and
Facebook.
With databases now deployed that have tables in excess of 10 TB, querying all of that data
requires speed and processing power that traditional RDBMS offerings cannot deliver to the
requirements of user companies. Newer non-relational database offerings designed to meet
these requirements usually come in two forms: MPP systems and column-store databases
(Henschen, 2010).
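The row-store/column-store distinction can be sketched in a few lines (an illustrative
comparison only, not a real storage engine):

```python
# The same three rows, stored row-wise and column-wise.
row_store = [
    {"id": 1, "region": "EU", "amount": 100},
    {"id": 2, "region": "US", "amount": 250},
    {"id": 3, "region": "EU", "amount": 50},
]
column_store = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [100, 250, 50],
}

# An analytical query such as SUM(amount) must read every field of every
# row in a row store, but only the single 'amount' column in a column
# store -- a large saving when a 10 TB table has many wide rows.
total_row_store = sum(r["amount"] for r in row_store)
total_column_store = sum(column_store["amount"])
```

Both layouts return the same answer; the column store simply touches far less data to get
there, which is why the layout suits the analytical workloads described above.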
With the introduction of the Bigtable Distributed Storage System on top of the Google File
System (GFS) in 2006 (Chang, et al, 2006), Google has demonstrated that non-relational
databases can be scalable over multiple machines. Due to Bigtable’s proprietary nature
however, efforts have been made over the past five years to develop open source versions of
Google’s software, resulting in the arrival of the Apache Foundation’s Hadoop, initially
developed by Yahoo (Bryant and Kwan, 2008). A number of companies have now utilised
Hadoop and associated software to allow themselves to scale their database offerings to their
own requirements.
The growth of Hadoop can be inferred from some unusual indicators. From 2007 through to
early 2009, demand for expertise in Hadoop or MapReduce accounted for 0.4% of the
London IT jobs market. By January 2011, the figure had grown to 1.2%, a threefold increase
in the requirement for that expertise within two years (IT Jobs Watch, 2011). Additionally,
there was a 49% increase in Hadoop job postings in the United States from 2008 to 2009,
with most of the job offerings being in California (Lorica, 2009).
However, due to the shortage of suitably qualified Hadoop and HBase engineers in the
industry at present, development projects at a number of companies have been affected.
Within Silicon Valley, Google and Facebook are two companies that can afford to
remunerate staff competitively thanks to their large revenues. This has left Cloudera, the
start-up cloud database company, unable to offer top engineers remuneration at similar
levels to its competitors, and it has had to be imaginative about how it rewards staff. This
includes setting up offices in downtown San Francisco, on the assumption that staff would
prefer to work there rather than in Palo Alto or Mountain View, both 30 miles from the
centre of San Francisco (Metz, 2011a).
Such constraints will limit new NoSQL database projects until an adequate supply of
qualified engineers becomes available, slowing the development and adoption of this new
technology for the foreseeable future.
Cassandra
Cassandra is a distributed, column family database, developed at Facebook to solve an Inbox
Search problem (Lakshman, 2008). It is now an open source project of the Apache
Foundation (Apache, 2011).
In addition to Facebook, additional users of the Cassandra database include the social news
website Digg (Higginbotham, 2010), who decided to switch from MySQL to Cassandra due
to scalability issues with MySQL. The rationale behind the move was the decentralised
nature of Cassandra and the fact that it has no single point of failure (Kerner, 2010).
Unfortunately, the changeover to Cassandra did not run smoothly, resulting in Digg having
to revert to MySQL to ensure data integrity and keep its services available to its clients. The
episode highlighted the pitfalls of switching from one architecture framework to another
(Woods, 2010).
Taking advantage of Cassandra’s introduction to the market is Datastax, formerly Riptano
(DBMS2, 2011), a start-up founded by the Cassandra project’s chair, Jonathan Ellis. The
purpose of Datastax is to take commercial advantage of Cassandra, by selling expertise and
technical support in Cassandra (Kerner, 2010), following the examples of Red Hat (Linux)
and Cloudera (Cloud Computing) (Subramanian, 2010).
HBase
HBase is a non-relational database built on top of the Hadoop framework, using the Hadoop
Distributed File System (HDFS). Originally developed out of a need to process large amounts
of data, HBase is now a top-level Apache Foundation project (Zawodny, 2007).
Due to HBase’s ability to scale to large sizes, the database has received attention within IT
as a platform that can meet various companies’ requirements. Recent corporate
announcements about deployments of HBase have increased its marketplace viability as a
NoSQL database option (Metz, 2011b). These include both Facebook and Yahoo, two
companies with large repositories of data.
Facebook announced a new messaging platform in which email, text messages and Instant
Messages (IM), as well as Facebook’s own messaging system, would be integrated together
(Metz, 2010). Facebook experimented with a number of database offerings, including its own
Cassandra database, to see if it could handle the new system. Additionally, they excluded
MySQL due to scalability issues. Eventually, they chose HBase, due to its consistency, as
well as ability to scale across multiple machines (Muthukkaruppan, 2010).
HBase was deployed by Yahoo to handle its news aggregation algorithm. The purpose of the
new system is to data-mine content in order to optimise what the viewer sees on Yahoo’s web
portal. To deploy the most relevant news stories to the front page at any given moment,
Yahoo required a database that could query in real time the items people are most interested
in, based on the number of clicks each story receives. Deployment of this new system has
resulted in an increase in traffic to the Yahoo web portal, and subsequently an increase in
revenues (Metz, 2008).
2.5 Case Studies
2.5.1 Case Study 1- Utility companies and the data management challenge
Introduction
Utility companies are known to be among the most conservative of enterprises when it comes
to investing in technology (Fink, 2010; Fehrenbacher, 2010). There are many reasons why
this might be so: security of supply, regulatory compliance and financial austerity, together
with a lack of business drivers, often leave the risk-averse utility treading water when it
comes to IT investment (Tony Giroti, CEO Bridge Energy, 2011). However, things have been
changing over the last few years. According to recent research by Lux, utilities (mainly
power and water) will invest up to $34 billion in technology by the year 2020 (St. John,
2011). The reason arises mainly from Smart Grid projects and the growing avalanche of
associated data which utilities will need to manage (St. John, 2011). For utilities, the business
drivers required to justify investment in the kind of technology which enables integration of
data across key business units have only recently emerged. Real-time applications just
weren’t necessary before now (Giroti, 2011).
Utilities
History has shown how utilities are by and large reactionary when it comes to new ideas. For
example, a snapshot of energy utilities related articles in the Pro Quest database (available
through the TCD Library’s online resources) at various times over the last few decades shows
flurries of activity around key moments of change in the industry. Cyclical changes from
regulation to de-regulation of the energy sector in the early 1990s, begun in the US,
kick-started reactionary strategy changes within the energy industry. Ireland followed the
pattern with the Electricity Regulation Act of 1999, a programme which is nearing
completion. Fifty-six articles on related subjects between 1992 and 1994, in contrast to just
eighteen in the following six years to the year 2000 (Pro Quest database), would seem to
support this assertion.
In the last decade or so, innovation for utilities has centred on the technology enabling the
Smart Grid, and again an upsurge in articles on this subject stands out in a normally ‘steady state’
sector. More recently the pressures of a diminishing supply and subsequent higher prices of
raw material for energy production have propagated a sustainability drive.
Compliance, however, has been a steady influence on energy utilities. What makes the Smart
Grid attractive is the way it forces efficiency throughout the energy supply chain, from
generation to distribution, resulting in lower CO2 emissions, a major deliverable of the Kyoto
agreement. Related to this has been the drive towards sustainable energy generation and
supply. Vice President of Technology at Cobb Energy, Bob Arnett sums it up:
“In today’s world, where utilities are focused on environmental concerns, resource
constraints, and intelligent grids, it is sometimes hard to remember that in the mid-
Nineties, the word of the day was ‘deregulation’.”
(Arnett, 2011)
This case study looks at utility companies in the context of these three key drivers:
Regulation/Deregulation; Smart Grid and Sustainability. The case is stated in general terms
initially but quickly moves to more specific Smart Grid applications in electricity supply
companies, focusing on one Irish energy company’s use of databases in its implementation of
Smart Grid applications. As the ESB’s (Electricity Supply Board) Tom Geraghty said of
Smart Metering in a recent interview with Silicon Republic:
“How you get data back from the electronic metre to a utility central point where it is
aggregated and the bill is sent out to simply allowing people to top up their metre at
home as if it were a mobile phone shows you the complexity that lies ahead. There are
many imaginative options emerging and the opportunities are endless,”
(in Kennedy, 2011)
One estimation from Lux research puts the increase of data coming from the Smart Grid at
900% by 2020 (St. John, 2011). Tony Giroti puts this in more tangible terms: 1 million smart
meters passing data every 15 minutes equates to 30 TB of data per year to be handled, stored and
harvested (Giroti, 2011). This figure doesn’t include the real time data flowing through the
system as part of the self-healing attribute of Smart Grids.
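Giroti's figures can be sanity-checked with simple arithmetic; the implied bytes-per-reading
value below is derived here from his numbers, not stated by him:

```python
meters = 1_000_000
readings_per_day = (24 * 60) // 15  # one reading every 15 minutes = 96/day
readings_per_year = meters * readings_per_day * 365  # ~35 billion readings

stated_volume_bytes = 30 * 10**12  # Giroti's 30 TB/year figure (decimal TB)
bytes_per_reading = stated_volume_bytes / readings_per_year
# ~856 bytes per reading once metadata, indexing and overhead are included
```

Roughly 35 billion readings a year from a single million-meter deployment illustrates why
the back-end data management question has become urgent for utilities.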
The problem can be placed within the wider question asked in this dissertation, that is, what
is the future of the traditional RDBMS in the enterprise? To this end, this case study posits
that the general feeling towards newer database management solutions such as open source
and NoSQL is that, while they are attractive for certain non-core applications, they are not
yet up to the task of the more serious mission-critical functions of control systems, financial
transactions and customer management within enterprises. This study
investigates the problem in the context of traditionally risk averse utility companies and
questions if new business drivers (of which the Smart Grid is key) are forcing a rethink on
this issue.
A public utility company is an enterprise which provides key services to the public, most
typically electricity, gas, water and transportation. It may be state or privately owned, and
may operate in a regulated, deregulated or even semi-regulated market (Legal Dictionary).
The energy sector in Ireland is currently undergoing dramatic change. The two largest
energy companies in Ireland, the Electricity Supply Board (ESB) and Bord Gáis, are
commercially run enterprises and are both majority owned by the state. Both companies have
recently entered each other’s markets as a result of the state’s requirement (driven by the EU)
to open up the energy market in an attempt to improve competitiveness in the sector for the
benefit of consumers (Irish Government White Paper, 2007).
One result of this restructuring of the sector is that the separate electricity and gas markets
have been combined and the sector is now generally referred to as the energy market. The
functions carried out by utility companies differ according to the services they provide.
Energy suppliers are similar in the functions they carry out such as generation, transmission
and distribution of energy. Water utilities in other countries have moved towards a
revenue-generating model for water supply, and Ireland, rightly or wrongly, may soon
follow suit.
Each core function contains a number of supporting IT applications. Each of these in turn is
supported by a suitable data management system. Some of the major solutions used in energy
utilities include: Geographical Information System (GIS); Meter Data Management (MDM);
Customer Information System (CIS); Distribution Management System (DMS); Supervisory
Control and Data Acquisition (SCADA); and Outage Management System (OMS). Figure 2.3
shows where some of these systems fit into the overall network.
Each of these systems supports the specific needs of the different business functions, such as
supply, generation, distribution, trading and operations. As such, they may or may not be
integrated. In relation to meter data management (MDM), Giroti again states the problem
succinctly in his paper entitled “You’ve Got the Meter Data – Now What?” (2011), where he
gives two options:
1. Have a proactive strategy for integrating and managing data coming from the Grid, or
2. Be reactive in response to problems as they appear, at the risk of being left behind by
competitors adopting the former strategy.
Smart Grid - The ESB case
The European Technology Platform definition of smart grids is:
“electricity networks that can intelligently integrate the behaviour and actions of all users
connected to it - generators, consumers and those that do both – in order to efficiently deliver
sustainable, economic and secure electricity supplies” (Smart Grids: European Technology
Platform, 2010)
Successful smart grid implementation depends on how enterprises utilise information systems
in managing the torrent of data heading their way. This issue puts data management systems
right back in the foreground of the IT game.
The ESB plans to invest up to €11 billion in sustainable projects including a Smart Grid
(Strategy Framework 2020). The ESB began a pilot project for advanced metering in 2007.
Advanced meters occupy what is termed the head end of the smart grid. They reside on
customer premises or at the company’s own locations, typically at the edge of the
distribution network. The ESB has to date installed 6,500 smart meters. The estimated total
installations
required for full implementation is over two million. The data consists of messages to and
from a central management system called a meter data management system (MDM). The
message can be meter data relating to load readings, voltage and temperature measurements,
outages, faults and other events.
The ESB’s existing data management platforms include solutions from Oracle, IBM and
Microsoft. Currently no open source or NoSQL solutions exist in any official capacity in the
company. A preliminary evaluation of the open source database solution MySQL was carried
52. What is the future of the RDBMS in the Enterprise?
Page 40
out by the IT department in 2010 but no decision on implementation has been made as yet.
MySQL is now under the roof of the Oracle house following its acquisition of Sun
Microsystems in 2010 (Lohr, 2009).
(Image source: http://www.consumerenergyreport.com/wpcontent/uploads/2010/04/smartgrid.jpg)
Figure 2.2 – Overview of a generic Smart Grid
(Image source: EPRI)
Figure 2.3 - ESB proposed implementation of Advanced Metering (Key area of interest is circled)
The Data Volume Problem
A traditional electricity grid is made up of electro-mechanical components that link electricity
generation, transmission and distribution to consumers. A smart grid builds on this with
advanced digital SCADA devices that enable two-way communication of data of interest to
utilities, consumers and government (Financial Times, Nov 2010).
Figures for how much data will flow vary depending on the implementation of the smart grid.
Estimates from the ESB’s trials involving 6,500 meters show a substantial increase in the
amount of data to be stored and analysed at the back end.
Utilities, it seems, are not immune to ‘Big Data’. Tony Giroti is well qualified to comment on
the issue: he is one of only 13 elected members of the GridWise Architecture Council, formed
by the US Department of Energy for the purpose of articulating the way forward for
intelligent energy systems.
In his article for the e-magazine Electric Energy Online, “You’ve Got the Meter Data – Now
What?” (2011), Giroti states the data volume problem as follows:
Figure 2.4 – Smart Meters transaction rate
Giroti foresees the storage and processing concerns associated with this volume of data.
Figure 2.5 – Smart Meters data size
Processing this data also presents a challenge to system architects. Gathering data from a
million smart meters at 15-minute intervals, as per the example above, equates to 1,111
transactions per second, or approximately 96 million transactions per day. The problem is
further compounded by the critical requirement that the system analyse network event
transactions in real time when responding to fluctuations in demand and to faults (Giroti, 2011).
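The arithmetic behind these figures can be checked directly. The short sketch below is our own back-of-the-envelope illustration of the one-million-meter, 15-minute-interval scenario, not code from Giroti's article.

```python
def meter_throughput(meters: int, interval_minutes: int) -> tuple[float, float]:
    """Return (transactions per second, transactions per day) for a fleet
    of meters that each report once per collection interval."""
    interval_seconds = interval_minutes * 60
    tps = meters / interval_seconds          # reads arriving per second
    per_day = tps * 24 * 60 * 60             # reads arriving per day
    return tps, per_day

tps, per_day = meter_throughput(1_000_000, 15)
print(round(tps))        # ~1,111 transactions per second
print(per_day / 1e6)     # ~96 million transactions per day
```

The per-second figure matches Giroti's 1,111; sustaining it around the clock implies roughly 96 million transactions per day for the back-end system to ingest.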
One limitation of Giroti’s claim is that the article gives no indication of how the one-kilobyte-
per-transaction figure is calculated. This is an important factor for vendors of back-end
processing systems running on relational databases: the lower this number, the better. Some
systems rely on filtering out less important data at the source, that is, at the meter itself, rather
than storing superfluous data at the back end. For example, meter location information does
not change and need be sent only once. Even at a conservative data size of 128 bytes per
[Figure content: 1 million smart meters with 1 read every 15 minutes gives 1,000,000 meter
reads / (15 mins × 60 secs) = 1,111 transactions per second (Figure 2.4); at 1 KB per
transaction per meter this equates to 1.1 MB/s, with hourly collections of data amounting to
3.6 gigabytes per day to be stored, analysed and backed up (Figure 2.5).]
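The sensitivity of back-end storage volume to the per-transaction payload size can be sketched as follows. The function is our own illustration: the 1 KB figure is Giroti's, and 128 bytes is the conservative alternative discussed above; both are applied to the 15-minute-interval scenario (96 reads per meter per day).

```python
def daily_storage_gb(meters: int, reads_per_day: int, bytes_per_txn: int) -> float:
    """Raw daily storage in gigabytes for a meter fleet,
    before any compression, filtering or aggregation."""
    return meters * reads_per_day * bytes_per_txn / 1e9

# One million meters, one read every 15 minutes = 96 reads per meter per day
print(daily_storage_gb(1_000_000, 96, 1024))  # ~98 GB/day at 1 KB per transaction
print(daily_storage_gb(1_000_000, 96, 128))   # ~12 GB/day at 128 bytes per transaction
```

Reducing the payload from 1 KB to 128 bytes cuts the raw daily ingest by a factor of eight, which illustrates why source-side filtering matters so much to vendors of relational back ends.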