                 Distributed Parallel Architecture for "Big Data"

                   Catalin BOJA, Adrian POCOVNICU, Lorena BĂTĂGAN
                     Department of Economic Informatics and Cybernetics
                      Academy of Economic Studies, Bucharest, Romania
           catalin.boja@ie.ase.ro, pocovnicu@gmail.com, lorena.batagan@ie.ase.ro

This paper is an extension of the "Distributed Parallel Architecture for Storing and Processing Large Datasets" paper presented at the WSEAS SEPADS'12 conference in Cambridge. In its original version, the paper went over the benefits of using a distributed parallel architecture to store and process large datasets. This paper analyzes the problem of storing, processing and retrieving meaningful insight from petabytes of data. It provides a survey of current distributed and parallel data processing technologies and, based on them, proposes an architecture that can be used to solve the analyzed problem. In this version, more emphasis is put on distributed file systems and the ETL processes involved in a distributed environment.
Keywords: Large Dataset, Distributed, Parallel, Storage, Cluster, Cloud, MapReduce, Hadoop


1 Introduction
A look back at recent history shows that data storage capacity is continuously increasing, while the cost per GB stored is decreasing (figure 1). The first hard disk came from IBM in 1956; it was called the IBM 350 Disk Storage and had a capacity of 5 MB [10-11]. In 1980, the IBM 3380 broke the gigabyte-capacity limit, providing storage for 2.52 GB. After 27 years, Hitachi GST, which had acquired IBM's drive division in 2003, delivered the first terabyte hard drive. Only two years later, in 2009, Western Digital launched the industry's first two terabyte hard drive, and in 2011 Seagate introduced the world's first 4 TB hard drive [11].

Fig. 1. Average HDD capacity (based on [13])

In terms of price [12], the cost per gigabyte decreased from an average of 300,000 $ to merely an average of 11 cents over the last 30 years (figure 2). As an illustration, in 1981 one had to use 200 Seagate units, each having a five megabyte capacity and costing 1,700 $, to store one gigabyte of data.

Fig. 2. Average $ cost per GB (based on [13])

The reduced cost of data storage has been the main premise of the current data age, in which it is possible to record and store almost everything, from business to personal data. As an example, in the field of digital photos, it is estimated that over 3.5 trillion photos have been taken around the world, and that in 2011 alone around 360 billion new snapshots were made [16]. Also, the availability and speed of Internet connections around the world have generated increased data traffic coming from a wide range of mobile devices and desktop computers. For mobile data alone, Cisco expects that traffic will grow from
0.6 exabytes (EB) in 2011 to 6.3 EB in 2015 [8]. The expansion of data communications has promoted different data services, social, economic or scientific, to central nodes for storing and distributing large amounts of data. For example, the Facebook social network hosts more than 140 billion photos, more than double the 60 billion pictures it hosted at the end of 2010. In terms of storage, all these snapshots take up more than 14 petabytes. In other fields, like research, the Large Hadron Collider particle accelerator near Geneva will produce about 15 petabytes of data per year. The SETI project records around 30 terabytes of data each month, which are processed by over 250,000 computers each day [14]. The supercomputer of the German Climate Computing Center (DKRZ) has a storage capacity of 60 petabytes of climate data. In the financial sector, records of everyday financial operations generate huge amounts of data; the New York Stock Exchange alone records about one terabyte of trade data per day [15].
Despite this spectacular evolution of storage capacity and repository size, the problem that arises is being able to process all this data. The issue is driven by the available computing power, algorithm complexity and access speeds. This paper surveys different technologies used to manage and process large data volumes and proposes a distributed and parallel architecture used to acquire, store and process large datasets. The objective of the proposed architecture is to implement a cluster analysis model.

2 Processing and storing large datasets
As professor Anand Rajaraman observed, more data usually beats better algorithms [15]. The remark was used to question the efficiency of an algorithm proposed for the Netflix Challenge [17]. Although the statement is still debatable, it brings up a true point: a given data mining algorithm yields better results with more data, and it can reach the same accuracy as a better or more complex algorithm. In the end, the objective of a data analysis and mining system is to process more data with better algorithms [15].
In many fields more data is important because it provides a more accurate description of the analyzed phenomenon. With more data, data mining algorithms are able to extract a wider group of influence factors and more subtle influences.
Today, large datasets mean volumes of hundreds of terabytes or petabytes, and these are real scenarios. The problem of storing these large datasets is generated by the impossibility of having a single drive of that size and, more importantly, by the large amount of time required to access the data.
The access speed of large data is affected by the disk's performance characteristics [22] (Table 1): internal data transfer, external data transfer, cache memory, access time and rotational latency all generate delays and bottlenecks.

Table 1. Example of disk drive performance (source [22])

Interface      | Spindle [rpm]   | Avg. rotational latency [ms] | Internal transfer [Mbps] | External transfer [MBps] | Cache [MB]
SATA           | 7,200           | 11                           | 1030                     | 300                      | 8 – 32
SCSI           | 10,000          | 4.7 – 5.3                    | 944                      | 320                      | 8
High-end SCSI  | 15,000          | 3.6 – 4.0                    | 1142                     | 320                      | 8 – 16
SAS            | 10,000 / 15,000 | 2.9 – 4.4                    | 1142                     | 300                      | 16


Despite the rapid evolution of drive capacity, described in figure 1, large datasets of up to one petabyte can only be stored on multiple disks. Using 4 TB drives, it takes 250 of them to store 1 PB of data. Once the storage problem is solved, another question arises regarding how easy or hard it is to read that data. Considering an optimal transfer rate of 300 MB/s, the entire dataset is read in 38.5 days. A simple solution is to read from all the disks at once; in this way, the entire dataset is read in only 3.7 hours.
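The gain can be made explicit with a back-of-the-envelope calculation, assuming 1 PB = 10^9 MB and the 300 MB/s optimal rate quoted above:

\[
t_{serial} = \frac{10^{9}\ \mathrm{MB}}{300\ \mathrm{MB/s}} \approx 3.33\times 10^{6}\ \mathrm{s} \approx 38.5\ \mathrm{days}, \qquad
t_{parallel} = \frac{t_{serial}}{250} \approx 1.33\times 10^{4}\ \mathrm{s} \approx 3.7\ \mathrm{hours}
\]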
If we take into consideration the communication channel, then other bottlenecks are generated by the available bandwidth. In the end, the performance of the solution is reduced to the speed of its slowest component.
Other characteristics of large datasets add supplementary levels of difficulty:
 many input sources; in different economic and social fields there are multiple sources of information;
 redundancy, as the same data can be provided by different sources;
 lack of normalization or data representation standards; data can have different formats, unique IDs and measurement units;
 different degrees of integrity and consistency; data that describes the same phenomenon can vary in terms of measured characteristics, measuring units, time of the record and methods used.
For limited datasets the efficient data management solution is given by relational SQL databases [20], but for large datasets some of their founding principles are eroded [3], [20], [21].

3 Proposed Approach
The next solutions provide answers to the question of how to store and process large datasets in an efficient manner that can justify the effort.

3.1 Large Data Sets Storage
When accessing large datasets, the storage file system can become a bottleneck. That is why a lot of thought was put into redesigning the traditional file system for better performance when accessing large data files.
In a distributed approach, the file system aims to achieve the following goals:
 it should be scalable; the file system should allow additional hardware to be added to increase storing capacity and/or performance;
 it should offer high performance; the file system should be able to locate the data of interest on the distributed nodes in a timely manner;
 it should be reliable; the file system should be able to recreate from the distributed nodes the original data in a complete and undistorted manner;
 it should have high availability; the file system should account for failures and incorporate mechanisms for monitoring, error detection, fault tolerance and automatic recovery.
Google, one of the biggest search engine service providers, which handles "big web data", has published a paper about the file system it claims is used by its business, called "The Google File System".
From an architecture point of view, the Google File System (GFS) comprises a "single master", multiple "chunk servers" and is accessed by multiple "clients".

Fig. 3. Simplified GFS Architecture: clients send (file name, chunk index) requests to the GFS master and receive (chunk handle, chunk locations); they then request (chunk handle, byte range) from the chunk servers, which return the chunk data; the master periodically sends instructions to the chunk servers and collects their state
In the simplified version of the GFS architecture, figure 3 shows how the GFS clients communicate with the master and with the chunk servers. Basically, the clients ask for a file and the GFS master tells them which chunk servers contain the data for that file. The GFS clients then request the data from those chunk servers, and the chunk servers transmit the data directly to the clients. No user data actually passes through the GFS master, which avoids turning the master into a bottleneck in the data transmission.
Periodically, the GFS master communicates with the chunk servers to get their state and to transmit instructions to them.
Each chunk of data is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation [33].
The chunk size is a key differentiator from more common file systems. GFS uses 64 MB for a chunk; this limits the number of requests to the master for chunk locations, reduces the network overhead and reduces the metadata size on the master, allowing the master to store the metadata in memory.
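The offset-to-chunk translation that a client performs before contacting the master can be sketched in a few lines of Java; the class and method names below are illustrative only, not part of the published GFS interface [33]:

// Illustrative sketch of how a GFS-style client derives the chunk index it
// sends to the master; the names are hypothetical, not the real GFS API.
public class ChunkLocator {
    // fixed 64 MB chunk size, as described for GFS [33]
    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    // the client translates a byte offset within a file into a chunk index,
    // then asks the master for the (chunk handle, chunk locations) pair
    static long chunkIndex(long byteOffset) {
        return byteOffset / CHUNK_SIZE;
    }

    public static void main(String[] args) {
        // a read at offset 200 MB falls into chunk 3 of the file
        System.out.println(chunkIndex(200L * 1024 * 1024)); // prints 3
    }
}

With 64 MB chunks, even a petabyte of data maps to roughly 16 million chunk entries, which is what makes it feasible for a single master to keep all chunk metadata in memory.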
Hadoop, a software framework derived from Google's papers about MapReduce and GFS, offers a similar file system, called the Hadoop Distributed File System (HDFS). What GFS identifies as the "Master" is called the NameNode in HDFS, and the GFS "Chunk Servers" can be found as "DataNodes" in HDFS.

3.2 ETL
The Extract, Transform and Load (ETL) process provides an intermediary transformation layer between outside sources and the end target database.
In the literature [29], [30] we identified both ETL and ELT. ETL refers to extract, transform and load; in this case, activities start with the use of applications that perform data transformations outside of a database, on a row-by-row basis. ELT, on the other hand, refers to extract, load and transform, which implies using the relational database first, before performing any transformations of source data into target data.
The ETL process [32] is based on three elements:
 Extract – the process in which the data is read from multiple source systems into a single format; in this step, data is extracted from the data source;
 Transform – in this step, the source data is transformed into a format relevant to the solution; the process transforms the data coming from the various systems and makes it consistent;
 Load – the transformed data is written into the warehouse.
Usually, the systems that acquire data are optimized so that the data is stored as fast as possible. Most of the time, comprehensive analyses require access to multiple sources of data, and it is common that those sources store raw data that yields minimal information unless properly processed.
This is where the ETL or ELT processes come into play. An ETL process will take the data stored in multiple sources, transform it so that the metrics and KPIs are readily accessible, and load it into an environment that has been modeled so that the analysis queries are more efficient [23]. An ETL system is part of a bigger architecture that includes at least one Database Management System (DBMS); it is placed upstream from the DBMS because it feeds data directly into the next level.
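As an illustration of the three elements above, here is a minimal, self-contained ETL sketch in Java; the CSV layout, the normalization rule and the load target are assumptions made for the example, not the behavior of any particular ETL tool:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Minimal ETL sketch: extract rows from a hypothetical CSV source,
 *  transform them to a consistent format, load them into a target
 *  store (stdout stands in for the warehouse here). */
public class MiniEtl {
    public static void main(String[] args) throws IOException {
        List<String[]> rows = extract("sales.csv");   // hypothetical source file
        List<String[]> clean = transform(rows);
        load(clean);
    }

    // Extract: read the source system into a single, row-based format
    static List<String[]> extract(String path) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line.split(","));
            }
        }
        return rows;
    }

    // Transform: make the data consistent, e.g. unify the country column
    // and discard incomplete records
    static List<String[]> transform(List<String[]> rows) {
        List<String[]> out = new ArrayList<>();
        for (String[] r : rows) {
            if (r.length < 3) continue;           // discard inconsistent records
            r[1] = r[1].trim().toUpperCase();     // normalize the country column
            out.add(r);
        }
        return out;
    }

    // Load: write the transformed rows into the warehouse
    static void load(List<String[]> rows) {
        for (String[] r : rows) System.out.println(String.join("|", r));
    }
}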
The advantages of ETL tools are [29], [30], [31]:
 they save time and costs when developing and maintaining data migration tasks;
 they can be used for complex processes that extract, transform and load heterogeneous data into a data warehouse, or for other data migration tasks;
 in larger organizations, different data integration and warehouse projects accumulate, and such processes encompass common sub-processes, shared data sources and targets, and same or similar operations;
 ETL tools support all common databases, file formats and data transformations, simplify the reuse of already created (sub-)processes due to a collaborative development platform, and provide central scheduling;
 portability: usually the ETL code can be developed on a specific target database and ported later to other supported databases.
When developing the ETL process, there are two options: either take advantage of an existing ETL tool (some key players in this domain are IBM DataStage, Ab Initio and Informatica) or custom code it. Both approaches have benefits and pitfalls that need to be carefully considered when selecting what better fits the specific environment.
The benefits of an in-house, custom built ETL process are:
 flexibility; the custom ETL process can be designed to solve requirements specific to the organization that some of the ETL tools may have limitations with;
 performance; the custom ETL process can be finely tuned for better performance;
 tool agnosticism; the custom ETL process can be built using skills available in-house;
 cost efficiency; custom ETL processes usually use resources already available in the organization, eliminating the additional costs of licensing an ETL tool and of training the internal resources to use that specific ETL tool.

3.3 MapReduce, Hadoop and HBase
MapReduce (MR) is a programming model and an associated implementation for processing and generating large data sets [18]. The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google. The foundations of the MapReduce model are a map function used to process key-value pairs and a reduce function that merges all intermediate values of the same key.
The large data set is split into smaller subsets, which are processed in parallel by a large cluster of commodity machines.
The Map function [27] takes input data and produces a set of intermediate subsets. The MapReduce library groups together all intermediate subsets associated with the same intermediate key and sends them to the Reduce function.
The Reduce function also accepts an intermediate key and its subsets. It merges these subsets together to form a possibly smaller set of values; normally, just zero or one output value is produced per Reduce invocation.
In [26] it is highlighted that many real world tasks use the MapReduce model. The model is used for web search services, for sorting and processing data, for data mining, for machine learning and for a big number of other systems.
The entire framework manages how data is split among nodes and how intermediary query results are aggregated.
A general MapReduce architecture can be illustrated as in Figure 4.

Fig. 4. MapReduce architecture (based on [28])

The MR advantages are [1], [6], [7], [18], [20], [24], [27]:
 the model is easy to use, even for programmers without experience with parallel and distributed systems;
 storage-system independence, as it does not require proprietary database file systems or predefined data models; data is stored in plain text files and is not required to respect relational data schemas or any structure; in fact, the architecture can use data that has an arbitrary format;
 fault tolerance;
 the framework is available from high level programming languages; one such solution is the open-source Apache Hadoop project, which is implemented in Java;
 the query language allows record-level manipulation;
 projects such as Pig and Hive [34] provide a rich interface that allows programmers to join datasets without repeating simple MapReduce code fragments.
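The classic illustration of the model is the word count job from the Apache Hadoop documentation; the sketch below follows that example, with the map function emitting a (word, 1) pair per token and the reduce function merging all intermediate values of the same key:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in the input split, emit the intermediate pair (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: merge all intermediate values of the same key into one output value
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The job is typically launched with hadoop jar wordcount.jar WordCount <input> <output> (the jar name here is a placeholder); the framework takes care of splitting the input among the nodes and of grouping and aggregating the intermediate pairs, as described above.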
Hadoop is a distributed computing platform, an open source implementation of the MapReduce framework proposed by Google [28]. It is based on Java and uses the Hadoop Distributed File System (HDFS). HDFS is the primary storage system used by Hadoop applications; it is used to create multiple replicas of data blocks for reliability, distributing them around the clusters and splitting the task into small blocks. The relationship between Hadoop, HBase and HDFS can be illustrated as in Figure 5.
HBase is a distributed database: an open source database project, distributed, versioned and column-oriented, modeled after Google's Bigtable [25].

Fig. 5. The relationship between Hadoop, HBase and HDFS (based on [28]): both Hadoop and HBase read their input from and write their output to HDFS

Some of the features of HBase, as listed at [25], are:
 convenient base classes for backing Hadoop MapReduce jobs with HBase tables, including cascading, Hive and Pig source and sink modules;
 query predicate push down via server side scan and get filters;
 optimizations for real time queries;
 a Thrift gateway and a REST-ful Web service that supports XML, Protobuf and binary data encoding options;
 an extensible JRuby-based (JIRB) shell;
 support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
The HBase database stores data in labeled tables. In this context [28], the table is designed to have a sparse structure: data is stored in table rows, and each row has a unique key with an arbitrary number of columns.
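This row-key/column model can be sketched with the HBase client API; the "photos" table, its "meta" column family and the row key below are made-up examples, and the calls shown follow the post-1.0 client API, which is newer than the HBase release contemporary with this paper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch of HBase's data model: a labeled table of rows, each identified
 *  by a unique key, with a sparse, arbitrary set of columns per row.
 *  Table, family and column names are hypothetical. */
public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table photos = conn.getTable(TableName.valueOf("photos"))) {
            // The row key uniquely identifies the row; columns are sparse:
            // a row stores only the cells that were actually written.
            Put put = new Put(Bytes.toBytes("user42-img-0001"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("taken"), Bytes.toBytes("2011-08-14"));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("camera"), Bytes.toBytes("DSLR"));
            photos.put(put);

            // Read the row back by its key
            Result row = photos.get(new Get(Bytes.toBytes("user42-img-0001")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("meta"), Bytes.toBytes("taken"))));
        }
    }
}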
3.4 Parallel database systems
A distributed database (DDB) is a collection of multiple, logically interconnected databases distributed over a computer network. A distributed database management system (distributed DBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users. A parallel DBMS is a DBMS implemented on a multiprocessor computer [21]. The parallel DBMS implements the concept of horizontal partitioning [24] by distributing parts of a large relational table across multiple nodes to be processed in parallel. This requires a partitioned execution of the SQL operators. Some basic operations, like a simple SELECT, can be executed independently on all the nodes; more complex operations are executed through a multiple-operator pipeline. Different multiprocessor parallel system architectures [21], like shared-memory, shared-disk or shared-nothing, define possible strategies to implement a parallel DBMS, each with its own advantages and drawbacks. The shared-nothing approach distributes data across independent nodes and has been implemented by many commercial systems, as it provides extensibility and availability.
Based on the above definitions, we can conclude that parallel database systems improve the performance of data processing by parallelizing loading, indexing and querying data. In distributed database systems, data is stored in
different DBMSs that can function independently. Because parallel database systems may distribute data to increase the architecture's performance, there is a fine line that separates the two concepts in real implementations.
Despite the differences between parallel and distributed DBMSs, most of their advantages are common to a simple DBMS [20], [35]:
 stored data conforms to a well-defined schema; this validates the data and provides data integrity;
 data is structured in a relational paradigm of rows and columns;
 SQL queries are fast;
 the SQL query language is flexible, easy to learn and read, and allows programmers to implement complex operations with ease;
 they use hash or B-tree indexes to speed up access to data;
 they can efficiently process datasets up to two petabytes of data.
Known commercial parallel databases such as Teradata, Aster Data, Netezza [9], DATAllegro, Vertica, Greenplum, IBM DB2 and Oracle Exadata have proven successful because they:
 allow linear scale-up [21]; the system can maintain constant performance as the database size increases, by adding more nodes to the parallel system;
 allow linear speed-up [21]; for a database with a constant size, the performance can be increased by adding more components like processors, memory and disks;
 implement inter-query, intra-query and intra-operation parallelism [21];
 reduce the implementation effort;
 reduce the administration effort;
 offer high availability.
In a massively parallel processing (MPP) architecture, adding more hardware allows for more storage capacity and increases query speeds. An MPP architecture, implemented as a data warehouse appliance, reduces the implementation effort, as the hardware and software are preinstalled and tested to work on the appliance prior to the acquisition. It also reduces the administration effort, as it comes as a single vendor, out of the box solution. Data warehouse appliances offer high availability through built-in fail-over capabilities, using data redundancy for each disk.
Ideally, each processing unit of the data warehouse appliance should process the same amount of data at any given time. To achieve that, the data should be distributed uniformly across the processing units. Data skew is a measure that evaluates how data is distributed across the processing units; a data skew of 0 means that the same number of records is distributed on each processing unit, and a data skew of 0 is ideal.
By having each processing unit do the same amount of work, it is ensured that all processing units finish their task at about the same time, minimizing any waiting times.
Another aspect that has an important impact on query performance is having all the related data on the same processing unit, so that the time required to transfer data between the processing units is eliminated. For example, if the user requires a sales by country report, having both the sales data for a customer and his geographic information on the same processing unit will ensure that the processing unit has all the information that it needs, and each processing unit is able to perform its tasks independently.
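A minimal sketch of this idea, assuming a hypothetical customer ID as the distribution key so that a customer's sales and geographic rows collocate on the same unit; the skew check simply compares the per-unit record counts with the uniform ideal:

import java.util.Arrays;
import java.util.List;

/** Sketch of shared-nothing horizontal partitioning: rows are assigned to
 *  processing units by hashing a distribution key, and skew is judged by
 *  how far the per-unit record counts drift from the uniform ideal. */
public class Partitioner {
    static final int UNITS = 4;                  // number of processing units

    // Same key -> same unit, so all rows related to a customer are collocated
    static int unitFor(String distributionKey) {
        return Math.floorMod(distributionKey.hashCode(), UNITS);
    }

    public static void main(String[] args) {
        List<String> customerIds = Arrays.asList("C1001", "C1002", "C1003", "C1004", "C1005");
        long[] counts = new long[UNITS];
        for (String id : customerIds) {
            counts[unitFor(id)]++;               // route the row to its unit
        }
        // A data skew of 0 means every unit holds the same number of records
        long ideal = customerIds.size() / UNITS;
        for (int u = 0; u < UNITS; u++) {
            System.out.printf("unit %d: %d rows (ideal %d)%n", u, counts[u], ideal);
        }
    }
}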
The way data is distributed across the parallel database nodes influences the overall performance. Though the power of the parallel DBMS is given by the number of nodes, this can also be a drawback: for simple queries, the actual processing time can be much smaller than the time needed to launch the parallel operation. Also, nodes can become hot spots or bottlenecks that delay the entire system.

4 Proposed architecture
The proposed architecture is used to process large financial datasets. The results of the data mining analysis help economists to identify patterns in economic clusters that validate existing economic models or help to define
new ones. The bottom-up approach is more efficient, but more difficult to implement, because it can suggest, based on real economic data, relations between economic factors that are specific to the cluster model. The difficulty comes from the large volume of economic data that needs to be analyzed.
The architecture, described in figure 6, has three layers:
 the input layer implements the data acquisition processes; it gets data from different sources, reports, data repositories and archives which are managed by governmental and public structures, economic agencies and institutions, and NGO projects; some global sources of statistical economic data are Eurostat, the International Monetary Fund and the World Bank; the problem with these sources is that they use independent data schemas, and bringing them to a common format is an intensive data processing stage; taking into consideration national data sources, or crawling the Web for free data, the task becomes a very complex one;
 the data layer stores and processes large datasets of economic and financial records; this layer implements distributed, parallel processing;
 the user layer provides access to data and manages requests for analysis and reports.
The ETL intermediary layer, placed between the first two main layers, collects data from the data crawler and harvester component, converts it into a new form and loads it into the parallel DBMS data store. The ETL normalizes the data, transforms it based on a predefined structure and discards unneeded or inconsistent information.
The ETL layer inserts data into the parallel distributed DBMS that implements the Hadoop and MapReduce framework. The objective of the layer is to normalize data and bring it to a common format, as requested by the parallel DBMS.
Using an ETL process, data collected by the data crawler & harvester gets consolidated, transformed and loaded into the parallel DBMS, using a data model optimized for data retrieval and analysis.
It is important that the ETL server also supports parallel processing, allowing it to transform large data sets in a timely manner. ETL tools like Ab Initio, DataStage and Informatica have this capability built in. If the ETL server does not support parallel processing, then it should just define the transformations and push the processing to the target parallel DBMS.

Fig. 6. Proposed architecture

The user will submit his inquiries through a front end application server, which will convert them into queries and submit them to the parallel DBMS for processing.
The front end application server will include a user friendly metadata layer that will allow the user to query the data ad-hoc; it will also include canned reports and dashboards.
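A sketch of how the front end application server might forward such an inquiry as a SQL query over JDBC; the connection string, credentials and schema are placeholders, and the DBMS behind the connection stands in for the parallel data layer:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Sketch of the front end application server converting a user inquiry
 *  into a query and submitting it to the parallel DBMS for processing.
 *  The JDBC URL and table layout are hypothetical. */
public class FrontEndQuery {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://dbms-head-node/economics", "analyst", "secret");
             Statement stmt = db.createStatement();
             // e.g. the "sales by country" report mentioned in section 3.4
             ResultSet rs = stmt.executeQuery(
                 "SELECT country, SUM(amount) FROM sales GROUP BY country")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}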
The objective of the proposed architecture is to separate the layers that are data processing intensive and to link them by ETL services that will act as data buffers or cache zones. The ETL will also support the transformation effort.
For the end user the architecture is completely transparent; all he will experience is the look and feel of the front end application.
Based on extensive comparisons between MR and parallel DBMSs [6], [20], [24], we conclude that there is no all-scenarios good solution for large scale data analysis, because:
 both solutions can be used for the same processing task; any parallel processing task can be implemented either as a combination of queries or as a set of MR jobs;
 the two approaches are complementary; each solution gives better performance than the other in particular data scenarios, and one must decide which approach saves time for the needed data processing task;
 the process of configuring and loading data into the parallel DBMS is more complex and time consuming than setting up the MR architecture; one reason is that the DBMS requires complex schemas to describe data, whereas MR can process data in arbitrary formats;
 the performance of the parallel DBMS is given by the system's fine tuning level; the system must be configured according to the tasks that need to be completed and to the available resources;
 common MR implementations take full advantage of the reduced complexity by processing data with a simple structure, because the entire MR model is built on the key-value pair format; an MR system can be used to process more complex data, but the input data structure must be integrated in a custom parser in order to obtain appropriate semantics; not relying on a commonly recognized data structure has another drawback, as data is not validated by default by the system, and this can lead to situations in which modified data violates integrity or other constraints; in contrast, the SQL query language used by any parallel DBMS takes full advantage of the data schema in order to obtain a full description of the data, and the same schema is used to validate data;
 DBMSs use B-tree indexes to achieve fast searching times; indexes can be defined on any attributes and are managed by the system; the MR framework does not implement this built-in facility, and a similar functionality has to be implemented by the programmers, who control the data fetching mechanism;
 DBMSs provide high level querying languages, like SQL, which are easy to read and write; instead, MR uses code fragments, seen as algorithms, to process records; projects such as Pig and Hive [34] provide a rich interface based on high-level programming languages that allows programmers to reuse code fragments;
 parallel DBMSs have been proved to be more efficient in terms of speed, but they are more vulnerable to node failures.
A decision on which approach to take must be made taking into consideration:
 performance criteria;
 the internal structure of the processed data;
 the available hardware infrastructure;
 maintenance and software costs.
A combined MR-parallel DBMS solution [36] can be a possibility, as it benefits from the advantages of each approach.

5 Conclusion
Processing large datasets obtained from multiple sources is a daunting task, as it requires tremendous storing and processing capacities. Also, processing and analyzing large volumes of data becomes non-feasible using a traditional serial approach. Distributing the data across multiple processing units and processing it in parallel yields linearly improved processing speeds.
When distributing the data, it is critical that each processing unit is allocated the same number of records and that all the related data sets reside on the same processing unit.
Using a multi-layer architecture to acquire, transform, load and analyze the data ensures that each layer can use the best of breed for its specific task.
For the end user, the experience is transparent. Despite the number of layers behind the scenes, all that is exposed to him is a user friendly interface supplied by the front end application server.
In the end, once the storing and processing issues are solved, the real problem is to search for relationships between different types of data [3]. Others have done it very successfully, like Google in Web searching or Amazon in e-commerce.

Acknowledgment
This work was supported by the European Social Fund through the Sectoral Operational Programme Human Resources Development 2007-2013, project number POSDRU/89/1.5/S/59184, "Performance and excellence in postdoctoral research in Romanian economics science domain".
                                                         hard_disks
References
[1] T. White, Hadoop: The Definitive Guide, O'Reilly, 2009
[2] G. Bruce Berriman, S. L. Groom, "How Will Astronomy Archives Survive the Data Tsunami?," ACM Queue, Vol. 9, No. 10, 2011, http://queue.acm.org/detail.cfm?id=2047483
[3] P. Helland, "If You Have Too Much Data, then "Good Enough" Is Good Enough," ACM Queue, Vol. 9, No. 5, 2011
[4] Apache Hadoop, http://en.wikipedia.org/wiki/Hadoop
[5] Apache Software Foundation, Apache Hadoop, http://wiki.apache.org/hadoop/FrontPage
[6] J. Dean, S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Communications of the ACM, Vol. 53, No. 1, 2010
[7] M. C. Chu-Carroll, Databases are hammers; MapReduce is a screwdriver, Good Math, Bad Math, 2008, http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
[8] Cisco, Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2010–2015, http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html
[9] D. Henschen, "New York Stock Exchange Ticks on Data Warehouse Appliances," InformationWeek, 2008, http://www.informationweek.com/news/software/bi/207800705
[10] IBM, IBM 350 disk storage unit, http://www-03.ibm.com/ibm/history/exhibits/storage/storage_profile.html
[11] Wikipedia, History of IBM magnetic disk drives, http://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives
[12] Wikipedia, History of hard disk drives, http://en.wikipedia.org/wiki/History_of_hard_disks
[13] I. Smith, Cost of Hard Drive Storage Space, http://ns1758.ca/winch/winchest.html
[14] Seti@home, SETI project, http://setiathome.berkeley.edu/
[15] A. Rajaraman, More data usually beats better algorithms, part I and part II, 2008, http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
[16] J. Good, How many photos have ever been taken?, Sep 2011, http://1000memories.com/blog/
[17] J. Bennett, S. Lanning, "The Netflix Prize," Proceedings of KDD Cup and Workshop 2007, San Jose, California, 2007
[18] J. Dean, S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, Vol. 51, No. 1, pp. 107–113, 2008
[19] S. Messenger, Meet the world's most powerful weather supercomputer, 2009, http://www.treehugger.com/clean-technology/meet-the-worlds-most-powerful-weather-supercomputer.html
[20] A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. DeWitt, S. Madden and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in Proceedings of the 2009 ACM SIGMOD International Conference, 2009
[21] M. Tamer Özsu, P. Valduriez, "Distributed and Parallel Database Systems," ACM Computing Surveys, Vol. 28, 1996, pp. 125–128
[22] Seagate, Performance Considerations, http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations
[23] P. Vassiliadis, "A Survey of Extract-Transform-Load Technology," International Journal of Data Warehousing & Mining, Vol. 5, No. 3, pp. 1–27, 2009
[24] M. Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes?," Communications of the ACM, Vol. 53, No. 1, pp. 64–71, 2010
[25] Apache HBase, http://hbase.apache.org/
[26] H. Yang, A. Dasdan, R. Hsiao, D. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," in Proceedings of the 2007 ACM SIGMOD International Conference, pp. 1029–1040, 2007, http://www.mendeley.com/research/mapreducemergesimplified-relational-data-processing-on-large-clusters/
[27] J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," USENIX OSDI '04: 6th Symposium on Operating Systems Design and Implementation, http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf
[28] X. Yu, "Estimating Language Models Using Hadoop and HBase," Master of Science in Artificial Intelligence, University of Edinburgh, 2008, http://homepages.inf.ed.ac.uk/miles/msc-projects/yu.pdf
[29] ETL Architecture Guide, http://www.ipcdesigns.com/etl_metadata/ETL_Architecture_Guide.html
[30] W. Dumey, A Generalized Lesson in ETL Architecture, Durable Impact Consulting, Inc., June 11, 2007, available online at http://www.scribd.com/doc/55883818/A-Generalized-Lesson-in-ETL-Architecture
[31] A. Albrecht, METL: Managing and Integrating ETL Processes, VLDB '09, August 24-28, 2009, Lyon, France, VLDB Endowment, http://www.vldb.org/pvldb/2/vldb09-1051.pdf
[32] R. Davenport, ETL vs ELT, Insource IT Consultancy, Insource Data Academy, June 2008, http://www.dataacademy.com/files/ETL-vs-ELT-White-Paper.pdf
[33] S. Ghemawat, H. Gobioff and Shun-Tak Leung (Google), "The Google File System," http://www.cs.brown.edu/courses/cs295-11/2006/gfs.pdf
[34] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in SIGMOD '08, pp. 1099–1110, 2008
[35] S. Pukdesree, V. Lacharoj and P. Sirisang, "Performance Evaluation of Distributed Database on PC Cluster Computers," WSEAS Transactions on Computers, Issue 1, Vol. 10, January 2011, pp. 21–30, ISSN 1109-2750
[36] A. Abouzeid, K. Bajda-Pawlikowski, D.J. Abadi, A. Silberschatz and A. Rasin, "HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads," in Proceedings of the Conference on Very Large Databases, 2009
[37] C. Boja, A. Pocovnicu, "Distributed Parallel Architecture for Storing and Processing Large Datasets," 11th WSEAS Int. Conf. on Software Engineering, Parallel and Distributed Systems (SEPADS '12), Cambridge, UK, February 22-24, 2012, ISBN: 978-1-61804-070-1, pp. 125–130
Catalin BOJA is Lecturer at the Economic Informatics Department of the Academy of Economic Studies in Bucharest, Romania. In June 2004 he graduated from the Faculty of Cybernetics, Statistics and Economic Informatics at the Academy of Economic Studies in Bucharest. In March 2006 he graduated from the Informatics Project Management Master program organized by the Academy of Economic Studies of Bucharest. He is a team member in various ongoing university research projects, where he has applied most of his project management knowledge. He has also received a type D IPMA certification in project management from the Romanian Project Management Association, which is a partner of the IPMA organization. He is the author of more than 40 journal articles and scientific presentations at conferences. His work focuses on the analysis of data structures, assembler and high level programming languages. He currently holds a PhD degree on software optimization and on the improvement of software applications performance.

Adrian POCOVNICU is a PhD Candidate at the Academy of Economic Studies. His main research areas are: multimedia databases, information retrieval, multimedia compression algorithms and data integration. He is a Data Integration Consultant for ISA Consulting, USA.

Lorena BĂTĂGAN graduated from the Faculty of Cybernetics, Statistics and Economic Informatics in 2002. She became a teaching assistant in 2002 and has been a university lecturer since 2009 at the Faculty of Cybernetics, Statistics and Economic Informatics of the Academy of Economic Studies. She obtained a PhD degree in Economic Cybernetics and Statistics in 2007. She is the author and co-author of 4 books and of over 50 articles in journals and proceedings of national and international conferences and symposiums.

 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
 
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Majnu Ka Tilla, Delhi Contact Us 8377877756
 
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best ServicesMysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
Mysore Call Girls 8617370543 WhatsApp Number 24x7 Best Services
 
It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 May
 
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
Call Girls in Delhi, Escort Service Available 24x7 in Delhi 959961-/-3876
 
Falcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to ProsperityFalcon's Invoice Discounting: Your Path to Prosperity
Falcon's Invoice Discounting: Your Path to Prosperity
 
JAYNAGAR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
JAYNAGAR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLJAYNAGAR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
JAYNAGAR CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
 
PHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation FinalPHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation Final
 
Malegaon Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Malegaon Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort ServiceMalegaon Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
Malegaon Call Girls Service ☎ ️82500–77686 ☎️ Enjoy 24/7 Escort Service
 
Cracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptxCracking the Cultural Competence Code.pptx
Cracking the Cultural Competence Code.pptx
 
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
 
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Call Girls In Majnu Ka Tilla 959961~3876 Shot 2000 Night 8000
Call Girls In Majnu Ka Tilla 959961~3876 Shot 2000 Night 8000Call Girls In Majnu Ka Tilla 959961~3876 Shot 2000 Night 8000
Call Girls In Majnu Ka Tilla 959961~3876 Shot 2000 Night 8000
 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
 
Phases of Negotiation .pptx
 Phases of Negotiation .pptx Phases of Negotiation .pptx
Phases of Negotiation .pptx
 

Distributed parallel architecture for big data

In terms of price [12], the cost per gigabyte decreased from an average of $300,000 to merely an average of 11 cents over the last 30 years. This increased data traffic is generated by a wide range of mobile devices and desktop computers. For mobile data alone, Cisco expects that traffic will grow from 0.6 exabytes (EB) in 2011 to 6.3 EB in 2015, [8].
The expansion of data communications has turned different data services, whether social, economic or scientific, into central nodes for storing and distributing large amounts of data. For example, the Facebook social network hosts more than 140 billion photos, more than double the 60 billion pictures it hosted at the end of 2010. In terms of storage, all these snapshots take up more than 14 petabytes. In other fields, such as research, the Large Hadron Collider particle accelerator near Geneva produces about 15 petabytes of data per year, and the SETI project records around 30 terabytes of data each month, processed by over 250,000 computers each day, [14]. The supercomputer of the German Climate Computing Center (DKRZ) has a storage capacity of 60 petabytes of climate data. In the financial sector, records of everyday financial operations generate huge amounts of data; the New York Stock Exchange alone records about one terabyte of trade data per day, [15].

Despite this spectacular evolution of storage capacity and deposit size, the remaining problem is being able to process the data. The issue is driven by the available computing power, algorithm complexity and access speeds. This paper surveys different technologies used to manage and process large data volumes and proposes a distributed and parallel architecture for acquiring, storing and processing large datasets. The objective of the proposed architecture is to implement a cluster analysis model.

2 Processing and storing large datasets
As professor Anand Rajaraman observed, more data usually beats better algorithms, [15]. The remark was used to highlight the efficiency of a proposed algorithm for the Netflix Challenge, [17]. Although the statement is still debatable, it makes a valid point: a given data mining algorithm yields better results with more data and can reach the same accuracy as a better or more complex algorithm. In the end, the objective of a data analysis and mining system is to process more data with better algorithms, [15].

In many fields more data is important because it provides a more accurate description of the analyzed phenomenon. With more data, data mining algorithms are able to extract a wider group of influence factors and more subtle influences.

Today, large datasets mean volumes of hundreds of terabytes or petabytes, and these are real scenarios. The problem of storing such datasets stems from the impossibility of fitting them on a single drive and, more importantly, from the large amount of time required to access them. Access speed is limited by disk performance [22], Table 1: internal data transfer, external data transfer, cache memory, access time and rotational latency all generate delays and bottlenecks.

Table 1. Example of disk drive performance (source [22])

Interface        HDD Spindle [rpm]   Avg. rotational latency [ms]   Internal transfer [Mbps]   External transfer [MBps]   Cache [MB]
SATA             7,200               11                             1030                       300                        8 – 32
SCSI             10,000              4.7 – 5.3                      944                        320                        8
High-end SCSI    15,000              3.6 – 4.0                      1142                       320                        8 – 16
SAS              10,000 / 15,000     2.9 – 4.4                      1142                       300                        16

Despite the rapid evolution of drive capacity, described in figure 1, large datasets of up to one petabyte can only be stored on multiple disks; using 4 TB drives requires 250 of them to store 1 PB of data.
Once the storage problem is solved, another question arises: how easy or hard is it to read that data back? Considering an optimal transfer rate of 300 MB/s, the entire 1 PB dataset would be read in 38.5 days. A simple solution is to read from all the disks at once; in this way, the entire dataset is read in only 3.7 hours. If the communication channel is taken into consideration, other bottlenecks are generated by the available bandwidth. In the end, the performance of the solution is reduced to the speed of its slowest component.
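These figures follow directly from the dataset size and the per-disk transfer rate; a minimal sketch of the arithmetic (the 1 PB dataset, the 300 MB/s optimal rate and the 250-disk count are the values from the scenario above):

    public class ReadTimeEstimate {
        public static void main(String[] args) {
            double datasetBytes = 1e15;          // 1 PB
            double rateBytesPerSec = 300e6;      // 300 MB/s optimal transfer rate
            int disks = 250;                     // 250 x 4 TB drives hold 1 PB

            double serialSeconds = datasetBytes / rateBytesPerSec;
            double parallelSeconds = serialSeconds / disks; // ideal case, ignores bandwidth limits

            System.out.printf("Serial read:   %.1f days%n", serialSeconds / 86400);   // ~38.6 days
            System.out.printf("Parallel read: %.1f hours%n", parallelSeconds / 3600); // ~3.7 hours
        }
    }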
Other characteristics of large datasets add supplementary levels of difficulty:
 - many input sources; in different economic and social fields there are multiple sources of information;
 - redundancy, as the same data can be provided by different sources;
 - lack of normalization or data representation standards; data can have different formats, unique IDs and measurement units;
 - different degrees of integrity and consistency; data that describes the same phenomenon can vary in terms of measured characteristics, measuring units, time of the record and methods used.

For limited datasets the efficient data management solution is given by relational SQL databases, [20], but for large datasets some of their founding principles are eroded [3], [20], [21].

3 Proposed Approach
The following solutions answer the question of how to store and process large datasets in an efficient manner that justifies the effort.

3.1 Large Data Sets Storage
When accessing large data sets, the storage file system can become a bottleneck. That is why a lot of thought was put into redesigning the traditional file system for better performance when accessing large data files. Google, one of the biggest search engine service providers, which handles "big web data", has published a paper about the file system it claims to use, called "The Google File System".

From an architectural point of view, the Google File System (GFS) comprises a single master and multiple chunk servers, and is accessed by multiple clients.

Fig. 3. Simplified GFS Architecture (clients send a file name and chunk index to the GFS master and receive a chunk handle and chunk locations; they then send a chunk handle and byte range directly to the chunk servers, which return the chunk data; the master periodically exchanges instructions and state with the chunk servers)

In a distributed approach, the file system aims to achieve the following goals:
 - it should be scalable; the file system should allow additional hardware to be added to increase storing capacity and/or performance;
 - it should offer high performance; the file system should be able to locate the data of interest on the distributed nodes in a timely manner;
 - it should be reliable; the file system should be able to recreate the original data from the distributed nodes in a complete and undistorted manner;
 - it should have high availability; the file system should account for failures and incorporate mechanisms for monitoring, error detection, fault tolerance and automatic recovery.
In the simplified version of the GFS architecture, figure 3, the GFS clients communicate with the master and with the chunk servers. The clients ask for a file, and the GFS master tells them which chunk servers hold the data for that file. The clients then request the data from those chunk servers, which transmit it directly to the clients. No user data actually passes through the GFS master; this keeps the master from becoming a bottleneck in the data transmission. Periodically, the GFS master communicates with the chunk servers to collect their state and to transmit instructions.

Each chunk of data is identified by an immutable and globally unique 64-bit chunk handle, assigned by the master at the time of chunk creation [33]. The chunk size is a key differentiator from more common file systems: GFS uses 64 MB per chunk, which limits the number of requests to the master for chunk locations, reduces the network overhead and reduces the metadata size on the master, allowing the master to keep the metadata in memory.

Hadoop, a software framework derived from Google's papers about MapReduce and GFS, offers a similar file system, called the Hadoop Distributed File System (HDFS). What GFS identifies as the "Master" is called the NameNode in HDFS, and the GFS "Chunk Servers" can be found as "DataNodes" in HDFS.
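From a client's perspective, this block lookup is transparent; a minimal sketch of reading a file through the Hadoop FileSystem API is shown below (the NameNode address and the file path are placeholder assumptions):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

            FileSystem fs = FileSystem.get(conf);

            // The NameNode resolves block locations; the data itself is streamed
            // directly from the DataNodes, never through the NameNode.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }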
3.2 ETL
The Extract, Transform and Load (ETL) process provides an intermediary transformation layer between outside sources and the end target database. In the literature [29], [30] we identified both ETL and ELT. ETL refers to extract, transform and load: activities start with applications that perform data transformations outside of a database, on a row-by-row basis. ELT, on the other hand, refers to extract, load and transform, which implies using the relational database first, before performing any transformations of source data into target data.

The ETL process [32] is based on three elements:
 - Extract – the process in which the data is read from multiple source systems into a single format; in this step data is extracted from the data source;
 - Transform – the step in which the source data is transformed into a format relevant to the solution; the process transforms the data coming from the various systems and makes it consistent;
 - Load – the step in which the transformed data is written into the warehouse.

Usually the systems that acquire data are optimized so that the data is stored as fast as possible. Most of the time, comprehensive analyses require access to multiple sources of data, and it is common for those sources to store raw data that yields minimal information unless properly processed. This is where the ETL or ELT processes come into play. An ETL process will take the data stored in multiple sources, transform it so that the metrics and KPIs are readily accessible, and load it into an environment that has been modeled so that analysis queries are more efficient [23]. An ETL system is part of a bigger architecture that includes at least one Database Management System (DBMS); it is placed upstream from the DBMS because it feeds data directly into the next level.

The advantages of ETL tools are [29], [30], [31]:
 - they save time and costs when developing and maintaining data migration tasks;
 - they can be used for complex processes to extract, transform and load heterogeneous data into a data warehouse, or to perform other data migration tasks;
 - in larger organizations, different data integration and warehouse projects accumulate;
 - such processes encompass common sub-processes, shared data sources and targets, and the same or similar operations;
 - ETL tools support all common databases, file formats and data transformations, simplify the reuse of already created (sub-)processes thanks to a collaborative development platform, and provide central scheduling;
 - portability: usually the ETL code can be developed on a specific target database and ported later to other supported databases.
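To make the three steps concrete, here is a minimal, self-contained sketch of one ETL pass; the CSV layout, file names and the normalization rule are illustrative assumptions, not part of the original paper:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class MiniEtl {
        public static void main(String[] args) throws IOException {
            // Extract: read raw rows from a source file (assumed layout: country;indicator;value)
            List<String> raw = Files.readAllLines(Paths.get("source.csv"));

            // Transform: normalize to a common format (trim fields, upper-case country codes)
            // and discard inconsistent rows that do not have exactly three fields
            List<String> clean = new ArrayList<>();
            for (String row : raw) {
                String[] f = row.split(";");
                if (f.length != 3) continue; // discard inconsistent records
                clean.add(f[0].trim().toUpperCase() + "," + f[1].trim() + "," + f[2].trim());
            }

            // Load: write the consolidated rows to the warehouse staging area
            Files.write(Paths.get("warehouse_staging.csv"), clean);
        }
    }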
When developing the ETL process there are two options: either take advantage of an existing ETL tool (some key players in this domain are IBM DataStage, Ab Initio and Informatica) or custom code it. Both approaches have benefits and pitfalls that need to be carefully considered when selecting what better fits the specific environment. The benefits of an in-house custom built ETL process are:
 - flexibility; the custom ETL process can be designed to solve requirements specific to the organization that some ETL tools may have limitations with;
 - performance; the custom ETL process can be finely tuned for better performance;
 - tool agnosticism; the custom ETL process can be built using skills available in-house;
 - cost efficiency; custom ETL processes usually use resources already available in the organization, eliminating the additional costs of licensing an ETL tool and training the internal resources to use it.

3.3 MapReduce, Hadoop and HBase
MapReduce (MR) is a programming model and an associated implementation for processing and generating large data sets, [18]. The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google. The foundations of the MapReduce model are a map function used to process key-value pairs and a reduce function that merges all intermediate values of the same key. The large data set is split into smaller subsets, which are processed in parallel by a large cluster of commodity machines.

The Map function [27] takes input data and produces a set of intermediate subsets. The MapReduce library groups together all intermediate subsets associated with the same intermediate key and sends them to the Reduce function. The Reduce function also accepts an intermediate key and its subsets; it merges these subsets to form a possibly smaller set of values. Normally, zero or one output value is produced per Reduce invocation.

In [26] it is highlighted that many real-world tasks use the MapReduce model: web search services, sorting and processing data, data mining, machine learning and a large number of other systems. The entire framework manages how data is split among nodes and how intermediary results are aggregated. A general MapReduce architecture is illustrated in Figure 4.

Fig. 4. MapReduce architecture (based on [28])
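As an illustration of the two functions, the canonical word-count example in the Hadoop Java API is sketched below; job wiring is omitted, and the input and output paths would be supplied when configuring the Job:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in the input line
    class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups values by key; sum them into one count per word
    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }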
The MR advantages are [1], [6], [7], [18], [20], [24], [27]:
 - the model is easy to use, even for programmers without experience with parallel and distributed systems;
 - storage-system independence, as it does not require proprietary database file systems or predefined data models; data is stored in plain text files and is not required to respect relational data schemes or any structure; in fact, the architecture can use data that has an arbitrary format;
 - fault tolerance;
 - the framework is available from high-level programming languages; one such solution is the open-source Apache Hadoop project, which is implemented in Java;
 - the query language allows record-level manipulation;
 - projects such as Pig and Hive [34] provide a rich interface that allows programmers to join datasets without repeating simple MapReduce code fragments.

Hadoop is a distributed computing platform, an open-source implementation of the MapReduce framework proposed by Google [28]. It is based on Java and uses the Hadoop Distributed File System (HDFS). HDFS is the primary storage system used by Hadoop applications; it creates multiple replicas of data blocks for reliability, distributes them around the cluster and splits the task into small blocks. The relationship between Hadoop, HBase and HDFS is illustrated in Figure 5.

HBase is a distributed database: an open-source, distributed, versioned, column-oriented database modeled after Google's Bigtable [25].

Fig. 5. The relationship between Hadoop, HBase and HDFS (based on [28]; both Hadoop jobs and HBase take their input from, and write their output to, HDFS)

Some of the features of HBase, as listed at [25], are:
 - convenient base classes for backing Hadoop MapReduce jobs with HBase tables, including Cascading, Hive and Pig source and sink modules;
 - query predicate push down via server-side scan and get filters;
 - optimizations for real-time queries;
 - a Thrift gateway and a REST-ful Web service that supports XML, Protobuf and binary data encoding options;
 - an extensible JRuby-based (JIRB) shell;
 - support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.

HBase stores data in labeled tables. In this context [28], a table is designed to have a sparse structure: data is stored in table rows, and each row has a unique key with an arbitrary number of columns.
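A minimal sketch of this row model through the HBase Java client API is shown below; the table name "indicators", the column family "d" and the row key layout are illustrative assumptions, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("indicators"))) {

                // Rows are sparse: this row holds only the columns we give it
                Put put = new Put(Bytes.toBytes("RO-2011-GDP")); // unique row key
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("131.5"));
                table.put(put);

                Get get = new Get(Bytes.toBytes("RO-2011-GDP"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value"))));
            }
        }
    }

Note that this uses the current HBase client classes rather than the older HTable API that was common when the paper was written.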
3.4 Parallel database systems
A distributed database (DDB) is a collection of multiple, logically interconnected databases distributed over a computer network. A distributed database management system (distributed DBMS) is the software system that permits the management of the distributed database and makes the distribution transparent to the users. A parallel DBMS is a DBMS implemented on a multiprocessor computer [21]. The parallel DBMS implements the concept of horizontal partitioning [24] by distributing parts of a large relational table across multiple nodes to be processed in parallel. This requires a partitioned execution of the SQL operators. Some basic operations, like a simple SELECT, can be executed independently on all the nodes; more complex operations are executed through a multiple-operator pipeline. Different multiprocessor parallel system architectures [21], like shared-memory, shared-disks or shared-nothing, define possible strategies to implement a parallel DBMS, each with its own advantages and drawbacks. The shared-nothing approach distributes data across independent nodes and has been implemented by many commercial systems, as it provides extensibility and availability.

Based on the above definitions, we can conclude that parallel database systems improve the performance of data processing by parallelizing loading, indexing and querying data. In distributed database systems, data is stored in different DBMSs that can function independently. Because parallel database systems may distribute data to increase the architecture's performance, there is a fine line that separates the two concepts in real implementations.

Despite the differences between parallel and distributed DBMSs, most of their advantages are common to a simple DBMS, [20], [35]:
 - stored data conforms to a well-defined schema; this validates the data and provides data integrity;
 - data is structured in a relational paradigm of rows and columns;
 - SQL queries are fast;
 - the SQL query language is flexible, easy to learn and read, and allows programmers to implement complex operations with ease;
 - hash or B-tree indexes speed up access to data;
 - they can efficiently process datasets of up to two petabytes of data.

Known commercial parallel databases such as Teradata, Aster Data, Netezza [9], DATAllegro, Vertica, Greenplum, IBM DB2 and Oracle Exadata have proven successful because they:
 - allow linear scale-up, [21]; the system can maintain constant performance as the database size increases by adding more nodes to the parallel system;
 - allow linear speed-up, [21]; for a database with a constant size, the performance can be increased by adding more components, like processors, memory and disks;
 - implement inter-query, intra-query and intra-operation parallelism, [21];
 - reduce the implementation effort;
 - reduce the administration effort;
 - offer high availability.

In a massively parallel processing (MPP) architecture, adding more hardware allows for more storage capacity and increases query speeds.
An MPP architecture, implemented as a data warehouse appliance, reduces the implementation effort, as the hardware and software are preinstalled and tested to work on the appliance prior to the acquisition. It also reduces the administration effort, as it comes as a single-vendor, out-of-the-box solution. Data warehouse appliances offer high availability through built-in fail-over capabilities, using data redundancy for each disk.

Ideally, each processing unit of the data warehouse appliance should process the same amount of data at any given time. To achieve that, the data should be distributed uniformly across the processing units. Data skew is a measure that evaluates how data is distributed across the processing units: a data skew of 0 means that the same number of records is placed on each processing unit, and a data skew of 0 is ideal. By having each processing unit do the same amount of work, all processing units finish their task at about the same time, minimizing any waiting times.

Another aspect with an important impact on query performance is having all the related data on the same processing unit; this eliminates the time required to transfer data between processing units. For example, if the user requires a sales-by-country report, having both the sales data for a customer and his geographic information on the same processing unit ensures that the processing unit has all the information it needs and that each processing unit can perform its task independently.

The way data is distributed across the parallel database nodes influences the overall performance. Though the power of the parallel DBMS is given by the number of nodes, this can also be a drawback: for simple queries, the actual processing time can be much smaller than the time needed to launch the parallel operation. Also, nodes can become hot spots or bottlenecks, as they delay the entire system.
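A common way to pursue both uniform distribution and co-location is to hash-partition records on a key that keeps related rows together; a minimal sketch follows (the customer-key choice, the sample keys and the skew formula used here are illustrative assumptions, as several skew conventions exist):

    import java.util.Arrays;

    public class PartitionSkew {
        // The same customer always hashes to the same unit, co-locating related rows
        static int unitFor(String customerKey, int units) {
            return Math.floorMod(customerKey.hashCode(), units);
        }

        public static void main(String[] args) {
            int units = 4;
            long[] recordsPerUnit = new long[units];
            String[] sampleKeys = {"C001", "C002", "C003", "C004", "C005", "C006"};
            for (String key : sampleKeys) {
                recordsPerUnit[unitFor(key, units)]++;
            }

            // A simple skew indicator: deviation of the fullest unit from the mean
            double mean = Arrays.stream(recordsPerUnit).average().orElse(0);
            long max = Arrays.stream(recordsPerUnit).max().orElse(0);
            System.out.printf("skew = %.2f (0 means perfectly uniform)%n",
                    mean == 0 ? 0 : (max - mean) / mean);
        }
    }

Whatever the exact convention, the paper's point holds: the closer the skew measure is to 0, the more evenly the work is spread across the processing units.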
4 Proposed architecture
The proposed architecture is used to process large financial datasets. The results of the data mining analysis help economists identify patterns in economic clusters that validate existing economic models or help define new ones. The bottom-up approach is more efficient, but more difficult to implement, because it can suggest, based on real economic data, relations between economic factors that are specific to the cluster model. The difficulty comes from the large volume of economic data that needs to be analyzed.

The architecture, described in figure 6, has three layers:
 - the input layer implements data acquisition processes; it gets data from different sources, reports, data repositories and archives, which are managed by governmental and public structures, economic agencies and institutions, and NGO projects; some global sources of statistical economic data are Eurostat, the International Monetary Fund and the World Bank; the problem with these sources is that they use independent data schemes, and bringing them to a common format is an intensive data processing stage; taking into consideration national data sources, or crawling the Web for free data, the task becomes a very complex one;
 - the data layer stores and processes large datasets of economic and financial records; this layer implements distributed, parallel processing;
 - the user layer provides access to data and manages requests for analysis and reports.

The ETL intermediary layer, placed between the first two main layers, collects data from the data crawler and harvester component, converts it into a new form and loads it into the parallel DBMS data store. The ETL normalizes data, transforms it based on a predefined structure and discards unneeded or inconsistent information. The ETL layer inserts data into the parallel distributed DBMS that implements the Hadoop and MapReduce framework. The objective of the layer is to normalize data and bring it to the common format requested by the parallel DBMS. Using an ETL process, data collected by the data crawler & harvester gets consolidated, transformed and loaded into the parallel DBMS, using a data model optimized for data retrieval and analysis.

It is important that the ETL server also supports parallel processing, allowing it to transform large data sets in a timely manner. ETL tools like Ab Initio, DataStage and Informatica have this capability built in. If the ETL server does not support parallel processing, then it should just define the transformations and push the processing to the target parallel DBMS.

Fig. 6. Proposed architecture

The user will submit his inquiries through a front-end application server, which will convert them into queries and submit them to the parallel DBMS for processing. The front-end application server will include a user-friendly metadata layer that allows the user to query the data ad-hoc; it will also include canned reports and dashboards.
The objective of the proposed architecture is to separate the layers that are data-processing intensive and to link them by ETL services that act as data buffers or cache zones; the ETL will also support the transformation effort. For the end user the architecture is completely transparent: all he will experience is the look and feel of the front-end application.

Based on extensive comparisons between MR and parallel DBMSs, [6], [20], [24], we conclude that there is no all-scenarios good solution for large-scale data analysis, because:
 - both solutions can be used for the same processing task; any parallel processing task can be implemented either as a combination of queries or as a set of MR jobs;
 - the two approaches are complementary; each solution gives better performance over the other in particular data scenarios, and you must decide which approach saves time for the data processing task at hand;
 - the process of configuring and loading data into the parallel DBMS is more complex and time consuming than setting up the MR architecture; one reason is that the DBMS requires complex schemas to describe data, whereas MR can process data in an arbitrary format;
 - the performance of the parallel DBMS is given by how finely the system is tuned; the system must be configured according to the tasks it needs to complete and to the available resources;
 - common MR implementations take full advantage of the reduced complexity by processing data with a simple structure, because the entire MR model is built on the key-value pair format; an MR system can be used to process more complex data, but the input data structure must be integrated in a custom parser in order to obtain appropriate semantics; not relying on a commonly recognized data structure has another drawback, as data is not validated by default by the system, which can lead to situations in which modified data violates integrity or other constraints; in contrast, the SQL query language used by any parallel DBMS takes full advantage of the data schema in order to obtain a full description of the data, and the same schema is used to validate data;
 - DBMSs use B-tree indexes to achieve fast searching times; indexes can be defined on any attributes and are managed by the system; the MR framework does not implement this built-in facility, and similar functionality must be implemented by the programmers, who control the data fetching mechanism;
 - DBMSs provide high-level querying languages, like SQL, which are easy to read and write; instead, MR uses code fragments, seen as algorithms, to process records; projects such as Pig and Hive [34] provide a rich interface based on high-level programming languages that allows programmers to reuse code fragments;
 - parallel DBMSs have proven to be more efficient in terms of speed, but they are more vulnerable to node failures.

A decision on which approach to take must consider:
 - performance criteria;
 - the internal structure of the processed data;
 - the available hardware infrastructure;
 - maintenance and software costs.

A combined MR-parallel DBMS solution [36] can be a possibility, as it benefits from the advantages of each approach.

5 Conclusion
Processing large datasets obtained from multiple sources is a daunting task, as it requires tremendous storing and processing capacities. Processing and analyzing large volumes of data also becomes non-feasible using a traditional serial approach; distributing the data across multiple processing units and processing it in parallel yields linearly improved processing speeds.
propriate semantics; not relaying on a When distributing the data is critical that common recognized data structure has an- each processing unit is allocated the same other drawback as data is not validated by number of records and that all the related da- default by the system; this can conduct to ta sets reside on the same processing unit. situations in which modified data violates Using a multi-layer architecture to acquire, integrity or other constraints; in contrast, transform, load and analyze the data, ensures the SQL query language used by any par- that each layer can use the best of bread for allel DBMS, takes full advantage of the its specific task. For the end user, the experi-
For the end user, the experience is transparent: despite the number of layers behind the scenes, all that is exposed to him is a user-friendly interface supplied by the front-end application server. In the end, once the storing and processing issues are solved, the real problem is to search for relationships between different types of data [3]. Others have done this very successfully, like Google in Web searching or Amazon in e-commerce.

Acknowledgment
This work was supported by the European Social Fund through the Sectoral Operational Programme Human Resources Development, project number POSDRU/89/1.5/S/59184, "Performance and excellence in postdoctoral research in Romanian economics science domain".

References
[1] T. White, Hadoop: The Definitive Guide, O'Reilly, 2009
[2] G. Bruce Berriman, S. L. Groom, "How Will Astronomy Archives Survive the Data Tsunami?," ACM Queue, Vol. 9, No. 10, 2011, http://queue.acm.org/detail.cfm?id=2047483
[3] P. Helland, "If You Have Too Much Data, then "Good Enough" Is Good Enough," ACM Queue, Vol. 9, No. 5, 2011
[4] Apache Hadoop, http://en.wikipedia.org/wiki/Hadoop
[5] Apache Software Foundation, Apache Hadoop, http://wiki.apache.org/hadoop/FrontPage
[6] J. Dean, S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Communications of the ACM, Vol. 53, No. 1, 2010
[7] M. C. Chu-Carroll, Databases are hammers; MapReduce is a screwdriver, Good Math, Bad Math, 2008, http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
[8] Cisco, Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2010–2015, http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html
[9] D. Henschen, "New York Stock Exchange Ticks on Data Warehouse Appliances," InformationWeek, 2008, http://www.informationweek.com/news/software/bi/207800705
[10] IBM, IBM 350 disk storage unit, http://www-03.ibm.com/ibm/history/exhibits/storage/storage_profile.html
[11] Wikipedia, History of IBM magnetic disk drives, http://en.wikipedia.org/wiki/History_of_IBM_magnetic_disk_drives
[12] Wikipedia, History of hard disk drives, http://en.wikipedia.org/wiki/History_of_hard_disks
[13] I. Smith, Cost of Hard Drive Storage Space, http://ns1758.ca/winch/winchest.html
[14] Seti@home, SETI project, http://setiathome.berkeley.edu/
[15] A. Rajaraman, More data usually beats better algorithms, parts I and II, 2008, http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
[16] J. Good, How many photos have ever been taken?, Sep 2011, http://1000memories.com/blog/
[17] J. Bennett, S. Lanning, "The Netflix Prize," Proceedings of KDD Cup and Workshop 2007, San Jose, California, 2007
[18] J. Dean, S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, Vol. 51, No. 1, pp. 107–113, 2008
[19] S. Messenger, Meet the world's most powerful weather supercomputer, 2009, http://www.treehugger.com/clean-technology/meet-the-worlds-most-powerful-weather-supercomputer.html
[20] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden and M. Stonebraker, "A comparison of approaches to large-scale data analysis," Proceedings of the 2009 ACM SIGMOD International Conference, 2009
[21] M. Tamer Özsu, P. Valduriez, "Distributed and Parallel Database Systems," ACM Computing Surveys, Vol. 28, 1996, pp. 125–128
[22] Seagate, Performance Considerations, http://www.seagate.com/www/en-us/support/before_you_buy/speed_considerations
[23] P. Vassiliadis, "A Survey of Extract-Transform-Load Technology," International Journal of Data Warehousing & Mining, Vol. 5, No. 3, pp. 1–27, 2009
[24] M. Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes," Communications of the ACM, Vol. 53, No. 1, pp. 64–71, 2010
[25] Apache HBase, http://hbase.apache.org/
[26] H. Yang, A. Dasdan, R. Hsiao, D. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," SIGMOD 2007, ACM, pp. 1029–1040
[27] J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI '04: 6th Symposium on Operating Systems Design and Implementation, http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf
[28] X. Yu, "Estimating Language Models Using Hadoop and HBase," MSc Artificial Intelligence, University of Edinburgh, 2008, http://homepages.inf.ed.ac.uk/miles/msc-projects/yu.pdf
[29] ETL Architecture Guide, http://www.ipcdesigns.com/etl_metadata/ETL_Architecture_Guide.html
[30] W. Dumey, A Generalized Lesson in ETL Architecture, Durable Impact Consulting, Inc., June 11, 2007, http://www.scribd.com/doc/55883818/A-Generalized-Lesson-in-ETL-Architecture
[31] A. Albrecht, "METL: Managing and Integrating ETL Processes," VLDB '09, August 24–28, 2009, Lyon, France, http://www.vldb.org/pvldb/2/vldb09-1051.pdf
[32] R. Davenport, ETL vs ELT, Insource IT Consultancy, Insource Data Academy, June 2008, http://www.dataacademy.com/files/ETL-vs-ELT-White-Paper.pdf
[33] S. Ghemawat, H. Gobioff and Shun-Tak Leung, "The Google File System," http://www.cs.brown.edu/courses/cs295-11/2006/gfs.pdf
[34] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," SIGMOD '08, pp. 1099–1110, 2008
[35] S. Pukdesree, V. Lacharoj and P. Sirisang, "Performance Evaluation of Distributed Database on PC Cluster Computers," WSEAS Transactions on Computers, Issue 1, Vol. 10, January 2011, pp. 21–30, ISSN 1109-2750
[36] A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz and A. Rasin, "HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads," Proceedings of the Conference on Very Large Databases, 2009
[37] C. Boja, A. Pocovnicu, "Distributed Parallel Architecture for Storing and Processing Large Datasets," 11th WSEAS Int. Conf. on Software Engineering, Parallel and Distributed Systems (SEPADS '12), Cambridge, UK, February 22–24, 2012, ISBN 978-1-61804-070-1, pp. 125–130
Catalin BOJA is Lecturer at the Economic Informatics Department of the Academy of Economic Studies in Bucharest, Romania. In June 2004 he graduated from the Faculty of Cybernetics, Statistics and Economic Informatics at the Academy of Economic Studies in Bucharest. In March 2006 he graduated from the Informatics Project Management master program organized by the Academy of Economic Studies of Bucharest. He is a team member in various ongoing university research projects, where he has applied most of his project management knowledge. He has also received a type D IPMA certification in project management from the Romanian Project Management Association, a partner of the IPMA organization. He is the author of more than 40 journal articles and scientific presentations at conferences. His work focuses on the analysis of data structures, assembler and high-level programming languages. He holds a PhD degree on software optimization and on the improvement of software application performance.

Adrian POCOVNICU is a PhD Candidate at the Academy of Economic Studies. His main research areas are multimedia databases, information retrieval, multimedia compression algorithms and data integration. He is a Data Integration Consultant for ISA Consulting, USA.

Lorena BĂTĂGAN graduated from the Faculty of Cybernetics, Statistics and Economic Informatics in 2002. She became a teaching assistant in 2002 and has been a university lecturer since 2009 at the Faculty of Cybernetics, Statistics and Economic Informatics of the Academy of Economic Studies. She received a PhD degree in Economic Cybernetics and Statistics in 2007. She is the author and co-author of 4 books and over 50 articles in journals and proceedings of national and international conferences and symposiums.