<Client – Confidential>




  Architecture Proposal
         Prepared By

        Bernard Dedieu
Table of Contents
1. Background
2. Problem Statement
3. Proposed Architecture—High-Level
     a. 100GB-scale Data Volume
     b. Log Files as Data Source
     c. Customer-facing OLAP
4. Proposed Architecture—Low-Level
     a. Hadoop
     b. Data Marts
          i. One vs. Many
          ii. Brand of RDBMS
     c. Reporting Portal
     d. Hardware
     e. Java Programming
5. Data Anomaly Detection
6. Data Integration/Importation and Data Quality Management
7. Summary
Appendix A. Hadoop Overview
     MapReduce
     Map
     Reduce
     Hadoop Distributed File System (HDFS)
8. Query Optimization
9. Access and Data Security
10. Internal Management and Collaboration Tools
11. Salesforce and Force.com Integration
12. Roadmap



1. Background
<Company presentation and background – Confidential>

2. Problem Statement

In terms of database load, the number of sites is the best metric, since it describes the number …. It is therefore very important that the web application remain effective as the company grows (this includes the database, the framework, and the architecture of the servers). Also, as the company grows in …, it will need to deploy a server in Europe to manage ...

In addition, the historical data will be kept, and as the number of ... grows, the data volume will grow exponentially. The overall database architecture therefore needs to be highly and easily scalable.

It is also more than likely that, as the solution's price decreases, bigger corporations will be interested in the ... solution. Therefore, the ... solution will need to be integrated into existing information systems.

This will require:
• To interface the ... solution with existing applications.
• To have the ... solution rely on standard and open technologies.
• To build partnerships with System Integrators or build an internal Professional Services organization to support these customers.

   With its current, somewhat limited database schema, the data warehouse’s millions of records consume
   more than 2GB of disk space, including indexes. Extensions to the data warehouse schema, coupled with
a growing customer base, will easily push the data warehouse volume beyond 100GB. The single-instance, multi-schema MySQL database architecture simply does not provide the scalability necessary to
   meet ... demands.

   In addition to these scalability problems, the reporting infrastructure is also limited in its potential for
   enhanced functionality. For instance, ... would like to extend the Reporting Portal to provide customers
   with ad-hoc, multi-dimensional query capability and custom reporting based on searchable attribute tags in
   the data warehouse. At present, the data warehouse dimensions do not provide the flexibility needed to
   easily accommodate these kinds of changes.

Therefore, ... has a pressing need to replace its current reporting infrastructure with a scalable, flexible architecture that can not only accommodate its growing data volumes but also dramatically extend its reporting functionality. Key goals for the new infrastructure include:
• Redundant, efficient retention of historical detail
     o Write once, read many
     o Compression
     o No encryption required
     o ANSI-7 single-byte code page is sufficient
• Linear scalability (i.e., as data volume increases, performance is not degraded)
• Flexible extensibility (e.g., attributes can easily be added and exposed to customers for reporting, either as dimensional attributes or fact attributes)




• Full OLAP support
     o Standard reports
     o Custom reports
     o Ad-hoc query
     o Multi-dimensional
     o Hierarchical categories (i.e., tagging, snowflakes)
     o Charts and graphs
     o Drill-down to atomic detail (i.e., ... log)
     o 24x7 availability
     o Query response time measured in seconds (not minutes)
• Efficient ETL
     o Near real time (i.e., < 15 minutes)
     o Handles fluctuating volumes throughout the day without becoming a bottleneck (which can cause synchronization problems in the data warehouse)
• Partitioning of data by customer

This new architecture must deliver vastly improved functionality while controlling implementation cost and rollout time.

3. Proposed Architecture—High-Level
   From an architectural perspective, there are three overarching factors driving the technical solution for
   ... reporting needs:

               a. 100GB-scale Data Volume
Due to their sheer size, large applications like the ... data warehouse require more resources than
               can typically be served by a single, cost-effective machine. Even if a large, expensive server
               could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a
               single machine could provide the continuous, uninterrupted operation needed to meet ... SLAs.
               A cloud computing architecture, on the other hand, is an economical, scalable solution that
               provides seamless fault tolerance for large data applications.

                b. Log Files as Data Source
               More and more organizations are seeking to leverage the rich content in their verbose log files
               to drive business intelligence. Sourcing from log files presents a different set of challenges
               compared to selecting data out of a highly structured OLTP database. Efficient, robust, and
flexible parsing routines must be programmed to identify tagged attributes and map these to
               business constructs in the data warehouse. And because log files tend to consume lots of disk
               space, they should ideally be stored in a distributed file system in order to load balance I/O and
               improve fault tolerance.
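
To illustrate the kind of parsing routine involved, here is a minimal sketch that extracts tagged attributes from a single log line. The key=value tag format and the sample attribute names are assumptions for illustration only, since the actual ... log layout is not specified here.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal sketch of a log-parsing routine. The key=value tag format
 *  and the attribute names used below are hypothetical placeholders. */
public class LogLineParser {

    // Matches tags of the form key=value, separated by whitespace.
    private static final Pattern TAG = Pattern.compile("(\\w+)=(\\S+)");

    /** Extracts all tagged attributes from one log line into a map. */
    public static Map<String, String> parse(String line) {
        Map<String, String> attributes = new HashMap<String, String>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            attributes.put(m.group(1), m.group(2));
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> attrs =
            parse("2009-10-01T12:00:00 session=abc123 status=COMPLETED duration=42");
        System.out.println(attrs.get("status")); // prints COMPLETED
    }
}

In practice, each extracted attribute would then be mapped to the corresponding dimension or fact in the data warehouse schema.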

                c. Customer-facing OLAP
               The stakes are usually higher when building and maintaining a customer-facing business
intelligence solution, as opposed to one that is implemented internally. ... reputation and
               marketability depend in part on its customers’ opinions of the Reporting Portal. It must be
               intuitive, easy to use, powerful, secure, and available anytime. Its data should be as fresh as
               possible, while providing historical data for trend analyses. Customers should have seamless
               access to both aggregated metrics and ... log detail. The Reporting Portal should expose the
               customizability of the speech application through its reports. Any customer-specific categories,
               tags, and data content should be faithfully reflected in the Reporting Portal, just as the customer
               would expect to see them.

Based on these driving factors, we propose a cloud computing architecture comprising a distributed file
system, distributed file processing, one or more relational data marts, and a browser-based OLAP
package (see Figure 1). Most of this infrastructure will be built using open source software
technologies running on commodity hardware. This strategy keeps initial implementation costs low for
a right-sized solution, while providing a path for scalable growth.

Figure 1. High-Level Architecture

[Diagram: ... logs → Hadoop Distributed File System (HDFS) → Relational Data Mart(s) → Reporting Portal]

• ... logs are retained forever (or as otherwise specified per customer requirements).
• ... logs are immediately replicated into HDFS and can be retained indefinitely.
• Any portion of historical data can be read from Hadoop and aggregated as needed into optimized reporting database(s).
• Reports, ad-hoc queries, graphs, and charts are presented via browser-based software.


In this design, Apache Hadoop (http://hadoop.apache.org/) is used to perform some of the functions
normally provided by a relational data warehouse. Most specifically, Hadoop behaves as the system of
record, storing all of the historical detail generated by the Speech Applications. New ... logs are
immediately replicated into the Hadoop Distributed File System (HDFS), which is massively scalable to
accommodate virtually any amount of data. HDFS is based on Google’s GFS, which essentially stores
the content of the Web in order to facilitate index generation. Other well-known companies that store
huge volumes of data in HDFS include Yahoo!, AOL, Facebook, and Amazon. Hadoop is free to
download and install. It uses a cloud computing architecture (i.e., lots of inexpensive computers linked
together, sharing workload), so it can be easily and economically extended as needed to scale for
growth. Scaling performance is linear; performance does not degrade as you increase data volume.

Hadoop cannot fulfill all of the functions of a data warehouse, though. For instance, it does not contain
indexes like a relational database, so it can't truly be optimized to return query results quickly. On the other hand, Hadoop provides a very powerful, distributed job-processing technology called MapReduce, which can perform
much of the extract and transform work that is commonly done by ETL tools. Therefore, Hadoop
powerfully augments ... business intelligence architecture by using distributed storage and processing
to perform the data warehousing functions that would otherwise be the hardest to scale under a
traditional, single-machine, relational data warehouse architecture.

While Hadoop does the "heavy lifting," other, more traditional technologies are used to provide familiar
business intelligence functionality. Relational data marts serve up optimized OLAP database schemas
(e.g., highly indexed star schemas) for querying via standard business intelligence tools. One defining
factor of a data mart is that it can be completely truncated and reloaded from the upstream data
repository (in this case, Hadoop) as needed. This means that if ... needs to enhance the reporting
database design by altering a dimension or adding new metrics, the data mart's schema can be altered, even dramatically, and repopulated without the risk of losing any historical data. It's also worth noting that because the Hadoop repository stores all historical detail, it is possible to retroactively back-populate new metrics that are added to the data mart(s).

As of this writing, it is not known how much data volume must be accommodated in a given data mart.
And we don’t yet know whether one data mart would suffice, or if there would be many data marts.
These questions will influence the choice of relational database management system (RDBMS) that is
selected for .... For example, MySQL is cheap to procure and implement, but has serious scalability
limitations. A columnar MPP database like ParAccel is ideal for handling multi-terabyte data volumes,
but comes with a price tag. One advantage of this proposed architecture, though, is that the data marts
can be migrated from one technology to another without risk of losing valuable data.

The customer-facing front end technology should be a mature, fully-supported product like
BusinessObjects or MicroStrategy. Such technologies are rich with features that would otherwise be
very costly to develop in-house, even with open source Java libraries. Besides, the customers who use
this interface should not become quality assurance testers for internally developed user interfaces. The
Reporting Portal is a marketed service and as such, must leave customers with a great impression.

4. Proposed Architecture—Low-Level
This section provides an in-depth look at each component of Figure 1 above.

           a. Hadoop
           Hadoop is an extremely powerful open source technology that does certain things very well, like
           store immense volumes of data and perform distributed computations on that data. Some of
           these strengths can be leveraged within the context of a business intelligence application.

           For instance, several of the functions that would normally be performed within a traditional data
           warehouse could be taken up by Hadoop. One defining feature of a data warehouse is that it
           stores historical data. While source systems may only keep a rolling window of recent data, the
           data warehouse retains all or most of the history. This frees up the transactional systems to
           efficiently run the business, while keeping a historical system of record in the data warehouse.

HDFS is ideal for archiving large volumes of static data, such as ... ... logs. HDFS provides linear scalability as data volumes increase. Not only can HDFS easily handle the ... forever-retention requirement, but it could also permit ... to retain all of its history. HDFS comfortably scales into the petabyte range, so the need to age out and purge files could be eliminated altogether.

Hadoop is a perfect solution for historicity problems because it easily scales to petabyte sizes simply by configuring additional hardware into the cluster.
           Another benefit of HDFS is its data redundancy. HDFS replicates file blocks across nodes,
           which can physically reside in the same data center or in another data center (assuming the
           VPN bandwidth supports it). This would entirely eliminate the need for ... to copy zipped ... log
           files between data centers (see Figure 2).
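
As a minimal sketch of this ingestion path (also depicted in Figure 2 below), a Java program could archive each new log file into HDFS through Hadoop's FileSystem API. The paths and the replication factor shown are illustrative assumptions, not settings from the actual environment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: copy a newly written ... log file into HDFS for permanent,
 *  replicated storage. All paths and settings are assumptions. */
public class LogArchiver {
    public static void main(String[] args) throws Exception {
        // In a real deployment the cluster address and replication factor
        // would come from the Hadoop configuration files.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // redundant copies across nodes/racks
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path(args[0]);                       // a just-closed log file
        Path archive = new Path("/logs/" + local.getName());  // hypothetical HDFS layout
        fs.copyFromLocalFile(false, true, local, archive);    // keep source, overwrite dest
        System.out.println("Archived " + local + " to " + archive);
    }
}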




Figure 2. ... Log-Hadoop Architecture

[Diagram: A Java program reads each VXML … log and writes it into HDFS for permanent storage; MapReduce jobs (see Figures A-1 and A-2) then process the stored files. The Hadoop Distributed File System (HDFS) can be configured to transparently replicate data across racks and across data centers, providing redundant failover copies of all file blocks.]

Although business intelligence solutions depend on lots of data, business users are interested in information. In order to transform large volumes of raw data into meaningful business metrics, calculations must be performed, business rules must be applied, and large numbers of data elements must be summarized into a few figures.

Traditionally, this type of aggregation work is done outside of the data warehouse by an extract, transform, and load (ETL) tool, or within the data warehouse using stored procedures and materialized views. Due to the inherent constraints imposed by a relational database system like MySQL, there are limits to how much data can reasonably be aggregated this way. As source data volumes increase, the time required to perform aggregations can extend beyond the point in time when the resulting metrics are needed by the customers.

Hadoop is able to perform these kinds of aggregations much more quickly on large data volumes because it
distributes the processing across many computers, each one crunching the numbers for a subset of the
source data. Consequently, aggregated metrics that might have taken days to calculate in a traditional
data warehouse model can be churned out by Hadoop in a couple of hours or even minutes.

MapReduce is particularly well-suited to structured data sets like ... ... logs. Tagged attributes map easily to key/value pairs, which are the transactional unit of MapReduce jobs (see Figure A-1 in the appendix). ... ETL routines could therefore be replaced with Java MapReduce jobs that read ... log files from HDFS and write to the data marts (see Figure 3).
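
The sketch below illustrates such a job: it counts ... log records per a hypothetical status tag. The tag name, the input format, and writing plain text output (rather than loading the data mart over JDBC) are all simplifying assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch of a MapReduce ETL job that aggregates ... log records by a
 *  hypothetical "status" tag. A real job would emit whatever dimensional
 *  and fact attributes the data mart requires. */
public class StatusCountJob {

    public static class StatusMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text status = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Map: parse the tagged attribute out of each log line and
            // emit it as a key/value pair.
            for (String token : line.toString().split("\\s+")) {
                if (token.startsWith("status=")) {
                    status.set(token.substring("status=".length()));
                    ctx.write(status, ONE);
                }
            }
        }
    }

    public static class StatusReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text status, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            // Reduce: aggregate the per-line counts into one metric per
            // status value; a production job might instead write these
            // rows into the data mart via JDBC.
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            ctx.write(status, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "status-count");
        job.setJarByClass(StatusCountJob.class);
        job.setMapperClass(StatusMapper.class);
        job.setReducerClass(StatusReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS log directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // aggregated output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}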




Figure 3. Hadoop MapReduce Architecture


                                                                                     Other
                                                                                     Tools
                                                                                     …


                                                                                       Relational




                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        or
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        R
                                        e
                                        e
                                        e
                                        e
                                        e
                                        e
                                        e
                                        e
                                        e
                                        d
                                        e
                                        d
                                        e
                                        d
                                        e
                                        d
                                        e
                                        d
                                        e
                                        d
                                        e
                                        d
                                        d
                                        d
                                        d
                                        d
                                        d
                                        d
                                        d
                                        d
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                        c
                                         S       S       S                            Data Mart(s)
       RJ          R                    KpV KpV KpV
                                           V KV KV
                                           V KV KV
                                        K al K al K al
                                           V e V K al
                   T                    Klial
                                           V           e V
                                                       e V
                                        e al e lial S lial
       N
       ao          a
                   T                    e KSV e al K e al
                                        e tal e tal e tal
                                           al y u e u
                                           u Vy u y u  V
                   a                    y K Vy uK y u
                                           u y u hu
                                           uhu
                                        y K al         V
                                                       y
                                                       V
       T
      cka
        b          T
                  ck
                   a                    y e al 2 e K 1 u
                                           uV
                                           uV
                                           e al
                                           e 4 ee 2 e
                                                       y
                                                       V
                                           e al 2 e e 7 u
                                                       al
                                                       y
                                        5 Kffl 3 e Kffl e
                                                       V
                                                       al e
                                                       al
       T           s
                   T                    6 eRu 6 eRV e
                                           e al
                                        8 e al
                                           e u
                                           y u 7e 8
                                           e
                                           0
                                                       V
                                                       al e
                                                       4 1
                                                       al 2
                                                       u
                                           Aeu 8 y eal 3
                                                                            JDBC
       ma          a                                   u
                                                       u

                                         or
                                         or
                                         or
                                         or
                                         or
                                         or
                                         or
                                         or
                                         or
                                         or
       T                                   y eu 9 y eal 4
                                           B
                                         R
                                         R
                                         R
                                         R
                                         R
                                         R
                                         R
                                         R
                                         R
                                         R
                   s
                                         e
                                         e
                                         e
                                         e
                                         e
                                         e
                                         e
                                         d
                                         e
                                         d
                                         e
                                         d
                                         e
                                         d
                                         d
                                         d
                                         d
                                         d
                                         d
                                         d
                                                       al 5
                                         c
                                         c
                                         c
                                         c
                                         c
                                         c
                                         c
                                         c
                                         c
                                         c
       T           k
                   T                       Cu
                                           y e
                                           D           u
                                                       u
                                                       e
        a          a                       1 e
                                           y e      y u
                                                    2 e
       Te
        sr         s
                   k
                   T                       3Ae
                                           5d0e     4de
                                                    yAu
                                                    6 3
                                                       u
                                                       e
                                                       e
                                                       e
       Na
        s          a
                   s                       7 A
                                            nd
                                              1
                                              e
                                              6     8 e
                                                     nd
                                                       7
                                                       8
                                                       e
                                                       4
        k HDFS T
        a
        a          k
                   a                        uB2      uC9
        s
        k          s
                   kr                      MapReduce
                                            S        S5D
       To          T
        c(Figure A-2)
        s          sr                        c        c
                                           (Figure A-1)
       Tk          a
                   k
                   T                        ort
                                             e       ort
                                                      e
        d
        kr         a
                   kr
       T r         c
                   Tr                       T        T         Java programs execute
       Te
        a
        e          a
                   c
                   T                         a        a
        ar         k
                   ar                                          MapReduce jobs to extract and
        c
    The ar
         r         c
         entire history of ... logs is permanentlysstored
                   kr                        s
        c          e
                   a
                   c                         k        k        transform any subset of ... log data,
        k
        a          k
                   e
                   a
    in Hadoop, making it possible to back-populate
        c
        k         r/
                   c
                   k                                           and then write the aggregated
    newe metrics with old data, perform year-over-
        c
        kBI       D
                   e
                  r/
                   c
                   k                                           results into the relational data marts
        e          e
        r/trend reports, and manually mine data as
    yeark         r/
        e         Dk
                   a
                   e                                           via JDBC.
        r/
        D
    needed.       r/
                  D
        e
        r/       t
                r/ a
                   e
        D
        a       D
                a
        r/
        D       a
                Dt
                r/
        at      at
        D       a
There are also quite a few maturing open source tools that can provide analysts with direct access to Hadoop data. For instance, a tool like Hive can be used as a SQL-like interface into Hadoop, permitting analysts to run queries in much the same way that they would access a traditional data warehouse. These tools might be useful to ... personnel who want to perform analyses that are not immediately available through the Reporting Portal. Such tools are best suited for more technically literate analysts who are comfortable writing their own queries and do not require fast query response time.
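Purely as an illustration of that SQL-like access, the snippet below runs an ad-hoc aggregation through Hive’s JDBC driver. The driver class, connection URL, port, credentials, and the page_views table are assumptions that will vary with the Hive version and the eventual schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveAdHocQuery {
        public static void main(String[] args) throws Exception {
            // Driver class and URL depend on the Hive release in use.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hadoop-gateway:10000/default", "analyst", "");
                 Statement stmt = conn.createStatement();
                 // page_views is a hypothetical Hive table over the raw logs.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT view_date, COUNT(*) FROM page_views GROUP BY view_date")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

Hive compiles such a query into MapReduce jobs behind the scenes, which is exactly why response time is measured in minutes rather than seconds.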

Cloudera (http://www.cloudera.com/) recently unveiled its browser-based Cloudera Desktop product.
This tool simplifies some of the work required to set up, execute, and monitor MapReduce jobs. For the
more technically inclined analysts in ... organization, Cloudera Desktop might be a good fit—even better
than one of the SQL emulators like Hive. Cloudera Desktop’s main features include:
    • File Browser – Navigate the Hadoop file system
    • Job Browser – Examine MapReduce job states
    • Job Designer – Create MapReduce job designs
    • Cluster Health – At-a-glance state of the Hadoop cluster

It is also possible to use Hadoop’s MapReduce to generate “canned reports” in batch processing mode. That is, nightly batch jobs can be scheduled to produce static reports. These reports would consume data directly from Hadoop, and the resulting content could be pre-formatted for presentation via HTML. Such reports would effectively bypass the relational data marts altogether.
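As a rough sketch of that bypass path, a nightly job could read the aggregated MapReduce output directly from HDFS and emit a static HTML table. The paths and the two-column output layout below are assumptions for illustration.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CannedHtmlReport {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Assumed location of the nightly aggregation job's output.
            Path results = new Path("/reports/daily-log-counts/part-r-00000");
            try (BufferedReader in = new BufferedReader(
                         new InputStreamReader(fs.open(results)));
                 PrintWriter out = new PrintWriter("daily_report.html", "UTF-8")) {
                out.println("<html><body><table border='1'>");
                out.println("<tr><th>Date</th><th>Events</th></tr>");
                String line;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t"); // MapReduce's default separator
                    out.println("<tr><td>" + kv[0] + "</td><td>" + kv[1] + "</td></tr>");
                }
                out.println("</table></body></html>");
            }
        }
    }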




b. Data Marts
          Stated simply, Hadoop can make an excellent contribution as a component of a business
          intelligence solution, but it cannot be the whole solution. A key limitation is that a data
          warehouse is indexed to provide fast query response time, while Hadoop data is not. A data
          warehouse (or data mart) typically contains pre-aggregated metrics in order to deliver selected
          results as fast as possible (i.e., without re-aggregating on the fly). Therefore, a gating factor in
          deciding whether to run analytic queries and reports against Hadoop is the end user’s
          expectation for response time. Since ... customers expect and deserve immediate to near-
          immediate query performance, directly querying Hadoop is not a viable design for the Reporting
          Portal.

          It’s also worth noting here that most of the mature, industry-standard OLAP tools like
          BusinessObjects and MicroStrategy cannot be coupled directly with Hadoop. Therefore, the ...
          reporting infrastructure will still require a traditional, relational, indexed data store containing pre-
          aggregated metrics.

          This data store is rightly called a data mart, because it is not the historical repository of detailed
          data, or system of record. All of its content can be regenerated at any time from the upstream
          data source.

          ... has two basic architectural decisions to make with regard to the data mart. First is whether to
          create one data mart or multiple data marts. The second decision is which brand of RDBMS to
          implement.

                  i. One vs. Many
                  There are several compelling reasons to implement multiple, separate data marts.
                 One reason is performance. The less data you cram into a relational database, the
                 faster it generally performs. There can be exceptions to this rule (like ParAccel’s Analytic
                 Database), but relational databases are usually more responsive with smaller data
                 volumes.

                  A second motivation for splitting ... data into multiple marts is security. It’s certainly quite possible to implement robust security within a single relational database instance, but physically separating each customer’s data ensures that they cannot see one another’s content. However, it is strongly recommended that ... not rely solely on physical separation to enforce data security. There might be situations in which it is not economical to store lots of small customers’ data separately. ... should retain the option to co-mingle multiple customers’ data in one database instance, while ensuring privacy to each of them.
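                  As a hedged sketch of how that privacy could be enforced in application code when customers share an instance, every data mart query can be routed through a helper that always binds the caller’s customer ID, so the filter can never be omitted. The table and column names here are illustrative only.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public final class MetricsDao {
        private final Connection conn;

        public MetricsDao(Connection conn) {
            this.conn = conn;
        }

        // All reads go through this method, so the customer_id predicate
        // cannot be forgotten even when customers' rows are co-mingled.
        // The caller is responsible for closing the returned ResultSet.
        public ResultSet dailyMetrics(long customerId, String fromDate)
                throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT metric_date, metric_value "
                  + "FROM daily_metrics "
                  + "WHERE customer_id = ? AND metric_date >= ?");
            ps.setLong(1, customerId);
            ps.setString(2, fromDate);
            return ps.executeQuery();
        }
    }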




Figure 4. Multiple Data Marts
[Figure: the system of record, which contains all historical detail, feeds separate relational data marts for Customer A, Customer B, Customer C, and so on.]
           A third reason for implementing multiple data marts is customizability. It’s quite possible
           that Customer A might require different kinds of metrics from what Customer B needs.
           One data mart would have to be all things to all customers, making it horribly complex.
           The turnaround time required to add customer-specific metrics would be greatly
           improved by hosting them in a dedicated data mart.

           Having multiple data marts would be very similar to ... current reporting architecture,
           which uses dedicated MySQL schemas to partition customer data.

            ii. Brand of RDBMS
           There are several factors influencing ... choice of relational database management
           system. The primary factor will likely be data volume, which itself is influenced by many
           factors (e.g., data model, historical timeframe, individual customer’s ... log volume).
           Therefore, within the context of this proposal, it is not possible to accurately estimate
           data sizing. Instead, we can provide some basic guidance for future reference.

           From our experience, relatively small volumes (i.e., 10s of GB or less) can be
           comfortably accommodated by MySQL. Medium volumes (up to 100s of GB) are better
            served by Microsoft SQL Server or Oracle. Large volumes (100s of GB to TB-scale)
           require a columnar MPP database like ParAccel Analytical Database, Netezza, Teradata,
           Exadata, or Vertica.

            In addition to data volumes, ... will likely consider cost. MySQL is free, while other products can cost hundreds of thousands of dollars to purchase. The cost of a given RDBMS may also depend in part on the hardware needed to support it. Some RDBMS
           products only run on certain brands of hardware. Clearly, this can have far-reaching
           ramifications for ... costs of operations. We recommend that ... choose database
           software that can run on any Intel-powered, rackable server. Such hardware will provide
           the most economical scalability path.




Table 1. RDBMS Recommendations

    Data Volume           Brand                         Notes
    Up to 10s of GB       MySQL                         Free, but doesn’t scale well
    Up to 100s of GB      Microsoft SQL Server          Good value for money, easy to run on commodity hardware
    100s of GB to TB      ParAccel Analytic Database    Powerful, hardware-flexible, negotiable pricing model

            c. Reporting Portal
           ... next generation Reporting Portal could provide its customers with a greatly expanded set of features if its front end is replaced with an industry-standard business intelligence tool like BusinessObjects or MicroStrategy.
           The choice of such a tool will be driven essentially by how ... customers’ needs evolve and, more importantly, by whether ... begins to acquire larger corporations with existing IT architectures as clients.
           In the short to medium term, an open source tool such as DataVision (http://datavision.sourceforge.net) would be a perfectly good solution, allowing custom reports to be produced easily and the results to be generated in XML format.
           XML makes report distribution largely operating-system agnostic; the only requirement is the ability to read XML files on whatever platform the reports need to be viewed.

           These web-based tools leverage the power of metadata to enforce security and map business
           metrics to back-end data structures. A metadata-based tool flexibly supports business
           abstractions like categories and hierarchies that are not inherent to the physical data. Business
           intelligence tools offer a rich presentation layer capable of displaying the graphs, charts, and
           pivot tables that business users have come to expect from reporting interfaces.

           Figure 5. Browser-based Front-end
           [Figure: the relational data marts feed a BI web server, backed by a BI metadata repository, inside the ... network; customers’ browsers connect over the Internet. A vendor-supported business intelligence application provides a richly featured, web-based interface: customers can run standard and custom reports and ad-hoc queries, generate charts and graphs, save results to Excel, etc.]




           By leveraging a mature front-end technology, ... gains the advantage of reducing its internal Java development effort, while giving its customers a greatly expanded set of reporting and OLAP functionality. There are many products on the market, some cheaper and less mature than the long-standing industry leaders, BusinessObjects XI 3.1 and MicroStrategy 9. Our recommendation to ... is to be willing to invest in this customer-facing component, since it is what makes the strongest impression on its end users.

           d. Hardware
           All of the technologies outlined thus far will run quite well on the type of hardware that ...
           currently uses to serve the Reporting Portal’s data warehouse. ... could purchase several more
           of the rackable Dell PowerEdge 2950 server trays running Windows Server 2003 and array
           them as a Hadoop cluster, data mart hosts, or web servers. Operational considerations like
           data center space and power notwithstanding, this hardware choice would preserve ... current
           SOE (standard operating environment), and minimize retraining of operations staff.

            e. Java Programming
           One reason that the Hadoop technology was selected is the high degree of skill and experience
           that ... personnel have with Java programming. As discussed earlier, interfaces into and out of
           Hadoop will most likely be coded in Java. These interfaces would likely be designed,
           developed, tested, and supported by ... personnel. At first blush, this statement might raise
           concerns about the cost of hand-coding data interfaces, versus buying a vendor-supported
           product. However, there are currently no data integration products available on the market to
           perform these tasks. Furthermore, if an off-the-shelf data integration (ETL) tool like Informatica
           PowerCenter could be purchased, it would still require expensive consulting services to
           implement and support. Net net, programming these interfaces in Java is actually a very logical
           choice for ....
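
           As a minimal sketch of one such hand-coded interface, assuming a hypothetical daily_metrics table and tab-separated aggregates copied out of HDFS to a local file, a Java loader could batch-insert the results into the data mart:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class DataMartLoader {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://datamart-host/reporting", "loader", "secret");
                 BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO daily_metrics (metric_date, metric_value) "
                       + "VALUES (?, ?)")) {
                conn.setAutoCommit(false);
                String line;
                int pending = 0;
                while ((line = in.readLine()) != null) {
                    String[] kv = line.split("\t"); // MapReduce output: key <TAB> value
                    ps.setString(1, kv[0]);
                    ps.setLong(2, Long.parseLong(kv[1]));
                    ps.addBatch();
                    if (++pending % 1000 == 0) {
                        ps.executeBatch(); // flush every 1,000 rows
                    }
                }
                ps.executeBatch();
                conn.commit();
            }
        }
    }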

5. Data Anomaly Detection
In addition, thanks to its extensive analytics capabilities and performance, Hadoop supports several kinds of deep analysis that can define data anomaly patterns, then detect and report them within minutes.
You will find attached several documents describing different anomaly-detection approaches. There is also a great deal of information available on the Hadoop wiki, such as http://wiki.apache.org/hadoop/Anomaly_Detection_Framework_with_Chukwa, which describes the Chukwa framework for detecting anomalies.
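Purely as an illustration of the simplest such pattern, deviation from a trailing baseline, the sketch below flags any daily metric value that falls more than three standard deviations from the mean of the previous 30 days. The window size and threshold are assumptions that would need tuning against real data.

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class SimpleAnomalyDetector {
        private static final int WINDOW = 30;        // trailing days in the baseline
        private static final double THRESHOLD = 3.0; // z-score cut-off

        private final Deque<Double> window = new ArrayDeque<>();

        // Returns true when the new daily value deviates sharply from the
        // trailing baseline (computed before the value joins the window).
        public boolean isAnomalous(double value) {
            boolean anomalous = false;
            if (window.size() == WINDOW) {
                double mean = 0;
                for (double v : window) mean += v;
                mean /= WINDOW;
                double variance = 0;
                for (double v : window) variance += (v - mean) * (v - mean);
                double stdDev = Math.sqrt(variance / WINDOW);
                anomalous = stdDev > 0
                        && Math.abs(value - mean) / stdDev > THRESHOLD;
                window.removeFirst(); // slide the window forward
            }
            window.addLast(value);
            return anomalous;
        }
    }

In practice such a detector would run as a MapReduce job over each customer’s metric series, writing flagged dates to a report table.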

6. Data integration/importation and Data Quality Management
As an alternative to hand-coding ETL features on Hadoop, Cloudera (the commercial vendor behind an open source Hadoop distribution) and Talend (an open source ETL tool: Extract, Transform, and Load) recently announced a technology partnership: http://www.cloudera.com/company/press-center/releases/talend_and_cloudera_announce_technology_partnership_to_simplify_processing_of_large_scale_data.
Talend is the recognized market leader in open source data management.
Talend’s solutions and services minimize the cost and maximize the value of data integration, ETL, data quality, and master data management.
We highly recommend using Talend as the dedicated tool for data integration, ETL, and data quality.




7. Summary
Based on key factors like terabyte-scale data volumes, log files as data source, and customer-facing
OLAP, the optimal architecture for ... Reporting Portal infrastructure comprises a cloud computing
model with distributed file storage; distributed processing; optimized, relational data marts; and an
industry-leading, web-based, metadata-driven business intelligence package. The cloud computing
architecture affords ... virtually unlimited, linear scalability that can grow economically with demand.
Relational data marts ensure excellent query performance and low-risk flexibility for adding metrics,
changing reporting hierarchies, etc.




Appendix A. Hadoop Overview
Due to their sheer size, large applications like ...’s data warehouse require more resources than can
typically be served by a single, cost-effective machine. Even if a large, expensive server could be
configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine
could provide the continuous, uninterrupted operation needed by today’s full-time applications. The
Hadoop open-source framework—or Hadoop Common, as it is now officially known—is a Java cloud
computing architecture designed as an economical, scalable solution that provides seamless fault
tolerance for large data applications.

Hadoop is a top-level Apache Software Foundation project, being built and used by a community of
contributors from all over the world. As such, Hadoop is not a vendor-supported software package. It is a
development framework that requires in-depth programming skills to implement and maintain. Therefore,
an organization that chooses to deploy Hadoop will need to employ skilled personnel to maintain the
cluster, program MapReduce jobs, and develop input/output interfaces.

Hadoop Common runs applications on large, high-availability clusters of commodity hardware. It
implements a computational paradigm named MapReduce, where the application is divided into many
small fragments of work, each of which may be executed on any node in the cluster. In addition, Hadoop
Common provides a distributed file system (HDFS) that stores data on the compute nodes, providing very
high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node
failures are automatically handled by the framework.

   MapReduce
   Hadoop supports the MapReduce parallel processing model, which was introduced by Google as a
   method of solving a class of petabyte-scale problems with large clusters of inexpensive machines.
   MapReduce is a programming paradigm that expresses a large distributed computation as a sequence
   of distributed operations on data sets of key/value pairs. The Hadoop MapReduce framework
   harnesses a cluster of machines and executes user defined MapReduce jobs across the nodes in the
   cluster. A MapReduce computation has two phases, a map phase and a reduce phase (see Figure A-1
   below).

       Map
       In the map phase, the framework splits the input data set into a large number of fragments and
       assigns each fragment to a map task. The framework also distributes the many map tasks across
       the cluster of nodes on which it operates. Each map task consumes key/value pairs from its
       assigned fragment and produces a set of intermediate key/value pairs. For each input key/value
       pair (K,V), the map task invokes a user defined map function that transmutes the input into a
       different key/value pair (K',V').

        Following the map phase, the framework sorts the intermediate data set by key and produces a set
       of (K',V'*) tuples so that all the values associated with a particular key appear together. It also
       partitions the set of tuples into a number of fragments equal to the number of reduce tasks.
       Reduce
       In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For
       each such tuple it invokes a user-defined reduce function that transmutes the tuple into an output
       key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the
       cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each
       reduce task.
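
        To make the (K,V) → (K',V') → (K',V'*) notation concrete, here is a toy, single-machine rendering of the three steps in Java, with the shuffle/sort simulated by an in-memory sorted map. The sample records are invented; Hadoop's value is that it distributes exactly this logic across a cluster.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class MapReduceToy {
        public static void main(String[] args) {
            String[] records = {
                "2009-10-01 pageA", "2009-10-01 pageB", "2009-10-02 pageA"
            };

            // Map phase: each record (K,V) becomes an intermediate pair (K',V').
            // Shuffle/sort: the sorted map groups every V' under its K'.
            Map<String, List<Integer>> grouped = new TreeMap<>();
            for (String record : records) {
                String date = record.split(" ")[0]; // user-defined map function
                grouped.computeIfAbsent(date, k -> new ArrayList<>()).add(1);
            }

            // Reduce phase: each (K',V'*) tuple collapses to an output pair (K,V).
            for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
                int total = 0;
                for (int v : e.getValue()) total += v; // user-defined reduce function
                System.out.println(e.getKey() + "\t" + total);
            }
        }
    }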




Tasks in each phase are executed in a fault-tolerant manner. If node(s) fail in the middle of a
computation, the tasks assigned to them are re-distributed among the remaining nodes. Having many
map and reduce tasks enables efficient load balancing and allows failed tasks to be re-run with small
runtime overhead.

The Hadoop MapReduce framework has a master/slave architecture comprising a single master server
or JobTracker and several slave servers or TaskTrackers, one per node in the cluster. The master
node manages the execution of jobs, which involves assigning small chunks of a large problem to many
nodes. The master also monitors node failures and substitutes other nodes as needed to pick up
dropped tasks. The JobTracker is the point of interaction between users and the framework. Users
submit MapReduce jobs to the JobTracker, which puts them in a queue of pending jobs and executes
them on a first-come, first-served basis. The JobTracker manages the assignment of map and reduce
tasks to the TaskTrackers. The TaskTrackers execute tasks upon instruction from the JobTracker and
also handle data motion between the Map and Reduce phases.




Figure A-1. MapReduce Model
[Figure: the input data set is split into fragments of records; each map task turns its records into intermediate key/value pairs; a shuffle-and-sort step groups all values that share a key; and the reduce tasks collapse each group into the records of the output data set.]


Hadoop Distributed File System (HDFS)
Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across clustered
machines. It is inspired by the Google File System (GFS). HDFS sits on top of the native operating
system’s file system and stores each file as a sequence of blocks. All blocks in a file except the last
block are the same size. Blocks belonging to a file are replicated across machines for fault tolerance.
The block size and replication factor are configurable per file. Files in HDFS are "write once, read
many" and have strictly one writer at any time.


Like Hadoop MapReduce, HDFS follows a master/slave architecture, made up of a robust master node
and multiple data nodes (see Figure A-2 below). An HDFS installation consists of a single NameNode,
a master server that manages the file system namespace and regulates access to files by clients. In
addition, there are a number of DataNodes, one per node in the cluster, which manage storage
attached to the nodes that they run on. The NameNode makes file system namespace operations like
opening, closing, and renaming of files and directories available via an RPC interface. It also
determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and
write requests from file system clients. They also perform block creation, deletion, and replication upon
instruction from the NameNode.
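
As a small illustration of those per-file settings, the following sketch writes a file through Hadoop’s Java FileSystem API with an explicit replication factor and block size. The path, the replication factor of 3, and the 64 MB block size are assumptions matching common defaults.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf); // NameNode address comes from the config

            Path file = new Path("/logs/2009-10-01/events.log");
            short replication = 3;              // block copies, set per file
            long blockSize = 64L * 1024 * 1024; // 64 MB blocks, set per file
            int bufferSize = 4096;

            // create() asks the NameNode for block placements;
            // the DataNodes then store and replicate the bytes.
            try (FSDataOutputStream out =
                    fs.create(file, true, bufferSize, replication, blockSize)) {
                out.writeUTF("write once, read many");
            }
        }
    }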

Figure A-2. HDFS Model
[Figure: a client reaches the cluster through a 1 Gbit core switch, and each rack sits behind a 100 Mbit switch. One rack hosts the JobTracker and NameNode master processes alongside TaskTracker/DataNode slaves; the other racks are filled with TaskTracker/DataNode machines.]




8. Query Optimization
Our recommendation is to take a deep dive into the worst-performing queries, focusing on the ones that run frequently.
Moreover, moving most of the analytics from the MySQL production database to Hadoop will reduce both the data volume in, and the load on, the MySQL database, which should by itself yield a performance improvement.

9. Access and Data Security
During our discussions it was mentioned that some effort would be needed to better protect and encrypt the URLs used to access the different website pages.
In addition, we have suggested that, going forward, the data themselves be secured through encryption.

10.    Internal Management and Collaboration tools
Sales Force appears to be the recommended choice given its numerous management and collaboration features. It includes all the capabilities required: contact management, project management and time tracking, technical support management, etc.




Sales Force Professional costs $65/user/month, i.e., $3,900 (€2,846) per year for 5 users.




11.    Sales Force and Force.com integration
In addition, Sales Force offers a complete API platform named Force.com that allows features to be integrated into your existing platform.
In the future, this API will provide an easy way to add new capabilities to the ... application, such as mobile device support, interfaces to existing applications via AppExchange, real-time analytics, and more.




12.    Roadmap
Hadoop installation and configuration takes no more than two days for one person (see the “Building and Installing Hadoop-MapReduce” PDF file).

We recommend taking the design phase seriously in order to build strong foundations for your future architecture.

Your customer data mart should take no more than a month for a full implementation.
For your internal data mart, the implementation time will depend on how deep you want to go into analytics; however, with the experience gained from implementing the customer data mart, it should not take longer than a month.

Of course, we’ll be able to assist you as needed to follow up on your future architecture implementation.

Cloudera also provides several services around Hadoop:
Professional Services (http://www.cloudera.com/hadoop-services)
   Best practices for setting up and configuring a cluster suitable to run Cloudera’s Distribution for Hadoop:
    • Choice of hardware, operating system, and related systems software
    • Configuration of storage in the cluster, including ways to integrate with existing storage repositories
    • Balancing compute power with storage capacity on nodes in the cluster
   A comprehensive design review of your current system and your plans for Hadoop:
    • Discovery and analysis sessions aimed at identifying the various data types and sources streaming into your cluster
    • Design recommendations for a data-processing pipeline that addresses your business needs
   Operational guidance for a cluster running Hadoop, including:
    • Best practices for loading data into the cluster and for ensuring locality of data to compute nodes
    • Identifying, diagnosing, and fixing errors in Hadoop and the site-specific analyses our customers run
    • Tools and techniques for monitoring an active Hadoop cluster
    • Advice on the integration of MapReduce job submission into an existing data-processing pipeline, so Hadoop can read data from, and write data to, the analytic tools and databases our customers already use
    • Guidance on the use of additional analytic or developmental tools, such as Hive and Pig, that offer high-level interfaces for data evaluation and visualization
   Hands-on help in developing Hadoop applications that deliver the data-processing and analysis you need.
   How to connect Hadoop to your existing IT infrastructure. We can help with moving data between Hadoop and data warehouses; collecting data from file systems, document repositories, logging infrastructure, and other sources; and setting up existing visualization and analytic tools to work with Hadoop.
   Performance audits of your Hadoop cluster, with tuning recommendations for speed, throughput, and response times.




Training (http://www.cloudera.com/hadoop-training)
Cloudera offers numerous on-line training resources and live public sessions:
Developer Training and Certification
       Cloudera offers a three-day training program targeted toward developers who want to learn how
       to use Hadoop to build powerful data processing applications.
       Over three days, this course will assume only a casual understanding of Hadoop and teach you
       everything you need to know to take advantage of some of the most powerful features. We’ll get
       into deep details about Hadoop itself, but also devote ample time for hands-on exercises,
       importing data from existing sources, working with Hive and Pig, debugging MapReduce and
       much more. A full agenda is on the registration page. This course includes the certification exam
       to become Cloudera Certified Hadoop Developer.
Sysadmin Training and Certification
       Systems administrators need to know how Hadoop operates in order to deploy and manage
       clusters for their organizations. Cloudera offers a two-day intensive course on Hadoop for
       operations staff. The course describes Hadoop’s architecture, covers the management and
       monitoring tools most commonly used to oversee it, and provides valuable advice on setting up,
       maintaining and troubleshooting Hadoop for development and production systems. This course
       includes the certification exam to become Cloudera Certified Hadoop Administrator.
HBase Training
       Use HBase as a distributed data store to achieve low-latency queries and highly scalable
       throughput. HBase training covers the HBase architecture, data model, and Java API as well as
       some advanced topics and best practices. This training is for developers (Java experience is
       recommended) who already have a basic understanding of Hadoop.





 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Último (20)

React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

2. Problem Statement

In terms of database load, the number of sites is the best metric, since it describes the number of …. It is therefore very important that the web application remain effective as the company grows (this includes the database, the framework, and the server architecture). Also, as the company grows in …, a server will need to be deployed in Europe to manage .... In addition, because historical data will be kept and the number of ... will grow, the data volume will grow exponentially. The overall database architecture therefore needs to be highly and easily scalable.

It is also more than likely that, as the solution's price decreases, bigger corporations will become interested in the ... solution. The ... solution will then need to be integrated into existing information systems. This will require:

• Interfacing the ... solution to existing applications.
• Having the ... solution rely on standard and open technologies.
• Building partnerships with System Integrators, or building an internal Professional Services organization to support these customers.

With its current, somewhat limited database schema, the data warehouse's millions of records consume more than 2GB of disk space, including indexes. Extensions to the data warehouse schema, coupled with a growing customer base, will easily push the data warehouse volume beyond 100GB. The single-instance, multi-schema MySQL database architecture simply does not provide the scalability necessary to meet ... demands.

In addition to these scalability problems, the reporting infrastructure is also limited in its potential for enhanced functionality. For instance, ... would like to extend the Reporting Portal to provide customers with ad-hoc, multi-dimensional query capability and custom reporting based on searchable attribute tags in the data warehouse. At present, the data warehouse dimensions do not provide the flexibility needed to easily accommodate these kinds of changes.

Therefore, ... has a pressing need to replace its current reporting infrastructure with a scalable, flexible architecture that can not only accommodate its growing data volumes, but also dramatically extend its reporting functionality. Key goals for the new infrastructure include:

• Redundant, efficient retention of historical detail
  o Write once, read many
  o Compression
  o No encryption required
  o ANSI-7 single-byte code page is sufficient
• Linear scalability (i.e., as data volume increases, performance is not degraded)
• Flexible extensibility (e.g., attributes can easily be added and exposed to customers for reporting, either as dimensional attributes or fact attributes)
• Full OLAP support
  o Standard reports
  o Custom reports
  o Ad-hoc query
  o Multi-dimensional analysis
  o Hierarchical categories (i.e., tagging, snowflakes)
  o Charts and graphs
  o Drill-down to atomic detail (i.e., ... log)
  o 24x7 availability
  o Query response time measured in seconds (not minutes)
• Efficient ETL
  o Near real time (i.e., < 15 minutes)
  o Handles fluctuating volumes throughout the day without becoming a bottleneck (which can cause synchronization problems in the data warehouse)
• Partitioning of data by customer

This new architecture must deliver vastly improved functionality while controlling implementation cost and time to roll out.

3. Proposed Architecture—High-Level

From an architectural perspective, three overarching factors drive the technical solution for ... reporting needs:

a. 100GB-scale Data Volume

Due to their sheer size, large applications like ...'s data warehouse require more resources than can typically be served by a single, cost-effective machine. Even if a large, expensive server could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine could provide the continuous, uninterrupted operation needed to meet ... SLAs. A cloud computing architecture, on the other hand, is an economical, scalable solution that provides seamless fault tolerance for large data applications.

b. Log Files as Data Source

More and more organizations are seeking to leverage the rich content in their verbose log files to drive business intelligence. Sourcing from log files presents a different set of challenges compared to selecting data out of a highly structured OLTP database. Efficient, robust, and flexible parsing routines must be programmed to identify tagged attributes and map them to business constructs in the data warehouse. And because log files tend to consume a lot of disk space, they should ideally be stored in a distributed file system in order to load-balance I/O and improve fault tolerance.

c. Customer-facing OLAP

The stakes are usually higher when building and maintaining a customer-facing business intelligence solution, as opposed to one that is implemented internally. ... reputation and marketability depend in part on its customers' opinions of the Reporting Portal. It must be intuitive, easy to use, powerful, secure, and available anytime. Its data should be as fresh as possible, while providing historical data for trend analyses. Customers should have seamless access to both aggregated metrics and ... log detail. The Reporting Portal should expose the customizability of the speech application through its reports. Any customer-specific categories, tags, and data content should be faithfully reflected in the Reporting Portal, just as the customer would expect to see them.
Based on these driving factors, we propose a cloud computing architecture comprising a distributed file system, distributed file processing, one or more relational data marts, and a browser-based OLAP package (see Figure 1). Most of this infrastructure will be built using open source software technologies running on commodity hardware. This strategy keeps initial implementation costs low for a right-sized solution, while providing a path for scalable growth.

Figure 1. High-Level Architecture

... logs  ->  Hadoop Distributed File System (HDFS)  ->  Relational Data Mart(s)  ->  Reporting Portal

• ... logs are retained forever (or as otherwise specified per customer requirements).
• ... logs are immediately replicated into HDFS and can be retained indefinitely.
• Any portion of historical data can be read from Hadoop and aggregated as needed into optimized reporting database(s).
• Reports, ad-hoc queries, graphs, and charts are presented via browser-based software.

In this design, Apache Hadoop (http://hadoop.apache.org/) is used to perform some of the functions normally provided by a relational data warehouse. Most specifically, Hadoop behaves as the system of record, storing all of the historical detail generated by the Speech Applications. New ... logs are immediately replicated into the Hadoop Distributed File System (HDFS), which is massively scalable to accommodate virtually any amount of data. HDFS is based on Google's GFS, which essentially stores the content of the Web in order to facilitate index generation. Other well-known companies that store huge volumes of data in HDFS include Yahoo!, AOL, Facebook, and Amazon.

Hadoop is free to download and install. It uses a cloud computing architecture (i.e., many inexpensive computers linked together, sharing workload), so it can be easily and economically extended as needed to scale for growth. Scaling is linear; performance does not degrade as data volume increases. Hadoop cannot fulfill all of the functions of a data warehouse, though. For instance, it does not contain indexes like a relational database, so it cannot truly be optimized to return query results quickly. Hadoop does provide a very powerful, distributed job processing technology called MapReduce, which can perform much of the extract and transform work that is commonly done by ETL tools. Therefore, Hadoop powerfully augments ... business intelligence architecture by using distributed storage and processing to perform the data warehousing functions that would otherwise be the hardest to scale under a traditional, single-machine, relational data warehouse architecture.
While Hadoop does the "heavy lifting," other, more traditional technologies are used to provide familiar business intelligence functionality. Relational data marts serve up optimized OLAP database schemas (e.g., highly indexed star schemas) for querying via standard business intelligence tools. One defining factor of a data mart is that it can be completely truncated and reloaded from the upstream data repository (in this case, Hadoop) as needed. This means that if ... needs to enhance the reporting database design by altering a dimension or adding new metrics, the data mart's schema can be altered—even dramatically—and repopulated without the risk of losing any historical data. It's also worth noting that because the Hadoop repository stores all historical detail, it is possible to retroactively back-populate new metrics that are added to the data mart(s).

As of this writing, it is not known how much data volume must be accommodated in a given data mart. Nor do we yet know whether one data mart would suffice, or whether there would be many data marts. These questions will influence the choice of relational database management system (RDBMS) selected for .... For example, MySQL is cheap to procure and implement, but has serious scalability limitations. A columnar MPP database like ParAccel is ideal for handling multi-terabyte data volumes, but comes with a price tag. One advantage of this proposed architecture, though, is that the data marts can be migrated from one technology to another without risk of losing valuable data.

The customer-facing front-end technology should be a mature, fully supported product like BusinessObjects or MicroStrategy. Such technologies are rich with features that would otherwise be very costly to develop in-house, even with open source Java libraries. Moreover, the customers who use this interface should not become quality assurance testers for internally developed user interfaces. The Reporting Portal is a marketed service and, as such, must leave customers with a great impression.

4. Proposed Architecture—Low-Level

This section provides an in-depth look at each component in Figure 1 above.

a. Hadoop

Hadoop is an extremely powerful open source technology that does certain things very well, like storing immense volumes of data and performing distributed computations on that data. Some of these strengths can be leveraged within the context of a business intelligence application. For instance, several of the functions that would normally be performed within a traditional data warehouse could be taken up by Hadoop.

One defining feature of a data warehouse is that it stores historical data. While source systems may only keep a rolling window of recent data, the data warehouse retains all or most of the history. This frees up the transactional systems to efficiently run the business, while keeping a historical system of record in the data warehouse. HDFS is ideal for archiving large volumes of static data, such as ... ... logs. HDFS provides linear scalability as data volumes increase. Not only can HDFS easily handle ... forever-retention requirement, it could also permit ... to retain all of its history. HDFS comfortably scales into the petabyte range, so the need to age out and purge files could be eliminated altogether. Hadoop is a natural solution for this kind of historical retention problem, because it scales to petabyte sizes simply by configuring additional hardware into the cluster.

Another benefit of HDFS is its data redundancy. HDFS replicates file blocks across nodes, which can physically reside in the same data center or in another data center (assuming the VPN bandwidth supports it). This would entirely eliminate the need for ... to copy zipped ... log files between data centers (see Figure 2).
Figure 2. ... Log–Hadoop Architecture

[Diagram: VXML ... logs flow into an HDFS/MapReduce cluster (see Figures A-1 and A-2 in the appendix). A Java program reads each ... log and writes it into HDFS for permanent storage (a sketch of such a program follows below). The Hadoop Distributed File System (HDFS) can be configured to transparently replicate data across racks and across data centers, providing redundant failover copies of all file blocks.]

Although business intelligence solutions depend on lots of data, business users are interested in information. In order to transform large volumes of raw data into meaningful business metrics, calculations must be performed, business rules must be applied, and large numbers of data elements must be summarized into a few figures.

Traditionally, this type of aggregation work is done outside of the data warehouse by an extract, transform, and load (ETL) tool, or within the data warehouse using stored procedures and materialized views. Due to the inherent constraints imposed by a relational database system like MySQL, there are limits to how much data can reasonably be aggregated this way. As source data volumes increase, the time required to perform aggregations can extend beyond the point in time when the resulting metrics are needed by the customers.

Hadoop is able to perform these kinds of aggregations much more quickly on large data volumes because it distributes the processing across many computers, each one crunching the numbers for a subset of the source data. Consequently, aggregated metrics that might have taken days to calculate in a traditional data warehouse model can be churned out by Hadoop in a couple of hours or even minutes. MapReduce is particularly well suited to structured data sets like ... ... logs. Tagged attributes map easily to key/value pairs, which are the transactional unit of MapReduce jobs (see Figure A-1 in the appendix). ... ETL routines could therefore be replaced with Java MapReduce jobs that read from HDFS ... log files and write to the data marts (see Figure 3, and the job sketch at the end of this section).
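To make the ingestion step pictured in Figure 2 concrete, the following is a minimal sketch of such a Java loader program. The NameNode address, target directory, replication factor, and block size are illustrative assumptions, not ... actual settings.

    // LogLoader.java -- illustrative sketch; host name and paths are hypothetical.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class LogLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical NameNode

            FileSystem fs = FileSystem.get(conf);
            String localLog = args[0];                           // path to a rotated ... log
            Path target = new Path("/logs/" + new File(localLog).getName());

            // Create the HDFS file with a replication factor of 3 and 64MB blocks;
            // HDFS maintains the redundant block copies automatically (see Figure 2).
            InputStream in = new FileInputStream(localLog);
            FSDataOutputStream out =
                fs.create(target, true, 4096, (short) 3, 64L * 1024 * 1024);
            IOUtils.copyBytes(in, out, 4096, true);              // copies and closes both streams
        }
    }

Because block size and replication are set per file at creation time, a loader like this can apply different retention and redundancy policies to different customers' logs without any cluster-wide reconfiguration.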
Figure 3. Hadoop MapReduce Architecture

[Diagram: Java programs execute MapReduce jobs (Figures A-1 and A-2) against the HDFS log repository to extract and transform any subset of ... log data, then write the aggregated results into the relational data mart(s) via JDBC, where other BI tools consume them. The entire history of ... logs is permanently stored in Hadoop, making it possible to back-populate new metrics with old data, perform year-over-year trend reports, and manually mine data as needed.]

There are also quite a few maturing open source tools that can give analysts direct access to Hadoop data. For instance, a tool such as Hive can be used as a SQL-like interface into Hadoop, permitting analysts to run queries in much the same way that they would access a traditional data warehouse. These tools might be useful to ... personnel who want to perform analyses that are not immediately available through the Reporting Portal. Such tools are best suited to more technically literate analysts who are comfortable writing their own queries and do not require fast query response times.

Cloudera (http://www.cloudera.com/) recently unveiled its browser-based Cloudera Desktop product. This tool simplifies some of the work required to set up, execute, and monitor MapReduce jobs. For the more technically inclined analysts in ... organization, Cloudera Desktop might be a good fit—even better than a SQL-like interface such as Hive. Cloudera Desktop's main features include:

• File Browser – Navigate the Hadoop file system
• Job Browser – Examine MapReduce job states
• Job Designer – Create MapReduce job designs
• Cluster Health – At-a-glance state of the Hadoop cluster

It is also possible to use Hadoop's MapReduce to generate "canned reports" in batch processing mode. That is, nightly batch jobs can be scheduled to produce static reports. These reports would consume data directly from Hadoop, and the resulting content could be pre-formatted for presentation via HTML. Such reports would effectively bypass the relational data mart altogether.
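Below is a simplified, illustrative sketch of the kind of Java MapReduce job pictured in Figure 3. The log line format ("customer=...|durationMs=..."), the JDBC URL, and the call_metrics table are hypothetical stand-ins, not ... actual formats; a production job would validate its input, batch its inserts, and keep credentials out of source code. The MySQL JDBC driver is assumed to be on the job classpath.

    // CallMetricsJob.java -- a sketch: sums call duration per customer and
    // writes one aggregate row per customer into a relational data mart.
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class CallMetricsJob {

        // Map: one log line in, one (customer, duration) pair out.
        public static class ParseMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\\|");   // hypothetical format
                String customer = fields[0].split("=")[1];
                long durationMs = Long.parseLong(fields[1].split("=")[1]);
                ctx.write(new Text(customer), new LongWritable(durationMs));
            }
        }

        // Reduce: sum durations per customer and insert the aggregate via JDBC.
        public static class JdbcReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            private Connection db;
            protected void setup(Context ctx) throws IOException {
                try {
                    db = DriverManager.getConnection(
                        "jdbc:mysql://mart-host/reporting", "etl", "secret"); // hypothetical
                } catch (Exception e) { throw new IOException(e); }
            }
            protected void reduce(Text customer, Iterable<LongWritable> durations,
                                  Context ctx) throws IOException {
                long total = 0;
                for (LongWritable d : durations) total += d.get();
                try (PreparedStatement ps = db.prepareStatement(
                        "INSERT INTO call_metrics (customer, total_duration_ms) VALUES (?, ?)")) {
                    ps.setString(1, customer.toString());
                    ps.setLong(2, total);
                    ps.executeUpdate();
                } catch (Exception e) { throw new IOException(e); }
            }
            protected void cleanup(Context ctx) throws IOException {
                try { db.close(); } catch (Exception e) { throw new IOException(e); }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "call-metrics");
            job.setJarByClass(CallMetricsJob.class);
            job.setMapperClass(ParseMapper.class);
            job.setReducerClass(JdbcReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(NullOutputFormat.class);   // results go to JDBC, not HDFS
            FileInputFormat.addInputPath(job, new Path("/logs"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because the job reads the permanent HDFS log archive, re-running it with a modified reducer is all it takes to back-populate a new metric across the entire history, exactly the flexibility Figure 3 is meant to convey.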
b. Data Marts

Stated simply, Hadoop can make an excellent contribution as a component of a business intelligence solution, but it cannot be the whole solution. A key limitation is that a data warehouse is indexed to provide fast query response time, while Hadoop data is not. A data warehouse (or data mart) typically contains pre-aggregated metrics in order to deliver selected results as fast as possible (i.e., without re-aggregating on the fly).

Therefore, a gating factor in deciding whether to run analytic queries and reports against Hadoop is the end user's expectation for response time. Since ... customers expect and deserve immediate to near-immediate query performance, directly querying Hadoop is not a viable design for the Reporting Portal. It's also worth noting that most of the mature, industry-standard OLAP tools like BusinessObjects and MicroStrategy cannot be coupled directly with Hadoop.

Therefore, the ... reporting infrastructure will still require a traditional, relational, indexed data store containing pre-aggregated metrics. This data store is rightly called a data mart, because it is not the historical repository of detailed data, or system of record: all of its content can be regenerated at any time from the upstream data source.

... has two basic architectural decisions to make with regard to the data mart. The first is whether to create one data mart or multiple data marts. The second is which brand of RDBMS to implement.

i. One vs. Many

There are a couple of compelling reasons to implement multiple, separate data marts. One reason is performance. The less data you cram into a relational database, the faster it generally performs. There are exceptions to this rule (like ParAccel's Analytic Database), but relational databases are usually more responsive with smaller data volumes.

A second motivation for splitting ... data into multiple marts is security. It is certainly possible to implement robust security within a single relational database instance, but physically separating each customer's data definitively ensures that they cannot see one another's content. However, it is strongly recommended that ... not rely solely on physical separation to enforce data security. There might be situations in which it is not economical to store lots of small customers' data separately. ... should retain the option to co-mingle multiple customers' data in one database instance, while ensuring privacy to each of them.
Figure 4. Multiple Data Marts

[Diagram: The system of record (Hadoop) contains all historical detail and feeds separate relational data marts for Customer A, Customer B, Customer C, ….]

A third reason for implementing multiple data marts is customizability. It is quite possible that Customer A might require different kinds of metrics from what Customer B needs. One data mart would have to be all things to all customers, making it horribly complex. The turnaround time required to add customer-specific metrics would be greatly improved by hosting them in a dedicated data mart. Having multiple data marts would be very similar to ... current reporting architecture, which uses dedicated MySQL schemas to partition customer data.

ii. Brand of RDBMS

Several factors influence ... choice of relational database management system. The primary factor will likely be data volume, which is itself influenced by many factors (e.g., data model, historical timeframe, individual customers' ... log volume). Therefore, within the context of this proposal, it is not possible to accurately estimate data sizing. Instead, we can provide some basic guidance for future reference. From our experience, relatively small volumes (i.e., tens of GB or less) can be comfortably accommodated by MySQL. Medium volumes (up to hundreds of GB) are better served by Microsoft SQL Server or Oracle. Large volumes (hundreds of GB to TB-scale) require a columnar MPP database like ParAccel Analytic Database, Netezza, Teradata, Exadata, or Vertica.

In addition to data volume, ... will likely consider cost. MySQL is free, while other products can cost hundreds of thousands of dollars to purchase. The cost of a given RDBMS may also depend in part on the hardware needed to support it. Some RDBMS products only run on certain brands of hardware. Clearly, this can have far-reaching ramifications for ... cost of operations. We recommend that ... choose database software that can run on any Intel-powered, rackable server. Such hardware will provide the most economical scalability path.
Table 1. RDBMS Recommendations

  Data Volume         Brand                        Notes
  Up to 10s of GB     MySQL                        Free, but doesn't scale well
  Up to 100s of GB    Microsoft SQL Server         Good value for money; easy to run on commodity hardware
  100s of GB to TB    ParAccel Analytic Database   Powerful, hardware-flexible, negotiable pricing model

c. Reporting Portal

... next-generation Reporting Portal could provide its customers with a greatly expanded set of features if it is replaced with an industry-standard business intelligence tool like BusinessObjects or MicroStrategy. The choice of such a tool will essentially be driven by how ... customers' needs change and, more importantly, by whether ... starts to acquire bigger corporations with existing IT architectures as clients.

In the short and medium term, an open source tool such as DataVision (http://datavision.sourceforge.net) would be a perfect solution, making it easy to produce custom reports and to generate the results in XML format. XML makes report distribution almost operating-system agnostic; the only requirement is that the platform on which the reports are viewed be able to read XML files.

These web-based tools leverage the power of metadata to enforce security and map business metrics to back-end data structures. A metadata-based tool flexibly supports business abstractions like categories and hierarchies that are not inherent to the physical data. Business intelligence tools offer a rich presentation layer capable of displaying the graphs, charts, and pivot tables that business users have come to expect from reporting interfaces.

Figure 5. Browser-based Front-end

[Diagram: Relational data marts and a BI metadata repository feed a BI web server on the ... network, which customers reach over the Internet with a browser. The vendor-supported business intelligence application provides a richly featured, web-based interface; customers can run standard and custom reports, issue ad-hoc queries, generate charts and graphs, save results to Excel, etc.]
By leveraging a mature front-end technology, ... gains the advantage of reducing its internal Java development effort, while giving its customers a greatly expanded set of reporting and OLAP functionality. There are many products on the market, some cheaper and less mature than the long-standing industry leaders, BusinessObjects XI 3.1 and MicroStrategy 9. Our recommendation to ... is to be willing to invest in this customer-facing component so that it reinforces the most appealing impression in its end users.

d. Hardware

All of the technologies outlined thus far will run quite well on the type of hardware that ... currently uses to serve the Reporting Portal's data warehouse. ... could purchase several more of the rackable Dell PowerEdge 2950 server trays running Windows Server 2003 and array them as a Hadoop cluster, data mart hosts, or web servers. Operational considerations like data center space and power notwithstanding, this hardware choice would preserve ... current standard operating environment (SOE) and minimize retraining of operations staff.

e. Java Programming

One reason that the Hadoop technology was selected is the high degree of skill and experience that ... personnel have with Java programming. As discussed earlier, interfaces into and out of Hadoop will most likely be coded in Java. These interfaces would likely be designed, developed, tested, and supported by ... personnel. At first blush, this statement might raise concerns about the cost of hand-coding data interfaces versus buying a vendor-supported product. However, there are currently no data integration products available on the market to perform these tasks. Furthermore, even if an off-the-shelf data integration (ETL) tool like Informatica PowerCenter could be purchased, it would still require expensive consulting services to implement and support. Net net, programming these interfaces in Java is actually a very logical choice for ....

5. Data Anomaly Detection

In addition, thanks to its extensive analytics capabilities and performance, Hadoop makes it possible to run different kinds of deep analysis to define data anomaly patterns, and then to detect and report them in minutes. You will find attached several documents describing different anomaly detection approaches. There is also a good deal of information available on the Hadoop wiki, such as http://wiki.apache.org/hadoop/Anomaly_Detection_Framework_with_Chukwa, which describes the Chukwa framework for detecting anomalies. (A toy standalone sketch of threshold-based anomaly flagging appears just before the Summary below.)

6. Data integration/importation and Data Quality Management

As an alternative to using Hadoop's ETL features, Cloudera (the open source editor of Hadoop) and Talend (an open source Extract-Transform-Load tool vendor) recently announced a technology partnership: http://www.cloudera.com/company/press-center/releases/talend_and_cloudera_announce_technology_partnership_to_simplify_processing_of_large_scale_data. Talend is the recognized market leader in open source data management. Talend's solutions and services minimize the costs and maximize the value of data integration, ETL, data quality, and master data management. We highly recommend using Talend as the dedicated tool for data integration, ETL, and data quality.
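As promised in Section 5, here is a toy, standalone sketch of threshold-based anomaly flagging: any daily call volume more than two standard deviations from the mean is reported. The data series is made up for the example; in practice these statistics would be computed by a MapReduce job over the full log history in HDFS, or via a framework such as Chukwa.

    // AnomalyFlagger.java -- illustrative only; the input series is sample data.
    public class AnomalyFlagger {
        public static void main(String[] args) {
            double[] dailyCalls = {1020, 985, 1130, 990, 1075, 4980, 1010};

            // Mean of the series.
            double mean = 0;
            for (double v : dailyCalls) mean += v;
            mean /= dailyCalls.length;

            // Population standard deviation.
            double variance = 0;
            for (double v : dailyCalls) variance += (v - mean) * (v - mean);
            double stdDev = Math.sqrt(variance / dailyCalls.length);

            // Flag any day more than two standard deviations from the mean
            // (here, only the 4980-call spike is reported).
            for (int day = 0; day < dailyCalls.length; day++) {
                if (Math.abs(dailyCalls[day] - mean) > 2 * stdDev) {
                    System.out.println("Day " + day + " is anomalous: " + dailyCalls[day]);
                }
            }
        }
    }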
7. Summary

Based on key factors like 100GB- to terabyte-scale data volumes, log files as the data source, and customer-facing OLAP, the optimal architecture for ... Reporting Portal infrastructure comprises a cloud computing model with distributed file storage; distributed processing; optimized, relational data marts; and an industry-leading, web-based, metadata-driven business intelligence package. The cloud computing architecture affords ... virtually unlimited, linear scalability that can grow economically with demand. Relational data marts ensure excellent query performance and low-risk flexibility for adding metrics, changing reporting hierarchies, etc.
Appendix A. Hadoop Overview

Due to their sheer size, large applications like ...'s data warehouse require more resources than can typically be served by a single, cost-effective machine. Even if a large, expensive server could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine could provide the continuous, uninterrupted operation needed by today's full-time applications. The Hadoop open-source framework—or Hadoop Common, as it is now officially known—is a Java cloud computing architecture designed as an economical, scalable solution that provides seamless fault tolerance for large data applications.

Hadoop is a top-level Apache Software Foundation project, built and used by a community of contributors from all over the world. As such, Hadoop is not a vendor-supported software package. It is a development framework that requires in-depth programming skills to implement and maintain. Therefore, an organization that chooses to deploy Hadoop will need to employ skilled personnel to maintain the cluster, program MapReduce jobs, and develop input/output interfaces.

Hadoop Common runs applications on large, high-availability clusters of commodity hardware. It implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. In addition, Hadoop Common provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node failures are automatically handled by the framework.

MapReduce

Hadoop supports the MapReduce parallel processing model, which was introduced by Google as a method of solving a class of petabyte-scale problems with large clusters of inexpensive machines. MapReduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase (see Figure A-1 below).

Map

In the map phase, the framework splits the input data set into a large number of fragments and assigns each fragment to a map task. The framework also distributes the many map tasks across the cluster of nodes on which it operates. Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs. For each input key/value pair (K,V), the map task invokes a user-defined map function that transmutes the input into a different key/value pair (K',V'). Following the map phase, the framework sorts the intermediate data set by key and produces a set of (K',V'*) tuples so that all the values associated with a particular key appear together. It also partitions the set of tuples into a number of fragments equal to the number of reduce tasks.

Reduce

In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For each such tuple, it invokes a user-defined reduce function that transmutes the tuple into an output key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each reduce task.
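The canonical word-count example below makes this (K,V) -> (K',V') contract concrete; it is included purely as an illustration of the model and is not part of the proposed ... pipeline.

    // WordCount -- map emits (word, 1) for each word; after the framework's
    // sort/shuffle, reduce receives (word, [1, 1, ...]) and emits (word, count).
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: (K,V) = (byte offset, line of text) -> (K',V') = (word, 1)
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String word : value.toString().split("\\s+")) {
                    if (!word.isEmpty()) ctx.write(new Text(word), ONE);
                }
            }
        }

        // Reduce: (K',V'*) = (word, [1,1,...]) -> (K,V) = (word, total count).
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum));
            }
        }
    }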
Tasks in each phase are executed in a fault-tolerant manner. If nodes fail in the middle of a computation, the tasks assigned to them are redistributed among the remaining nodes. Having many map and reduce tasks enables efficient load balancing and allows failed tasks to be re-run with small runtime overhead.

The Hadoop MapReduce framework has a master/slave architecture comprising a single master server, or JobTracker, and several slave servers, or TaskTrackers, one per node in the cluster. The master node manages the execution of jobs, which involves assigning small chunks of a large problem to many nodes. The master also monitors node failures and substitutes other nodes as needed to pick up dropped tasks.

The JobTracker is the point of interaction between users and the framework. Users submit MapReduce jobs to the JobTracker, which puts them in a queue of pending jobs and executes them on a first-come, first-served basis. The JobTracker manages the assignment of map and reduce tasks to the TaskTrackers. The TaskTrackers execute tasks upon instruction from the JobTracker and also handle data motion between the map and reduce phases.
Figure A-1. MapReduce Model

[Diagram: An input data set of records is split into fragments for the map phase; each map task emits intermediate key/value pairs, which are shuffled and sorted by key in the intermediate phase so that all values for a given key arrive at the same reduce task; the reduce phase then produces the records of the output data set.]

Hadoop Distributed File System (HDFS)

Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across clustered machines. It is inspired by the Google File System (GFS). HDFS sits on top of the native operating system's file system and stores each file as a sequence of blocks. All blocks in a file except the last block are the same size. Blocks belonging to a file are replicated across machines for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once, read many" and have strictly one writer at any time.
Like Hadoop MapReduce, HDFS follows a master/slave architecture, made up of a robust master node and multiple data nodes (see Figure A-2 below). An HDFS installation consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, one per node in the cluster, which manage the storage attached to the nodes they run on. The NameNode makes file system namespace operations like opening, closing, and renaming of files and directories available via an RPC interface. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from file system clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.

Figure A-2. HDFS Model

[Diagram: A client connects through a 1 Gbit switch to two racks, each behind a 100 Mbit switch. One rack hosts the JobTracker and NameNode alongside several TaskTracker/DataNode machines; the other rack hosts additional TaskTracker/DataNode machines.]
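As a small illustration of the block-to-DataNode mapping the NameNode maintains, the sketch below asks the cluster where each block of a file resides, using the standard FileSystem client API. The file path is hypothetical, and the program assumes a Hadoop configuration pointing at the cluster is on the classpath.

    // BlockReport.java -- prints which DataNodes hold each block of a file.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockReport {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus file = fs.getFileStatus(new Path("/logs/sample.log")); // hypothetical

            // Ask the NameNode which DataNodes hold replicas of each block.
            BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
            for (int i = 0; i < blocks.length; i++) {
                System.out.println("Block " + i + " replicas on: "
                    + String.join(", ", blocks[i].getHosts()));
            }
        }
    }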
8. Query Optimization

Our recommendation is to take a deep dive on the worst-performing queries, focusing on the ones that run frequently. Beyond that, moving most of the analytics from the MySQL production database to Hadoop will reduce both the data volume and the load on the MySQL database, which will necessarily improve its performance.

9. Access and Data Security

During our discussions, it was mentioned that some effort would be needed to better protect and encrypt the URLs used to access the different website pages. In addition, we have suggested, for future use, securing the data themselves with encryption.

10. Internal Management and Collaboration tools

Sales Force appears to be the recommended choice, given its numerous management and collaboration features. It includes all the capabilities required: contact management; project management and time tracking; technical support management; and more. Sales Force Professional costs $65/user/month, i.e., $3,900 (2,846€) per year for 5 users.
11. Sales Force and Force.com integration

In addition, Sales Force offers a complete API platform named Force.com that allows new features to be integrated into your existing platform. For future use, this API will provide an easy way to add new features to the ... application, such as mobile device support, interfacing with existing applications through AppExchange, real-time analytics, and more.
12. Roadmap

Hadoop installation and configuration takes no more than two days for one person (see the "Building and Installing Hadoop-MapReduce" PDF file). We recommend taking the design phase seriously, to build strong foundations for your future architecture. Your customer Datamart should take no more than a month for a full implementation. For your internal Datamart, the implementation time will depend on how deep you want to go in analytics; however, with the experience gained from implementing the customer Datamart, it shouldn't take longer than a month. Of course, we will be able to assist you as needed to follow up on your future architecture implementation.

Cloudera also provides different services around Hadoop:

Professional Services (http://www.cloudera.com/hadoop-services)

Best practices for setting up and configuring a cluster suitable to run Cloudera's Distribution for Hadoop:
• Choice of hardware, operating system, and related systems software
• Configuration of storage in the cluster, including ways to integrate with existing storage repositories
• Balancing compute power with storage capacity on nodes in the cluster

A comprehensive design review of your current system and your plans for Hadoop:
• Discovery and analysis sessions aimed at identifying the various data types and sources streaming into your cluster
• Design recommendations for a data-processing pipeline that addresses your business needs

Operational guidance for a cluster running Hadoop, including:
• Best practices for loading data into the cluster and for ensuring locality of data to compute nodes
• Identifying, diagnosing, and fixing errors in Hadoop and the site-specific analyses our customers run
• Tools and techniques for monitoring an active Hadoop cluster
• Advice on the integration of MapReduce job submission into an existing data-processing pipeline, so Hadoop can read data from, and write data to, the analytic tools and databases our customers already use
• Guidance on the use of additional analytic or developmental tools, such as Hive and Pig, that offer high-level interfaces for data evaluation and visualization

Hands-on help in developing Hadoop applications that deliver the data processing and analysis you need.

How to connect Hadoop to your existing IT infrastructure: we can help with moving data between Hadoop and data warehouses, collecting data from file systems, document repositories, logging infrastructure, and other sources, and setting up existing visualization and analytic tools to work with Hadoop.

Performance audits of your Hadoop cluster, with tuning recommendations for speed, throughput, and response times.
Training (http://www.cloudera.com/hadoop-training)

Cloudera offers numerous online training resources and live public sessions:

Developer Training and Certification
Cloudera offers a three-day training program targeted toward developers who want to learn how to use Hadoop to build powerful data-processing applications. The course assumes only a casual understanding of Hadoop and teaches you everything you need to know to take advantage of some of its most powerful features. It goes into deep detail about Hadoop itself, but also devotes ample time to hands-on exercises, importing data from existing sources, working with Hive and Pig, debugging MapReduce, and much more. A full agenda is on the registration page. This course includes the certification exam to become a Cloudera Certified Hadoop Developer.

Sysadmin Training and Certification
Systems administrators need to know how Hadoop operates in order to deploy and manage clusters for their organizations. Cloudera offers a two-day intensive course on Hadoop for operations staff. The course describes Hadoop's architecture, covers the management and monitoring tools most commonly used to oversee it, and provides valuable advice on setting up, maintaining, and troubleshooting Hadoop for development and production systems. This course includes the certification exam to become a Cloudera Certified Hadoop Administrator.

HBase Training
Use HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. HBase training covers the HBase architecture, data model, and Java API, as well as some advanced topics and best practices. This training is for developers (Java experience is recommended) who already have a basic understanding of Hadoop.