2nd IEEE International Conference on Cloud Computing Technology and Science




Attaching Cloud Storage to a Campus Grid Using Parrot, Chirp, and Hadoop

Patrick Donnelly, Peter Bui, Douglas Thain
Computer Science and Engineering
University of Notre Dame
pdonnel3, pbui, dthain@nd.edu


Abstract

   The Hadoop filesystem is a large scale distributed filesystem used to manage and quickly process extremely large data sets. We want to utilize Hadoop to assist with data-intensive workloads in a distributed campus grid environment. Unfortunately, the Hadoop filesystem is not designed to work in such an environment easily or securely. We present a solution that bridges the Chirp distributed filesystem to Hadoop for simple access to large data sets. Chirp layers many grid computing desirables on top of Hadoop, including simple deployment without special privileges, easy access via Parrot, and strong and flexible security through Access Control Lists (ACLs). We discuss the challenges involved in using Hadoop on a campus grid and evaluate the performance of the combined systems.

1 Introduction

   A campus grid is a large distributed system encompassing all of the computing power found in a large institution, from low-end desktop computers to high end server clusters. In the last decade, a wide variety of institutions around the world have constructed campus grids by deploying software such as the Condor [16] distributed computing system. One of the largest known public campus grids is BoilerGrid, consisting of over 20,000 cores harnessed at Purdue University [2]; and systems consisting of merely a few thousand cores are found all around the country [13]. Campus grids have traditionally been most effective at running high-throughput computationally-intensive applications. However, as the data needs of applications in the campus grid increase, it becomes necessary to provide a system that can provide high capacity and scalable I/O performance.

   To address this, we consider the possibility of using Hadoop [8] as a scalable filesystem to serve the I/O needs of a campus grid. Hadoop was originally designed to serve the needs of web search engines and other scalable applications that require highly scalable streaming access to large datasets. Hadoop excels at dealing with large files on the order of terabytes and petabytes while ensuring data integrity and resiliency through checksums and data replication. Hadoop is typically used to enable the massive processing of large data files through the use of distributed computing abstractions such as Map Reduce [5].

   In principle, we would like to deploy Hadoop as a scalable storage service in a central location, and allow jobs running on the campus grid to access the filesystem. Unfortunately, there are a number of practical barriers to accomplishing this. Most applications are not designed to operate with the native Hadoop Java interface. When executing on a campus grid, it is not possible to install any dependent software on the local machine. The Hadoop protocol was not designed to be secure over wide area networks, and varies in significant ways across versions of the software, making it difficult to upgrade the system incrementally.

   To address each of these problems, we have coupled Hadoop to the Chirp filesystem, and employed the latter to make the storage interface accessible, usable, secure, and portable. Chirp [15] is a distributed filesystem for grid computing created at Notre Dame and used around the world. It has been shown to give favorable performance [15] [14] for distributed workflows. It is primarily used to easily export a portion of the local filesystem for use by remote workers. Chirp is built to be easily deployed temporarily for a job and then torn down later. It supports unprivileged deployment, strong authentication options, sensible security of data, and grid-friendly consistency semantics. Using Parrot [12], we can transparently connect the application to Chirp and thereby to Hadoop.

   The remainder of this paper presents our solution for attaching Hadoop to a campus grid via the Chirp filesystem. We begin with a detailed overview of Hadoop, Parrot, and Chirp, explaining the limitations of each. We explored two distinct solutions to this problem.

   By connecting Parrot directly to Hadoop, we achieve the most scalable performance, but a less robust system. By connecting Parrot to Chirp and then to Hadoop, we achieve a much more stable system, but with a modest performance penalty. The final section of the paper evaluates the scalability of our solution on both local area and campus area networks.

2 Background

   The following is an overview of the systems we are trying to connect to allow campus grid users access to scalable distributed storage. First, we examine the Hadoop filesystem and provide a summary of its characteristics and limitations. Throughout the paper we refer to the Hadoop filesystem as HDFS or simply Hadoop. Next, we introduce the Chirp filesystem, which is used at the University of Notre Dame and other research sites for grid storage. Finally, we describe Parrot, a virtual filesystem adapter that provides applications transparent access to remote storage systems.

2.1 Hadoop

   Hadoop [8] is an open source large scale distributed filesystem modeled after the Google File System [7]. It provides extreme scalability on commodity hardware, streaming data access, built-in replication, and active storage. Hadoop is well known for its extensive use in web search by companies such as Yahoo! as well as by researchers for data intensive tasks like genome assembly [10].

   A key feature of Hadoop is its resiliency to hardware failure. The designers of Hadoop expect hardware failure to be normal during operation rather than exceptional [3]. Hadoop is a "rack-aware" filesystem that by default replicates all data blocks three times across storage components to minimize the possibility of data loss.

   Large data sets are a hallmark of the Hadoop filesystem. File sizes are anticipated to exceed hundreds of gigabytes to terabytes in size. Users may create millions of files in a single data set and still expect excellent performance. The file coherency model is "write once, read many": a file that is written and then closed cannot be written to again. In the future, Hadoop is expected to provide support for file appends. Generally, users do not need support for random writes to a file in everyday applications.
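   For illustration, interacting with HDFS through the native command line client typically amounts to loading a data set once and streaming it back many times; the paths and the downstream program below are examples, and the client requires a local Java and Hadoop installation:

   # Load a large input file into HDFS, then stream it back for processing.
   $ hadoop fs -put genome.dat /datasets/genome.dat
   $ hadoop fs -cat /datasets/genome.dat | ./assemble_reads

   # The file cannot be reopened for writing once it has been closed.
   $ hadoop fs -ls /datasets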
   Despite many attractive properties, Hadoop has several essential problems that make it difficult to use from a campus grid environment:

   • Interface. Hadoop provides a native Java API for accessing the filesystem, as well as a POSIX-like C API, which is a thin wrapper around the Java API. In principle, applications could be re-written to access these interfaces directly, but this is a time consuming and intrusive change that is unattractive to most users and developers. A FUSE [1] module allows unmodified applications to access Hadoop through a user-level filesystem driver, but this method still suffers from the remaining three problems:

   • Deployment. To access Hadoop by any method, it is necessary to install a large amount of software, including the Java virtual machine, and resolve a number of dependencies between components. Using the FUSE module requires administrator access to install the FUSE kernel modules, install a privileged executable, and update the local administrative group definitions to permit access. On a campus grid consisting of thousands of borrowed machines, this level of access is simply not practical.

   • Authentication. Hadoop was originally designed with the assumption that all of its components were operating on the same local area, trusted network. As a result, there is no true authentication between components – when a client connects to Hadoop, it simply claims to represent principal X, and the server component accepts that assertion at face value. On a campus-area network, this level of trust is inappropriate, and a stronger authentication method is required.

   • Interoperability. The communication protocols used between the various components of Hadoop are also tightly coupled, and change with each release of the software as additional information is added to each interaction. This presents a problem on a campus grid with many simultaneous users, because applications may run for weeks or months at a time, requiring upgrades of both clients and servers to be done independently and incrementally. A good solution requires both forwards and backwards compatibility.

2.2 Chirp

   Chirp [15] is a distributed filesystem used in grid computing for easy user-level export of and access to file data. It is a regular user level process that requires no privileged access during deployment. The user simply specifies a directory on the local filesystem to export for outside access. Chirp offers flexible and robust security through per-directory ACLs and various strong authentication modes. Figure 1a provides a general overview of the Chirp distributed filesystem and how it is used.

   A Chirp server is set up using a simple command as a regular user. The server is given a root directory as an option to export for outside access. Users can authenticate to the server using various schemes including the Grid Security Infrastructure (GSI) [6], Kerberos [11], or hostnames. Access to data is controlled by a per-directory ACL system using these authentication credentials.
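   As a concrete sketch, exporting a directory and granting access with an ACL might look like the following; the directory, port, hostnames, and rights string are examples, and exact option names may vary between Chirp releases:

   # Start a Chirp server as an ordinary user, exporting a chosen directory.
   $ chirp_server -r /home/alice/export -p 9094

   # Grant list/read/write rights on the root of the export to hosts in a
   # trusted domain, using the per-directory ACL system.
   $ chirp server.campus.edu:9094 setacl / 'hostname:*.campus.edu' rwl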


[Figure 1 diagrams: (a) The Chirp Filesystem: applications (tcsh, emacs, perl, custom tools) reach a Chirp server over the Chirp network protocol via Parrot, FUSE, or libchirp, and the server exports a local Unix filesystem. (b) The Chirp Filesystem Bridging Hadoop: the same client stack, with the Chirp server backed by a Hadoop headnode and datanodes.]

Figure 1: Applications can use Parrot, FUSE, or libchirp directly to communicate with a Chirp server on the client side. With the modifications presented in this paper, Chirp can now multiplex between a local Unix filesystem and a remote storage service such as Hadoop.


   A Chirp server requires neither special privileges nor kernel modification to operate. Chirp can be deployed to a grid by a regular user to operate on personal data securely. We use Chirp at the University of Notre Dame to support workflows that require a personal file bridge to expose filesystems such as AFS [9] to grid jobs. Chirp is also used for scratch space that harnesses the extra disk space of various clusters of machines. One can also use Chirp as a cluster filesystem for greater I/O bandwidth and capacity.

2.3 Parrot

   To enable existing applications to access distributed filesystems such as Chirp, the Parrot [12] virtual filesystem adapter is used to redirect I/O operations for unmodified applications. Normally, applications would need to be changed to allow them to access remote storage services such as Chirp or Hadoop. With Parrot attached, unmodified applications can access remote distributed storage systems transparently. Because Parrot works by trapping program system calls through the ptrace debugging interface, it can be deployed and installed without any special system privileges or kernel changes.

   One service Parrot supports is the Chirp filesystem previously described. Users may connect to Chirp using a variety of methods. For instance, the libchirp library provides an API that manages connections to a Chirp server and allows users to construct custom applications. However, a user will generally use Parrot to transparently connect existing code to a Chirp filesystem without having to interface directly with libchirp. While applications may also use FUSE to remotely mount a Chirp fileserver, Parrot is more practical in a campus grid system where users may not have the necessary administrative access to install and run the FUSE kernel module. Because Parrot runs in userland and has minimal dependencies, it is easy to distribute and deploy in restricted campus grid environments.
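   For example, an unmodified application can be run under Parrot and operate on files exported by a Chirp server as if they were local; the server name and paths below are illustrative:

   # Run an ordinary shell under Parrot; its system calls are trapped via ptrace.
   $ parrot_run bash

   # Files exported by the Chirp server appear under the /chirp namespace.
   $ ls /chirp/server.campus.edu/experiment/
   $ grep result /chirp/server.campus.edu/experiment/output.txt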

3 Design and Implementation

   This section describes the changes we have made to implement a usable and robust system that allows applications to transparently access the Hadoop filesystem. The first approach is to connect Parrot directly to HDFS, which provides for scalable performance, but has some limitations. The second approach is to connect Parrot to Chirp and thereby Hadoop, which allows us to overcome these limitations with a modest performance penalty.

3.1 Parrot + HDFS

   Our initial attempt at integrating Hadoop into our campus grid environment was to utilize the Parrot middleware filesystem adapter to communicate with our test Hadoop cluster. To do this we used libhdfs, a C/C++ binding for HDFS provided by the Hadoop project, to implement an HDFS service for Parrot. As with other Parrot services, the HDFS Parrot module (parrot_hdfs) allows unmodified end user applications to access remote filesystems, in this case HDFS, transparently and without any special privileges. This is different from tools such as FUSE which require kernel modules and thus administrative access. Since Parrot is a static binary and requires no special privileges, it can easily be deployed around our campus grid and used to access remote filesystems.

[Figure 2 diagram: applications (tcsh, emacs, perl) issue ordinary Unix system calls, which Parrot routes through libhdfs and Hadoop RPC to the Hadoop headnode and datanodes.]

Figure 2: In the parrot_hdfs tool, we connect Parrot directly to Hadoop using libhdfs, which allows client applications to communicate with an HDFS service.

   Figure 2 shows the architecture of the parrot_hdfs module. Applications run normally until an I/O operation is requested. When this occurs, Parrot traps the system call using the ptrace system interface and reroutes it to the appropriate filesystem handler. In the case of Hadoop, any I/O operation on the Hadoop filesystem is implemented using the libhdfs library. Of course, the trapping and redirection of system calls incurs some overhead, but as shown in a previous paper [4], the parrot_hdfs module is quite serviceable and provides adequate performance.
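   As an illustration, an unmodified tool can reach HDFS through Parrot along these lines; the hostname, port, and paths are examples, and this sketch assumes a Parrot build with the HDFS service enabled:

   # Run a shell under Parrot; HDFS then appears as part of the filesystem namespace.
   $ parrot_run bash
   $ ls /hdfs/headnode.hadoop:9100/datasets
   $ cp /hdfs/headnode.hadoop:9100/datasets/input.dat /tmp/input.dat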
   Unfortunately, the parrot_hdfs module was not quite a perfect solution for integrating Hadoop into our campus infrastructure. First, while Parrot is lightweight and has minimal dependencies, libhdfs, the library required for communicating with Hadoop, does not. In fact, to use libhdfs we must have access to a whole Java virtual machine installation, which is not always possible in a heterogeneous campus grid with multiple administrative domains.

   Second, only sequential writing was supported by the module. This is because Hadoop, by design, is a write-once-read-many filesystem and thus is optimized for streaming reads rather than frequent writes. Moreover, due to the instability of the append feature in the version of Hadoop deployed in our cluster, it was also not possible to support appends to existing files. Random and sequential reads, however, were fully supported. The parrot_hdfs module did not provide a means of overcoming this limitation.

   Finally, the biggest impediment to using both Hadoop and the parrot_hdfs module was the lack of proper access controls. Hadoop's permissions can be entirely ignored as there is no authentication mechanism. The filesystem accepts whatever identity the user presents to Hadoop. So while a user may partition the Hadoop filesystem for different groups and users, any user may report themselves as any identity, thus nullifying the permission model. Since parrot_hdfs simply exposes the underlying access control system, this problem is not mitigated in any way by the service. For a small isolated cluster filesystem this is not a problem, but for a large campus grid, such a lack of permissions is worrisome and a deterrent to wide adoption, especially if sensitive research data is being stored.

   So although parrot_hdfs is a useful tool for integrating Hadoop into a campus grid environment, it has some limitations that prevent it from being a complete solution. Moreover, Hadoop itself has limitations that need to be addressed before it can be widely deployed and used in a campus grid.

3.2 Parrot + Chirp + HDFS

   To overcome these limitations in both parrot_hdfs and Hadoop, we decided to connect Parrot to Chirp and then Hadoop. In order to accomplish this, we had to implement a new virtual multiplexing interface in Chirp to allow it to use any backend filesystem. With this new capability, a Chirp server could act as a frontend to potentially any other filesystem, such as Hadoop, which is not usually accessible with normal Unix I/O system calls.

   In our modified version of Chirp, the new virtual multiplexing interface routes each I/O Remote Procedure Call (RPC) to the appropriate backend filesystem. For Hadoop, all data access and storage on the Chirp server is done through Hadoop's libhdfs C library. Figure 1b illustrates how Chirp operates with Hadoop as a backend filesystem.

   With our new system, a simple command line switch changes the backend filesystem during initialization of the server. The command to start the Chirp server remains simple to start and deploy:

$ chirp_server -f hdfs
               -x headnode.hadoop:9100

   Clients that connect to the Chirp server such as the one above will work over HDFS instead of on the server's local filesystem. The head node of Hadoop and the port it listens on are specified using the -x <host>:<port> switch. The root directory of the Hadoop filesystem exported by Chirp is changed via the -r switch. Because Hadoop and Java require multiple inconvenient environment variables to be set up prior to execution, we have also provided the chirp_server_hdfs wrapper, which sets these variables to configuration-defined values before running chirp_server.
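   As a rough sketch of what such a wrapper does, a script along the following lines sets the environment libhdfs needs before launching the server; the variable values and paths are hypothetical and depend on the local Java and Hadoop installation:

   #!/bin/sh
   # Hypothetical chirp_server_hdfs-style wrapper: point libhdfs at the local
   # Java and Hadoop installations, then start the Chirp server on HDFS.
   export JAVA_HOME=/usr/lib/jvm/java
   export HADOOP_HOME=/opt/hadoop
   export CLASSPATH=$(ls $HADOOP_HOME/*.jar $HADOOP_HOME/lib/*.jar | tr '\n' ':')
   export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$LD_LIBRARY_PATH
   exec chirp_server -f hdfs -x headnode.hadoop:9100 "$@"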


   Recall that Hadoop has no internal authentication system; it relies on external mechanisms. A user may declare any identity and group membership when establishing a connection with the Hadoop cluster. The model naturally relies on a trustworthy environment; it is not expected to be resilient to malicious use. In a campus environment, this may not be the case. So, besides being an excellent tool for grid storage in its own right, Chirp brings the ability to wall off Hadoop from the campus grid while still exposing it in a secure way. To achieve this, we would expect the administrator to firewall off regular Hadoop access while still exposing Chirp servers running on top of Hadoop. This setup is a key difference from parrot_hdfs, discussed in the previous section. Chirp brings its strong authentication mechanisms and flexible access control lists to protect Hadoop from unintended or accidental access, whereas parrot_hdfs connects to an exposed, unprotected Hadoop cluster.
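   Under this arrangement, jobs reach Hadoop only through the Chirp frontend. A sketch of the resulting workflow, with hostnames, ports, identity strings, and rights letters given purely as examples, is:

   # The administrator restricts the exported tree to authenticated identities,
   # e.g. a GSI subject, using Chirp's per-directory ACLs.
   $ chirp chirpfrontend.campus.edu:9094 setacl / "globus:/O=CampusGrid/CN=Alice" rwld

   # A grid job then accesses Hadoop-backed storage transparently through Parrot.
   $ parrot_run cp /chirp/chirpfrontend.campus.edu:9094/results/run1.dat /tmp/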

3.3 Limitations when Bridging Hadoop with Chirp

   The new modified Chirp provides users with an approximate POSIX interface to the Hadoop filesystem. Certain restrictions are unavoidable due to limitations stemming from design decisions in HDFS. The API provides most POSIX operations and semantics that are needed for our bridge with Chirp, but not all. We list certain notable problems we have encountered here.

   • Errno Translation. The Hadoop libhdfs C API binding interprets exceptions or errors within the Java runtime and sets errno to an appropriate value. These translations are not always accurate and some are missing. A defined errno value within libhdfs is EINTERNAL, which is the generic error used when another is not found to be more appropriate. Chirp tries to sensibly handle these errors by returning a useful status to the user. Some errors must be caught early because Hadoop does not properly handle all error conditions. For example, attempting to open a file that is a directory results in errno being set to EINTERNAL (an HDFS generic EIO error) rather than the correct EISDIR.

   • Appends to a File. Appends are largely unsupported in Hadoop within production environments. Hadoop went without appends for a long time in its history because the Map Reduce abstraction did not benefit from them. Still, use cases for appends surfaced and work to add support – and determine the appropriate semantics – has been in progress since 2007. The main barriers to its development are coherency semantics with replication and correctly dealing with the (now) mutable last block of a file. As of now, turning appends on requires setting experimental parameters in Hadoop. Chirp provides some support for appends by implementing the operation through other primitives within Hadoop. That is, Chirp copies the file into memory, deletes the file, writes the contents of memory back to the file, and returns the file for further writes – effectively allowing appends. Support for appends to small files fits most use cases, logs in particular. Appending to a large file is not advised. The solution is naturally less than optimal but is necessary until Hadoop provides non-experimental support.

   • Random Writes. Random writes to a file are not allowed by Hadoop. This type of restriction generally affects programs, like linkers, that do not write to a file sequentially. In Chirp, writes to areas other than the end of the file are silently changed to appends instead.

   • Other POSIX Incompatibilities. Hadoop has other peculiar incompatibilities with POSIX. The execute bit for regular files is nonexistent in Hadoop. Chirp gets around this by reporting the execute bit set for all users. Additionally, Hadoop does not allow renaming a file to one that already exists. This would be equivalent to the Unix mv operation from one existing file to another existing file. Further, opening, unlinking, and closing a file will result in Hadoop failing with an EINTERNAL error. For these types of semantic errors, the Chirp server and client have no way of reasonably responding to or anticipating such an error. We expect that as the HDFS binding matures many of these errors will be responded to in a more reasonable way.

4 Evaluation

   To evaluate the performance characteristics of our Parrot + Chirp + HDFS system, we ran a series of benchmark tests. The first is a set of micro-benchmarks designed to test the overhead of individual I/O system calls when using Parrot with various filesystems. The second is a data transfer performance test of our new system versus the native Hadoop tools. Our tests ran on a 32 node cluster of machines (located off Notre Dame's main campus) with the Hadoop setup using 15 TB of aggregate capacity. The cluster is using approximately 50% of available space. The Hadoop cluster shares a 1 gigabit link to communicate with the main campus network. We have set up a Chirp server as a frontend to the Hadoop cluster on one of the data nodes.


[Figure 3: Bandwidth (MB/s) versus file size (MB) for 2, 4, 8, and 16 clients over the campus network. (a) Campus Network Reads to Chirp+HDFS. (b) Campus Network Reads to HDFS. (c) Campus Network Writes to Chirp+HDFS. (d) Campus Network Writes to HDFS.]


4.1 Micro-benchmarks

   The first set of tests we ran were a set of micro-benchmarks that measure the average latency of various I/O system calls such as stat, open, read, and close. For this test we wanted to know the cost of putting Chirp in front of HDFS and using Parrot to access HDFS through our Chirp server. We compared these latencies to those for accessing a Chirp server hosting a normal Unix filesystem through Parrot, and accessing HDFS directly using Parrot.

[Figure 5: Latency of I/O Operations using Micro-benchmarks. Latencies (in milliseconds) of the stat, open, read, and close operations for Parrot+Chirp+Unix, Parrot+HDFS, and Parrot+Chirp+HDFS.]

   The results of these micro-benchmarks are shown in Figure 5. As can be seen, our new Parrot + Chirp + HDFS system has some overhead when compared to accessing a Chirp server or HDFS service using Parrot. In particular, the metadata operations of stat and open are particularly heavy in our new system while read and close are relatively close. The reason for this high overhead on metadata-intensive operations is the access control lists Chirp uses to implement security and permissions. For each stat or open, multiple files must be opened and checked in order to verify valid access, which increases the latencies for these system calls.

4.2 Performance benchmarks

   For our second set of benchmarks, we stress test the Chirp server by determining the throughput achieved when putting and getting very large files. These types of commands are built into Chirp as the getfile and putfile RPCs. Our tests compared the equivalent Hadoop native client hadoop fs -get and hadoop fs -put commands to the Chirp ones. The computer on the campus network initiating the tests used the Chirp command line tool, which uses the libchirp library without any external dependencies, and the Hadoop native client, which requires all the Hadoop and Java dependencies. We want to test how the Chirp server performs under load compared to the Hadoop cluster by itself. For these tests, we communicated with the Hadoop cluster and the Chirp server frontend from the main campus. All communication went through the 1 gigabit link to the cluster.
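   The two access paths under comparison can be illustrated as follows; hostnames and file names are examples:

   # Writing and reading a large file through the Chirp frontend using the
   # standalone chirp client (putfile/getfile RPCs).
   $ chirp chirpfrontend.campus.edu put bigfile.dat /bench/bigfile.dat
   $ chirp chirpfrontend.campus.edu get /bench/bigfile.dat copy.dat

   # The same transfers using the native Hadoop client, which requires the full
   # Hadoop and Java software stack on the client machine.
   $ hadoop fs -put bigfile.dat /bench/bigfile.dat
   $ hadoop fs -get /bench/bigfile.dat copy.dat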


[Figure 4: Bandwidth (MB/s) versus file size (MB) for 2, 4, 8, and 16 clients over the LAN. (a) LAN Network Reads to Chirp+HDFS. (b) LAN Network Reads to HDFS. (c) LAN Network Writes to Chirp+HDFS. (d) LAN Network Writes to HDFS.]


   The graphs in Figures 3a and 3b show the results of reading (getting) a file from the Chirp server frontend and via the native Hadoop client, respectively. We see that the Chirp server is able to maintain throughput comparable to the Hadoop native client; the Chirp server does not present a real bottleneck to reading a file from Hadoop. In Figures 3c and 3d, we have the results of writing (putting) a file to the Chirp server frontend and native Hadoop client, again respectively. These graphs similarly show that the Chirp server obtains similar throughput to the Hadoop native client for writes. For these tests, the 1 gigabit link plays a major role in limiting the exploitable parallelism obtainable via writing to Hadoop over a faster (local) network. Despite hamstringing Hadoop's potential parallelism, we find these tests valuable because machines on the campus network will access the Hadoop storage in this way. For contrast, in the next tests, we remove this barrier.

   Next, we place the clients on the same LAN as the Hadoop cluster and the Chirp server frontend. We expect that throughput will not be hindered by the link rate of the network and that parallelism will allow Hadoop to scale. Similar to the tests from the main campus grid, the clients put and get files to the Chirp server and via the Hadoop native client.

   The graphs in Figures 4a and 4b show the results of reading a file over the LAN setup. We see that we get similar throughput over the Chirp server as in the main campus tests. While the link to the main campus grid is no longer a limiting factor, the Chirp server's machine still only has a gigabit link to the LAN. Meanwhile, Hadoop achieves respectably good performance, as expected, when reading files with a large number of clients. The parallelism gives Hadoop almost twice the throughput. For a larger number of clients, Hadoop's performance outstrips Chirp by a factor of 10. The write performance in Figures 4c and 4d shows similar results. The writes over the LAN bottleneck on the gigabit link to the Chirp server, while Hadoop incurs only minor performance hits as the number of clients increases.


   In both cases, the size of the file did not have a significant effect on the average throughput. Because one (Chirp) server cannot handle the throughput due to network link restrictions in a LAN environment, we recommend adding more Chirp servers on other machines in the LAN to help distribute the load.

   We conclude from these benchmark evaluations that Chirp achieves adequate performance compared to Hadoop, with some overhead. In a campus grid setting we see that Chirp gets approximately equivalent streaming throughput compared to the native Hadoop client. We find this acceptable for the type of environment we are targeting.

5 Conclusions

   Hadoop is a highly scalable filesystem used in a wide array of disciplines but lacks support for campus or grid computing needs. Hadoop lacks strong authentication mechanisms and access control. Hadoop also lacks ease of access due to the number of dependencies and program requirements. We have designed a solution that enhances the functionality of Chirp, coupled with Parrot or FUSE, to provide a bridge between a user's work and Hadoop. This new system can be used to correct these problems in Hadoop so it may be used safely and securely on the grid without significant loss in performance.

   The new version of the Chirp software is now capable of multiplexing the I/O operations on the server so that operations are handled by filesystems other than the local Unix system. Chirp being capable of bridging Hadoop to the grid for safe and secure use is a particularly significant consequence of this work. Chirp can be freely downloaded at: http://www.cse.nd.edu/~ccl/software.

References

 [1] Filesystem in user space. http://sourceforge.net/projects/fuse.

 [2] BoilerGrid web site. http://www.rcac.purdue.edu/userinfo/resources/boilergrid, 2010.

 [3] D. Borthakur. The Hadoop distributed file system: Architecture and design. http://hadoop.apache.org/, 2007.

 [4] H. Bui, P. Bui, P. Flynn, and D. Thain. ROARS: A Scalable Repository for Data Intensive Scientific Computing. In The Third International Workshop on Data Intensive Distributed Computing at ACM HPDC 2010, 2010.

 [5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Operating Systems Design and Implementation, 2004.

 [6] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke. A Security Architecture for Computational Grids. In ACM Conference on Computer and Communications Security, pages 83–92, San Francisco, CA, November 1998.

 [7] S. Ghemawat, H. Gobioff, and S. Leung. The Google filesystem. In ACM Symposium on Operating Systems Principles, 2003.

 [8] Hadoop. http://hadoop.apache.org/, 2007.

 [9] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and performance in a distributed file system. ACM Trans. on Comp. Sys., 6(1):51–81, February 1988.

[10] M. C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363–1369, 2009.

[11] J. Steiner, C. Neuman, and J. I. Schiller. Kerberos: An authentication service for open network systems. In Proceedings of the USENIX Winter Technical Conference, pages 191–200, 1988.

[12] D. Thain and M. Livny. Parrot: Transparent user-level middleware for data-intensive computing. Technical Report 1493, University of Wisconsin, Computer Sciences Department, December 2003.

[13] D. Thain and M. Livny. How to Measure a Large Open Source Distributed System. Concurrency and Computation: Practice and Experience, 18(15):1989–2019, 2006.

[14] D. Thain and C. Moretti. Efficient access to many small files in a filesystem for grid computing. In IEEE Conference on Grid Computing, Austin, TX, September 2007.

[15] D. Thain, C. Moretti, and J. Hemmes. Chirp: A practical global file system for cluster and grid computing. Journal of Grid Computing, to appear in 2008.

[16] D. Thain, T. Tannenbaum, and M. Livny. Condor and the grid. In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality. John Wiley, 2003.
     2010, 2010.


