University of Calabria
                      Bachelor thesis in
               Computer Engineering

High-throughput computing and opportunistic computing
   for matchmaking processes and indexing processes

       Supervisor                           Bachelor Candidate
       Ing. Carlo Mastroianni               Silvio Sangineto

                                            Matriculation
                                            Number: 83879



                                2007-2008
Contents
   • Introduction to the Thesis
   • Introduction to Distributed Systems
   • Introduction to the Grid, High-throughput Computing and opportunistic computing
   • Condor
   • Why Condor?
   • Introduction to Prototype Architecture
   • Centralized prototype architecture
   • Centralized Scorer
   • Results achieved
   • A possible solution: Distributed Scorer
   • Distributed Scorer
   • New Results achieved
   • From “local” business case to the big business case…
Introduction to the Thesis
Creation of a distributed Web-Spider, with particular attention to
efficiency, scalability, energy saving and costs.

Description:
The goal of this project is to recover the URLs of the Web-Sites of
Italian companies. This is possible because we can use a customer database
with general information about each company: VAT number, phone, emails,
etc. This information can be matched against Web-Site contents, so we can
find the official Web-Site of each company.

At present, no complete list of the Italian companies that have a
Web-Site exists in Italy.

Why:
Knowing the official Web-Site is very important because it lets you quickly find:
• contacts and emails for the company;
• updates and news previews;
• descriptions of the company's activities;
• other information (e.g. history).
Introduction to the Thesis
Boundary problems for my thesis:
• It is difficult to estimate how many companies have a Web-Site (coverage level);
• Web-Site structures can be non-standard in many parts (some Web-Sites may not
  contain a VAT number, email, etc.);
• Updates to the database that contains the URLs must capture both the Web-Site
  of a new company and the new Web-Site of an existing company;
• Some privacy problems (e.g. email).

Relevant problems for my thesis:
• Load balancing and efficient resource utilization;
• Scalability;
• Costs;
• Energy saving.

General solution: when the Web-Spiders that exist on the Web (e.g. Google)
need more computational power, the company usually buys more servers to
provide it.
Introduction to the Thesis
We want to find answers to the relevant problems in the “local”
business case, and then use these solutions for the “big”
business case.
Introduction to Distributed Systems

Definition:
A distributed system consists of a collection of autonomous computers, connected through a network and
distribution middleware, which enables computers to coordinate their activities and to share the resources
of the system, so that users perceive the system as a single, integrated computing facility.

In our case we use a distributed system to have more computational power…

Advantages of a distributed system:
• Reliability;
• Sharing of resources;
• Aggregate computing power;
• Scalability.
Grid Computing, High-throughput computing and opportunistic computing
Grid Computing: Grids are intrinsically distributed and heterogeneous but must be viewed by
the user (whether an individual or another computer) as a virtual environment with uniform
access to resources. Much of Grid software technology addresses the issues of resource
scheduling, quality of service, fault tolerance, decentralized control, security and so on, which
enable the Grid to be perceived as a single virtual platform by the user.

High-throughput computing:
The goal of a high-throughput computing environment is to provide large amounts of
fault-tolerant computational power over prolonged periods of time by effectively
utilizing all resources available to the network.

Opportunistic computing:
The goal of opportunistic computing is the ability to utilize resources whenever they are
available, without requiring 100% availability.

The two goals are naturally coupled. High-throughput computing is most
easily achieved through opportunistic means.
Condor
Modern processing environments that consist of large collections of workstations interconnected
by high capacity network raise the following challenging question: can we satisfy the needs of
users who need extra capacity without lowering the quality of service experienced by the owners of
underutilized workstations? ... The Condor scheduling system is our answer to this question.

  At the University of Wisconsin, Miron Livny combined his 1983 doctoral thesis on
  cooperative processing with the powerful Crystal Multicomputer designed by
  DeWitt, Finkel, and Solomon and the novel Remote UNIX software designed by
  Litzkow. The result was Condor, a new system for distributed computing.
  The goal of the Condor Project is to develop, implement, deploy, and evaluate
  mechanisms and policies that support High Throughput Computing and
  opportunistic computing on large collections of distributively owned computing
  resources. Guided by both the technological and sociological challenges of such a
  computing environment, the Condor Team has been building software tools that
  enable scientists and engineers to increase their computing throughput. Condor is
  middleware that allows users to join and use the distributed resources.
Condor
Condor is a specialized job and resource management system (RMS) for
compute-intensive jobs. Like other full-featured systems, Condor provides a job
management mechanism, scheduling policy, priority scheme, resource monitoring, and
resource management. Users submit their jobs to Condor, and Condor subsequently
chooses when and where to run them based upon a policy, monitors their progress, and
ultimately informs the user upon completion.

Two very important mechanisms:
• ClassAds: The ClassAd mechanism in Condor provides an extremely flexible and
  expressive framework for matching resource requests (e.g. jobs) with resource offers
  (e.g. machines).
• Remote System Calls: When running jobs on remote machines, Condor can often
  preserve the local execution environment via remote system calls. Remote system calls are
  one of Condor's mobile sandbox mechanisms for redirecting all of a job's I/O-related
  system calls back to the machine that submitted the job. Therefore, users do not need to
  make data files available on remote workstations before Condor executes their programs
  there, even in the absence of a shared file system.
Condor
How does Condor work?
This is an example: an agent (A) is shown executing a job on a resource (R)
with the help of a matchmaker (M).

Step 1: The agent and the resource advertise themselves to the matchmaker.
Step 2: The matchmaker informs the two parties that they are potentially compatible.
Step 3: The agent contacts the resource and executes a job.

[Figure: the major processes in a Condor system]
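
The three steps above can be sketched as a toy matchmaker in Java. This is only an illustration under assumptions: the names `Ad` and `Match`, the attributes `os` and `memoryMb`, and the compatibility rule are hypothetical, and real Condor expresses advertisements and requirements in the ClassAd language rather than in Java.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of Condor-style matchmaking; every name here is an illustrative assumption.
final class MatchmakerSketch {

    // A simplified advertisement: who sent it and what it requires or offers.
    record Ad(String owner, String os, int memoryMb) {}

    // A notification that an agent and a resource are potentially compatible.
    record Match(String agent, String resource) {}

    private final List<Ad> requests = new ArrayList<>();
    private final List<Ad> offers = new ArrayList<>();

    // Step 1: agents and resources advertise themselves to the matchmaker.
    void advertiseRequest(Ad request) { requests.add(request); }
    void advertiseOffer(Ad offer) { offers.add(offer); }

    // Step 2: pair each request with the first compatible offer that is still free.
    List<Match> negotiate() {
        List<Match> matches = new ArrayList<>();
        List<Ad> free = new ArrayList<>(offers);
        for (Ad request : requests) {
            for (Ad offer : free) {
                if (compatible(request, offer)) {
                    matches.add(new Match(request.owner(), offer.owner()));
                    free.remove(offer); // safe: we leave the inner loop right away
                    break;
                }
            }
        }
        return matches;
    }

    // Hypothetical rule: same operating system and enough memory.
    static boolean compatible(Ad request, Ad offer) {
        return request.os().equals(offer.os()) && offer.memoryMb() >= request.memoryMb();
    }

    public static void main(String[] args) {
        MatchmakerSketch m = new MatchmakerSketch();
        m.advertiseOffer(new Ad("R1", "LINUX", 1024));
        m.advertiseOffer(new Ad("R2", "WINDOWS", 2048));
        m.advertiseRequest(new Ad("A", "WINDOWS", 1500));
        // Step 3 (not modelled): the agent contacts the matched resource directly.
        System.out.println(m.negotiate());
    }
}
```

The matchmaker only introduces the two parties; in this sketch, as in the steps above, running the job is left to the agent and the resource.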
Condor
What happens when you have more than one Condor pool?
This is an example: an agent (A) is shown executing a job on a resource (R)
via direct flocking.

Step 1: The agent and the resource advertise themselves locally.
Step 2: The agent is unsatisfied, so it also advertises itself to Condor Pool B.
Step 3: The matchmaker (M) informs the two parties that they are potentially compatible.
Step 4: The agent contacts the resource and executes a job.
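
The flocking steps can be sketched in the same spirit: the agent asks its local pool first and turns to another pool only when it is left unsatisfied. The `Pool` record and the "first free machine" rule below are assumptions made for illustration and do not reflect how Condor is configured for flocking.

```java
import java.util.List;
import java.util.Optional;

// Toy model of direct flocking; names and matching rule are illustrative assumptions.
final class FlockingSketch {

    // A pool is modelled only by its name and the free machines its matchmaker knows.
    record Pool(String name, List<String> freeMachines) {
        // The pool's matchmaker proposes its first free machine, if it has one.
        Optional<String> match() {
            return freeMachines.isEmpty()
                    ? Optional.empty()
                    : Optional.of(freeMachines.get(0));
        }
    }

    // Try the pools in order: the local pool first, then the pools the agent may flock to.
    static Optional<String> findResource(List<Pool> poolsInOrder) {
        for (Pool pool : poolsInOrder) {
            Optional<String> machine = pool.match();
            if (machine.isPresent()) {
                return Optional.of(pool.name() + "/" + machine.get());
            }
        }
        return Optional.empty(); // the agent stays unsatisfied everywhere
    }

    public static void main(String[] args) {
        Pool local = new Pool("A", List.of());     // no free machine locally
        Pool remote = new Pool("B", List.of("R")); // Pool B has a free resource
        System.out.println(findResource(List.of(local, remote)));
    }
}
```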
Condor
Condor Universe:
Condor has several runtime environments (each called a universe) from which to
choose. The Java Universe was the best fit for the first version of our project:
it let us take advantage of portability (heterogeneous systems) and it suited
the “local” business case. A universe for Java programs was added to Condor in
late 2001, in response to a growing community of scientific users who wished to
perform simulations and other work in Java. Although such programs might run
slower than native code, such losses were offset by faster development times and
access to larger numbers of machines.
Why Condor?
• We used Condor for the following reasons (among others):
1) Efficient resource management (opportunistic computing and
   high-throughput computing, ClassAds, etc.);
2) It is middleware for heterogeneous distributed systems (e.g.
   we can use different operating systems);
3) It is an open-source project, used as a batch system in many
   projects around the world;
4) Flexibility.
Introduction to Centralized Prototype Architecture
[Figure: centralized prototype architecture. Components: Web-Sites, Crawler, Index,
Customer Data-Base, Scorer, Validator (manual), Updater, Data-Base of URLs.
Arrow labels: Make Index, Query Results, Identifying information, Candidates,
New Companies / New Web-Sites.]
Introduction to Centralized Prototype Architecture
Crawler:
The prototype Web-Spider must have a Crawler that builds an index of the companies'
Web-Sites (e.g. UbiCrawler). We can either rent an existing Crawler or build a new one on
top of products that are already available (Nutch, Heritrix, JSpider, etc.). In this business
case we used data extracted through UbiCrawler. For the indexing processes we used
Managing Gigabytes for Java (MG4J).

Customer Data-Base:
This database contains the identifying information about the companies: VAT number,
phone, emails, company name, sign, etc.

Scorer:
This step executes several queries and many matchmaking processes to find the right
“match” between the identifying information and the companies' Web-Sites. Each match
is given a score.
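
To make the idea of a scored match concrete, here is a minimal sketch in Java. It is not the scorer from the thesis: the field names, the weights and the plain substring test are all assumptions, and the actual prototype runs its queries against the index rather than against raw page text.

```java
import java.util.Locale;
import java.util.Map;

// Minimal scoring sketch; weights and field names are hypothetical.
final class ScoreSketch {

    // Assumed weight for each identifying field found on the page.
    static final Map<String, Integer> WEIGHTS =
            Map.of("vat", 50, "phone", 30, "name", 10, "sign", 10);

    // Sum the weights of every identifying field whose value occurs in the page text.
    static int score(Map<String, String> company, String pageText) {
        String haystack = pageText.toLowerCase(Locale.ROOT);
        int total = 0;
        for (Map.Entry<String, String> field : company.entrySet()) {
            String value = field.getValue();
            if (value == null || value.isBlank()) {
                continue; // nothing to match for this field
            }
            if (haystack.contains(value.toLowerCase(Locale.ROOT))) {
                total += WEIGHTS.getOrDefault(field.getKey(), 0);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up company record and page text, for illustration only.
        Map<String, String> company = Map.of(
                "vat", "01234567890",
                "phone", "0984 123456",
                "name", "Acme Srl");
        String page = "Acme Srl - P.IVA 01234567890 - Tel. 0984 000000";
        System.out.println(score(company, page));
    }
}
```

A candidate Web-Site with a higher score would then be a stronger candidate for the official Web-Site of that company.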
Centralized Scorer




[Figure: class diagram. These are the most important classes, showing the principal
processes of the Web-Spider together with the indexing processes.]
Centralized Scorer
The Centralized Scorer performs the following activities, which together produce the score:
• query over the phone and VAT number;
• query over the address;
• query over the other fields;
• check over the URL name;
• check over the page type.

All these activities complete in about 5 seconds on average, so this is how long
the analysis of one company takes. If you have to analyse 56,000 companies, you
have to wait about 280,000 seconds.

There is a big problem: the number of companies can be very high.
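
The arithmetic behind this estimate is simple enough to check with a few lines of Java. The only input taken from the slides is the 5-second average per company; the class and method names are ours.

```java
// Back-of-the-envelope estimate for a single machine working sequentially.
final class SequentialEstimate {

    // Average time per company reported on the slide.
    static final int SECONDS_PER_COMPANY = 5;

    // Total seconds when the companies are analysed one after another.
    static long totalSeconds(long companies) {
        return companies * SECONDS_PER_COMPANY;
    }

    static double toHours(long seconds) {
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        for (long companies : new long[] {56_000L, 1_000_000L}) {
            long seconds = totalSeconds(companies);
            System.out.printf("%d companies: %d s, about %.1f h%n",
                    companies, seconds, toHours(seconds));
        }
    }
}
```

For 56,000 companies this gives 280,000 seconds, a little under 78 hours of uninterrupted work on one machine.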
Centralized Scorer
We can glance at the Java code that implements some of the functions:

[Figure: the AssociaDomini constructor. This class mainly implements the logic
that performs the “match” between the identifying information and the
companies' Web-Sites.]
Centralized Scorer




[Figures 1/3 and 2/3: the associa() method]
Centralized Scorer



[Figure 3/3: how the associa() method records the results in a log file]


We preferred to use Hibernate because it is an open-source Java persistence
framework that performs powerful object-relational mapping.
Results Achieved
On a sample of 56,000 companies:

How many companies can you cover with these queries?

 Query          Coverage (#)   Coverage (%)
 Sign                   2747          4.43%
 Phone                 25715         41.47%
 VAT Number             4369          7.05%
 Company Name          27487         44.33%

What precision can you achieve?

 Query          Precision (%)
 Sign                      1%
 Phone                    25%
 VAT Number               55%
 Company Name              3%

Phone and VAT Number: these types of query are very good for coverage and for reliability.
Sign: low coverage.
Company Name: very good coverage but low precision.
Results Achieved
[Figure: Trend (s), computation time in seconds (axis from 0 to 300,000) for 500,
1,000, 10,000 and 56,000 companies]

For a sample of 56,000 companies:
1 personal computer works for 77 h (for this computation alone).

Personal computer used:
Desktop computer, Intel Dual Core 2.4 GHz, 2 GB of RAM and a 1 TB hard disk.

Possible problems:
• the personal computer goes down;
• there are new companies (updating) or some Web-Sites have changed; in this case
  the computation must continue…

The matchmaking processes and indexing processes recur frequently over time.

For 1,000,000 companies to analyse:
1 personal computer works for about 1,389 hours (roughly 58 days), and this is
the ideal case.
This isn't a scalable solution!
A possible solution: Distributed Scorer
We want to make our Web-Spider a scalable solution.

There are some important constraints that we have to respect:

1) Energy saving;
2) Efficient resource management and efficient resource utilization;
3) Cost cutting;
4) Analysing more companies over a long period of time.

We can submit each set of queries to a different computer!
Distributed Scorer
We built a distributed scorer using the Condor middleware.

This is a possible architecture on which to run our Distributed Scorer.

[Figure: example of the architecture used by the National Institute of Nuclear Physics]
Distributed Scorer
We used both vertical distribution and horizontal distribution.

We built a wrapper class to prepare the work environment on Condor. This class
implements the logical connection between the application and Condor, and it
runs on the server (Central Manager).
Distributed Scorer
[Figure: the Builder Job (makeFile(impresa), Execute()), the Central Manager and four Jobs]
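
The slides do not show what the wrapper writes, so the following is only a sketch of how a job for one batch of companies could be described for the Java universe. The commands `universe`, `executable`, `arguments`, `output`, `error`, `log` and `queue` are standard submit description commands; the main class name `Scorer`, the batch file names and the idea of passing company identifiers as arguments are assumptions.

```java
import java.util.List;

// Sketch: build the text of a submit description file for one batch of companies.
final class SubmitFileSketch {

    // mainClass, companyIds and the batch file naming are hypothetical choices.
    static String submitDescription(String mainClass, List<String> companyIds, int batch) {
        String nl = "\n";
        return "universe   = java" + nl
             // In the Java universe the executable is the compiled class file.
             + "executable = " + mainClass + ".class" + nl
             // The first argument names the class whose main method is run.
             + "arguments  = " + mainClass + " " + String.join(" ", companyIds) + nl
             + "output     = batch" + batch + ".out" + nl
             + "error      = batch" + batch + ".err" + nl
             + "log        = batch" + batch + ".log" + nl
             + "queue" + nl;
    }

    public static void main(String[] args) {
        // Made-up identifiers; a real run would take them from the customer database.
        System.out.print(submitDescription("Scorer", List.of("c001", "c002"), 0));
    }
}
```

The resulting text would be saved to a file and handed to `condor_submit`; generating one file per batch is one simple way to give each set of queries to a different machine.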
Distributed Scorer
Here are some tests of our application on Condor:

[Figure: examples of Submit Description Files. Condor uses these files in the
matchmaking process between the resources and the jobs.]

[Figure: our Condor pool during the tests]
Distributed Scorer


[Figure: our application submits the jobs…]
[Figure: we can check the status of our jobs…]
[Figure: if we have more jobs, we can check the status of our resources…]

Now we have to check which results we achieved with this Distributed Scorer.
What is better?
New Results Achieved
(1) We have excellent workload balancing and efficient resource utilization…

(2) We can see how it is possible to increase the number of computations in a
period of time (using high-throughput computing). It works even better with a
larger sample of companies (more scalability).

[Figure: series 1 PC, 50 PCs and 500 PCs over 10, 50 and 500 hours (axis from 0 to 7,000,000)]

[Figure: seconds for 56,000, 250,000 and 500,000 companies with 1 PC, 50 PCs and
500 PCs (axis from 0 to 3,000,000), annotated “Marginal Gain”]
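
As a rough cross-check of these charts, the ideal running time can be computed by splitting the companies evenly across the machines. This sketch assumes the 5-second average from the centralized scorer and ignores scheduling overhead, data transfer and machines that become unavailable, so real figures would be higher.

```java
// Ideal-case estimate: companies are split evenly and all machines work in parallel.
final class ParallelEstimate {

    // Average time per company reported for the centralized scorer.
    static final int SECONDS_PER_COMPANY = 5;

    // The busiest machine analyses ceil(companies / machines) companies.
    static long idealSeconds(long companies, int machines) {
        long perMachine = (companies + machines - 1) / machines;
        return perMachine * SECONDS_PER_COMPANY;
    }

    public static void main(String[] args) {
        long[] samples = {56_000L, 250_000L, 500_000L};
        int[] pools = {1, 50, 500};
        for (long companies : samples) {
            for (int machines : pools) {
                System.out.printf("%d companies on %d PCs: %d s%n",
                        companies, machines, idealSeconds(companies, machines));
            }
        }
    }
}
```

Under these assumptions the time shrinks in proportion to the number of machines, which is why larger samples benefit the most in absolute terms.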
New Results Achieved
We can think of using Internet users' machines when they are idle… or
we can use the companies' machines, since the companies can use our Web-Spider for
direct marketing…
Make profit with your idle CPU cycles!

[Figure: yearly energy cost (axis from 0 € to 1,200,000 €) for 1, 50 and 500 machines.
Series: energy cost for the company every year (with its own machines); how the
energy cost increases in a year when using users' machines]

You can save a lot of money and a lot of energy!
From “local” business case to the big business case…
                     The Googleplex is the corporate
                     headquarters complex of Google, Inc.,
                     located at 1600 Amphitheatre Parkway
                     in Mountain View, Santa Clara County,
                     California, near San Jose.
                     Google purchased some of Silicon
                     Graphics' properties, including the
                     Googleplex, for $319 million.
                     In late 2006 and early 2007 the company
                     installed a series of solar panels, capable
                     of producing 1.6 megawatts of
                     electricity. At the time, it was believed to
                     be the largest corporate installation in
                     the United States. About 30 percent of
                     the Googleplex's electricity needs will be
                     fulfilled by this project, with the
                     remainder being purchased.

Mais conteúdo relacionado

Mais procurados

Cloud Computing - Foundations, Perspectives & Challenges
Cloud Computing - Foundations, Perspectives & ChallengesCloud Computing - Foundations, Perspectives & Challenges
Cloud Computing - Foundations, Perspectives & Challenges
Prasad Chitta
 
Assessing no sql databases for telecom applications
Assessing no sql databases for telecom applicationsAssessing no sql databases for telecom applications
Assessing no sql databases for telecom applications
João Gabriel Lima
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
Jazz Yao-Tsung Wang
 
Cloud computing security through symmetric cipher model
Cloud computing security through symmetric cipher modelCloud computing security through symmetric cipher model
Cloud computing security through symmetric cipher model
ijcsit
 
Iirdem a novel approach for enhancing security in multi cloud environment
Iirdem a novel approach for enhancing security in multi  cloud environmentIirdem a novel approach for enhancing security in multi  cloud environment
Iirdem a novel approach for enhancing security in multi cloud environment
Iaetsd Iaetsd
 
Managing A Cloud Environment: How To Get Started And Which Way To Go
Managing A Cloud Environment: How To Get Started And Which Way To Go Managing A Cloud Environment: How To Get Started And Which Way To Go
Managing A Cloud Environment: How To Get Started And Which Way To Go
talemadi
 
Cloud computing security_perspective
Cloud computing security_perspectiveCloud computing security_perspective
Cloud computing security_perspective
solaigoundan
 
FYP%3A+Peer-to-Peer+Communication+Framework+on+Android+Platform
FYP%3A+Peer-to-Peer+Communication+Framework+on+Android+PlatformFYP%3A+Peer-to-Peer+Communication+Framework+on+Android+Platform
FYP%3A+Peer-to-Peer+Communication+Framework+on+Android+Platform
Tianwei_liu
 

Mais procurados (20)

484 488
484 488484 488
484 488
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Paper id 27201433
Paper id 27201433Paper id 27201433
Paper id 27201433
 
Fundamental concepts and models
Fundamental concepts and modelsFundamental concepts and models
Fundamental concepts and models
 
Cloud Computing - Foundations, Perspectives & Challenges
Cloud Computing - Foundations, Perspectives & ChallengesCloud Computing - Foundations, Perspectives & Challenges
Cloud Computing - Foundations, Perspectives & Challenges
 
Assessing no sql databases for telecom applications
Assessing no sql databases for telecom applicationsAssessing no sql databases for telecom applications
Assessing no sql databases for telecom applications
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
 
119 125
119 125119 125
119 125
 
489 493
489 493489 493
489 493
 
iStart hitchhikers guide to cloud computing
iStart hitchhikers guide to cloud computingiStart hitchhikers guide to cloud computing
iStart hitchhikers guide to cloud computing
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
High Performance Distributed Computing with DDS and Scala
High Performance Distributed Computing with DDS and ScalaHigh Performance Distributed Computing with DDS and Scala
High Performance Distributed Computing with DDS and Scala
 
Cloud Computing Hype or Next Big Thing
Cloud Computing Hype or Next Big ThingCloud Computing Hype or Next Big Thing
Cloud Computing Hype or Next Big Thing
 
Cloud computing security through symmetric cipher model
Cloud computing security through symmetric cipher modelCloud computing security through symmetric cipher model
Cloud computing security through symmetric cipher model
 
An Overview To Cloud Computing
An Overview To Cloud ComputingAn Overview To Cloud Computing
An Overview To Cloud Computing
 
Iirdem a novel approach for enhancing security in multi cloud environment
Iirdem a novel approach for enhancing security in multi  cloud environmentIirdem a novel approach for enhancing security in multi  cloud environment
Iirdem a novel approach for enhancing security in multi cloud environment
 
BCBSA Summit - Cloud Computing Issues (Dec 2012)
BCBSA Summit - Cloud Computing Issues (Dec 2012)BCBSA Summit - Cloud Computing Issues (Dec 2012)
BCBSA Summit - Cloud Computing Issues (Dec 2012)
 
Managing A Cloud Environment: How To Get Started And Which Way To Go
Managing A Cloud Environment: How To Get Started And Which Way To Go Managing A Cloud Environment: How To Get Started And Which Way To Go
Managing A Cloud Environment: How To Get Started And Which Way To Go
 
Cloud computing security_perspective
Cloud computing security_perspectiveCloud computing security_perspective
Cloud computing security_perspective
 
FYP%3A+Peer-to-Peer+Communication+Framework+on+Android+Platform
FYP%3A+Peer-to-Peer+Communication+Framework+on+Android+PlatformFYP%3A+Peer-to-Peer+Communication+Framework+on+Android+Platform
FYP%3A+Peer-to-Peer+Communication+Framework+on+Android+Platform
 

Destaque

Bbc three, young people and faith
Bbc three, young people and faithBbc three, young people and faith
Bbc three, young people and faith
Ruth Deller
 
Vancouver real estate february 2012 stats package rebgv
Vancouver real estate february 2012 stats package rebgvVancouver real estate february 2012 stats package rebgv
Vancouver real estate february 2012 stats package rebgv
Matt Collinge
 
Biotechnology (Malay)
Biotechnology (Malay)Biotechnology (Malay)
Biotechnology (Malay)
Maliney Pohs
 
Web development(kewal)
Web development(kewal)Web development(kewal)
Web development(kewal)
Kewal Pradhan
 

High-throughput computing and opportunistic computing for matchmaking processes and indexing processes

  • 1. University of Calabria. Bachelor thesis in Computer Engineering: High-throughput computing and opportunistic computing for matchmaking processes and indexing processes. Supervisor: Ing. Carlo Mastroianni. Bachelor Candidate: Silvio Sangineto (Matriculation Number: 83879). 2007-2008
  • 2. Contents: Introduction to the Thesis; Introduction to Distributed Systems; Introduction to the Grid, High-throughput Computing and opportunistic computing; Condor; Why Condor?; Introduction to Prototype Architecture; Centralized prototype architecture; Centralized Scorer; Results achieved; A possible solution: Distributed Scorer; Distributed Scorer; New Results achieved; From “local” business case to the big business case…
  • 3. Introduction to the Thesis. Creation of a distributed Web-Spider with particular attention to efficiency, scalability, energy saving and costs. Description: The goal of this project is to recover the URLs of Italian companies. This recovery is possible because we can use a customer database with general information such as VAT number, phone, emails, etc. This information can be matched with the Web-Site contents, so that we can find the official Web-Site of each company. Note: at present no complete list exists of the Italian companies that have a Web-Site. Why: Knowing the official Web-Site is very important because you can quickly find: contacts and emails for the company; updates and news previews; many descriptions of the company's activities; other information (e.g. history).
  • 4. Introduction to the Thesis. Boundary problems for my thesis: it is difficult to estimate how many companies have a Web-Site (coverage level); the Web-Site structures can have many non-standard parts (some Web-Sites may carry no information about VAT number, email, etc.); the updating of the database that contains the URLs must be able to catch the Web-Site of a new company and the new Web-Site of an old company; some privacy problems (e.g. email). Relevant problems for my thesis: load balancing and efficient resource utilization; scalability; costs; energy saving. General solution: usually, for the Web-Spiders that exist on the Web (e.g. Google), when more computational power is needed the company buys other servers to provide it.
  • 5. Introduction to the Thesis. We want to find an answer to the relevant problems in the “local” business case, so that these solutions can then be used for the “big” business case.
  • 6. Introduction to Distributed Systems. Definition: A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility. In our case we use a distributed system to have more computational power… Advantages of a distributed system: reliability; sharing of resources; aggregate computing power; scalability.
  • 7. Grid Computing, High-throughput computing and opportunistic computing. Grid Computing: Grids are intrinsically distributed and heterogeneous but must be viewed by the user (whether an individual or another computer) as a virtual environment with uniform access to resources. Much of Grid software technology addresses the issues of resource scheduling, quality of service, fault tolerance, decentralized control and security and so on, which enable the Grid to be perceived as a single virtual platform by the user. High-throughput computing: The goal of a high-throughput computing environment is to provide large amounts of fault-tolerant computational power over prolonged periods of time by effectively utilizing all resources available to the network. Opportunistic computing: The goal of opportunistic computing is the ability to utilize resources whenever they are available, without requiring 100% availability. The two goals are naturally coupled. High-throughput computing is most easily achieved through opportunistic means.
  • 8. Condor. Modern processing environments that consist of large collections of workstations interconnected by high capacity networks raise the following challenging question: can we satisfy the needs of users who need extra capacity without lowering the quality of service experienced by the owners of underutilized workstations? . . . The Condor scheduling system is our answer to this question. At the University of Wisconsin, Miron Livny combined his 1983 doctoral thesis on cooperative processing with the powerful Crystal Multicomputer designed by DeWitt, Finkel, and Solomon and the novel Remote UNIX software designed by Litzkow. The result was Condor, a new system for distributed computing. The goal of the Condor Project is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing and opportunistic computing on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Condor Team has been building software tools that enable scientists and engineers to increase their computing throughput. Condor is a middleware that allows users to join and use the distributed resources.
  • 9. Condor. Condor is a specialized job and resource management system (RMS) for compute-intensive jobs. Like other full-featured systems, Condor provides a job management mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their jobs to Condor, and Condor subsequently chooses when and where to run them based upon a policy, monitors their progress, and ultimately informs the user upon completion. Two very important mechanisms: (1) ClassAds: the ClassAd mechanism in Condor provides an extremely flexible and expressive framework for matching resource requests (e.g. jobs) with resource offers (e.g. machines). (2) Remote system calls: when running jobs on remote machines, Condor can often preserve the local execution environment via remote system calls. Remote system calls are one of Condor's mobile sandbox mechanisms for redirecting all of a job's I/O-related system calls back to the machine that submitted the job. Therefore, users do not need to make data files available on remote workstations before Condor executes their programs there, even in the absence of a shared file system.
  • 10. Condor. How does Condor work? This is an example [an agent (A) is shown executing a job on a resource (R) with the help of a matchmaker (M)]: Step 1: The agent and the resource advertise themselves to the matchmaker. Step 2: The matchmaker informs the two parties that they are potentially compatible. Step 3: The agent contacts the resource and executes a job. This figure shows the major processes in a Condor system.
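The three steps above can be illustrated with a minimal Python sketch. It is not Condor's ClassAd language or API: the `Party` class, the attribute names (`has_java`, `memory_mb`, `universe`) and the `matchmake` function are hypothetical, chosen only to show two-sided matching in which each party's requirements must accept the other party's advertisement.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Ad = Dict[str, object]  # an advertisement: attribute name -> value


@dataclass
class Party:
    """An agent or a resource that advertises itself (hypothetical model)."""
    name: str
    ad: Ad
    requirements: Callable[[Ad], bool]  # predicate over the other party's ad


def matchmake(agents: List[Party], resources: List[Party]) -> List[Tuple[str, str]]:
    """Pair each agent with the first free resource that it accepts
    and that accepts it in return (steps 1 and 2)."""
    free = list(resources)
    matches = []
    for agent in agents:
        for resource in free:
            if agent.requirements(resource.ad) and resource.requirements(agent.ad):
                matches.append((agent.name, resource.name))
                free.remove(resource)  # a matched resource is no longer offered
                break
    return matches


# Step 1: both sides advertise. Attribute names are assumptions.
agent = Party(
    "A",
    {"universe": "java"},
    lambda r: bool(r.get("has_java")) and int(r.get("memory_mb", 0)) >= 512,
)
r1 = Party("R1", {"has_java": False, "memory_mb": 2048}, lambda a: True)
r2 = Party("R2", {"has_java": True, "memory_mb": 2048},
           lambda a: a.get("universe") == "java")

# Step 2: the matchmaker reports compatible pairs; step 3 (running the job)
# would then happen directly between the agent and the resource.
pairs = matchmake([agent], [r1, r2])
```

In this sketch the matchmaker only introduces the two parties; it does not run the job itself, which mirrors the division of roles described in the slide.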
  • 11. Condor. What happens when you have several Condor pools? This is an example [an agent (A) is shown executing a job on a resource (R) via direct flocking]: Step 1: The agent and the resource advertise themselves locally. Step 2: The agent is unsatisfied, so it also advertises itself to Condor Pool B. Step 3: The matchmaker (M) informs the two parties that they are potentially compatible. Step 4: The agent contacts the resource and executes a job.
  • 12. Condor. Condor Universe: Condor has several runtime environments (called universes) from which to choose. The Java Universe was the best choice for this first version of our project, because it let us take advantage of portability (heterogeneous systems) and it was good for the “local” business case. A universe for Java programs was added to Condor in late 2001. This was due to a growing community of scientific users that wished to perform simulations and other work in Java. Although such programs might run slower than native code, such losses were offset by faster development times and access to larger numbers of machines.
  • 13. Why Condor? We used Condor for the following reasons: 1) efficient resource management (opportunistic computing and high-throughput computing, ClassAds, etc.); 2) it is a middleware for heterogeneous distributed systems (e.g. we can use different types of operating systems); 3) it is an open source project and it is used as a batch system in many projects around the world; 4) flexibility.
  • 14. Introduction to Centralized Prototype Architecture. [Architecture diagram. Components: Web-Sites, Crawler, Index, Customer Data-Base, Scorer, Validator (Manual), Updater, Data-Base (URLs). Labelled flows: Identifying information, Make Query, Index Results, Candidates, New Companies, New Web-Sites.]
  • 15. Introduction to Centralized Prototype Architecture. Crawler: The prototype Web-Spider must have a Crawler that builds an index of the companies' Web-Sites (e.g. UbiCrawler). We can either rent an existing Crawler or build a new one on the basis of several products that are already available (Nutch, Heritrix, JSpider, etc.). In this business case we used the data extracted through UbiCrawler. For the indexing processes we used Managing Gigabytes for Java (MG4J). Customer Data-Base: This database contains the identifying information about the companies: VAT number, phone, emails, company name, sign, etc. Scorer: In this step several queries and many matchmaking processes are executed to find the right “match” between the identifying information and the companies' Web-Sites. Each match will have a score.
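The Scorer idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the thesis code: the field names, the weights and the plain substring test are hypothetical, whereas the prototype runs its queries against an MG4J index.

```python
from typing import Dict


def score_page(company: Dict[str, str], page_text: str,
               weights: Dict[str, float]) -> float:
    """Sum the weight of every identifying field whose value occurs in the page.

    `weights` maps a field name to the score it contributes when matched;
    both the field names and the weights are illustrative assumptions.
    """
    text = page_text.lower()
    score = 0.0
    for field_name, weight in weights.items():
        value = company.get(field_name, "").strip().lower()
        if value and value in text:  # empty fields never match
            score += weight
    return score


# Hypothetical company record and page content.
company = {
    "vat_number": "01234567890",
    "phone": "0984 000000",
    "company_name": "Example Srl",
}
weights = {"vat_number": 3.0, "phone": 2.0, "company_name": 1.0}
page = "Example Srl - P.IVA 01234567890 - Contact us"

score = score_page(company, page, weights)
```

Here the VAT number and the company name are found on the page but the phone is not, so the candidate receives the sum of those two weights. A real scorer would also need to normalise phone formats and handle partial name matches.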
  • 16. Centralized Scorer. Class Diagram: these are the most important classes, where we can see the principal processes of the Web-Spider (together with the indexing processes).
  • 17. Centralized Scorer. The Centralized Scorer performs the following activities: [Diagram of the activities: Query (×3), Check (×2) and Score, over the phone and address, the VAT number, the name, the URL type and the other fields of the page.] All these activities are completed in about 5 seconds on average, which is the time needed to complete the analysis of one company. If you have to analyse 56,000 companies, you have to wait about 280,000 seconds. This is a big problem, because the number of companies can be very high.
  • 18. Centralized Scorer. We can glance at the Java code that implements some functions: [Code screenshot: AssociaDomini constructor.] This class mainly implements the logic that allows the “match” between the identifying information and the companies' Web-Sites.
  • 19. Centralized Scorer. [Code screenshots: associa() method, parts 1/3 and 2/3.]
  • 20. Centralized Scorer. [Code screenshot: associa() method, part 3/3.] How the associa() method records the results in a log file. We preferred to use Hibernate because it is an open source Java persistence framework that performs powerful object-relational mapping.
  • 21. Results Achieved. On a sample of 56000 companies. Coverage by query, as count and percentage: Sign 2747 (4.43%); Phone 25715 (41.47%); VAT Number 4369 (7.05%); Company Name 27487 (44.33%). Precision by query: Sign 1%; Phone 25%; VAT Number 55%; Company Name 3%. Phone and VAT Number: these types of query are very good for the coverage and for the reliability. Sign: low coverage. Company Name: very good coverage but low precision. How many companies can you cover with these queries? What precision can you achieve?
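One way to read the two sets of figures together is to multiply each coverage count by its precision. The sketch below does this under the assumption, not stated in the slides, that precision is the share of covered companies whose match is correct; the resulting numbers are rough estimates, not results reported in the thesis.

```python
# Coverage counts and precision percentages as listed on the slide.
coverage = {"Sign": 2747, "Phone": 25715, "VAT Number": 4369, "Company Name": 27487}
precision_pct = {"Sign": 1, "Phone": 25, "VAT Number": 55, "Company Name": 3}


def expected_correct(count: int, pct: int) -> int:
    """Estimated number of correct matches: covered companies times precision."""
    return round(count * pct / 100)


estimates = {query: expected_correct(coverage[query], precision_pct[query])
             for query in coverage}

for query, value in estimates.items():
    print(f"{query}: about {value} correct matches")
```

Under this reading the Phone query would contribute the most correct matches despite its moderate precision, while the Company Name query covers many companies but yields few reliable matches.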
  • 22. Results Achieved. For a sample of 56,000 companies: 1 personal computer works for about 77 hours (only for this computation). [Chart: Trend (s), computation time in seconds (axis from 0 to 300000) for 500, 1000, 10000 and 56000 companies.] Personal computer used: desktop computer, Intel Dual Core 2.4 GHz, 2 GB of RAM and 1 TB hard disk. Possible problems: the personal computer goes down; there are new companies (updating) or some companies' Web-Sites have changed, in which case the computation must continue… The matchmaking processes and indexing processes recur frequently over time. For 1,000,000 companies to analyse: 1 personal computer works for about 1389 hours (roughly 58 days, at 5 seconds per company). It's an ideal case… This isn't a scalable solution!
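The sequential estimates can be checked with a short calculation. The sketch assumes only the average of 5 seconds per company stated for the Centralized Scorer; the function names are illustrative.

```python
SECONDS_PER_COMPANY = 5  # average stated for the Centralized Scorer


def sequential_seconds(companies: int,
                       seconds_per_company: float = SECONDS_PER_COMPANY) -> float:
    """Total time when one machine analyses the companies one after another."""
    return companies * seconds_per_company


def to_hours(seconds: float) -> float:
    return seconds / 3600


def to_days(seconds: float) -> float:
    return seconds / 86400


for n in (56_000, 1_000_000):
    s = sequential_seconds(n)
    print(f"{n} companies: {s:.0f} s = {to_hours(s):.1f} h = {to_days(s):.1f} days")
```

For 56,000 companies this gives 280,000 seconds, a little under 78 hours. For 1,000,000 companies it gives 5,000,000 seconds, which is about 1389 hours or roughly 58 days of uninterrupted work on a single machine.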
  • 23. A possible solution: Distributed Scorer. We want to make a scalable solution for our Web-Spider. There are some important constraints that we have to respect: 1) energy saving; 2) efficient resource management and efficient resource utilization; 3) cost cutting; 4) analysing more companies over a long period of time. We can submit each set of queries to a different computer!
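Submitting each set of queries to a different computer presupposes that the company list is split into independent work units. A minimal sketch of that partitioning follows; the function name and the batch size are assumptions, and the slides do not say how many companies each job handles.

```python
from typing import List, Sequence


def split_into_jobs(company_ids: Sequence[int],
                    companies_per_job: int) -> List[List[int]]:
    """Partition the company list into consecutive batches, one batch per job."""
    if companies_per_job <= 0:
        raise ValueError("companies_per_job must be positive")
    return [list(company_ids[i:i + companies_per_job])
            for i in range(0, len(company_ids), companies_per_job)]


# Ten hypothetical company identifiers, four companies per job.
jobs = split_into_jobs(range(10), 4)
print(jobs)
```

Because the analysis of one company does not depend on any other, the batches can run in any order and on any machine, which is what makes the workload suitable for opportunistic scheduling.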
  • 24. Distributed Scorer. We built a distributed scorer using the Condor middleware. This is a possible architecture on which to run our Distributed Scorer. [Figure: example of architecture used by the National Institute of Nuclear Physics.]
  • 25. Distributed Scorer. We built a wrapper class to prepare the work environment on Condor. This class implements the connection logic between the application and Condor, and it runs on the server (Central Manager). We used both vertical distribution and horizontal distribution.
  • 26. Distributed Scorer. [Diagram: Builder, makeFile(impresa), Central Manager, Execute(), Job (×4).]
  • 27. Distributed Scorer. We can see some tests of our application on Condor. [Screenshot: some examples of Submit Description Files; Condor uses these files for the matchmaking processes between the resources and the jobs.] [Screenshot: our Condor Pool during the tests.]
  • 28. Distributed Scorer. Our application submits the jobs… We can check the status of our jobs… If we have more jobs… we can check the status of our resources… We can check the status of our jobs… Now we have to check which results we achieved with this Distributed Scorer. Which is better?
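The submit description files themselves are not reproduced in this transcript. As a hedged illustration, the sketch below generates a plausible file for a Java universe job. The submit commands used (`universe`, `executable`, `arguments`, `transfer_input_files`, `should_transfer_files`, `when_to_transfer_output`, `output`, `error`, `log`, `queue`) are standard Condor commands, but the class name `Scorer`, the input file name and the output naming scheme are hypothetical and not taken from the thesis.

```python
def make_submit_description(main_class: str, company_file: str, job_id: int) -> str:
    """Build the text of a Condor submit description file for one Java job.

    In the Java universe the executable is the .class file and the first
    argument is the name of the class that contains main().
    """
    lines = [
        "universe = java",
        f"executable = {main_class}.class",
        f"arguments = {main_class} {company_file}",
        f"transfer_input_files = {company_file}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"output = scorer_{job_id}.out",
        f"error = scorer_{job_id}.err",
        f"log = scorer_{job_id}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical class and input file names.
submit_text = make_submit_description("Scorer", "companies_0001.txt", 1)
print(submit_text)
```

A wrapper running on the submit machine could write one such file per batch of companies and pass it to `condor_submit`; whether the thesis wrapper works exactly this way is not stated in the slides.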
  • 29. New Results Achieved. (1) We have excellent workload balancing and efficient resource utilization… (2) We can see how it is possible to increase the number of computations in a period of time (using high-throughput computing). It works even better with a larger sample of companies (+ scalability!). [Chart: 1 PC, 50 PCs and 500 PCs over 10 hours, 50 hours and 500 hours; axis from 0 to 7000000.] [Chart: seconds for 56000 companies, seconds for 250000 companies and seconds for 500000 companies on 1 PC, 50 PCs and 500 PCs, with the marginal gain; axis from 0 to 3000000.]
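The effect of adding machines can be approximated with an idealised model. The sketch assumes 5 seconds per company, perfectly even load balancing and no scheduling or transfer overhead, so it gives a lower bound on the elapsed time and not the measured values plotted in the charts.

```python
import math


def ideal_parallel_seconds(companies: int, machines: int,
                           seconds_per_company: float = 5.0) -> float:
    """Elapsed time when the companies are spread evenly over the machines.

    Idealised: ignores scheduling, file transfer and machine unavailability.
    """
    if machines <= 0:
        raise ValueError("machines must be positive")
    per_machine = math.ceil(companies / machines)  # the busiest machine's share
    return per_machine * seconds_per_company


for machines in (1, 50, 500):
    for companies in (56_000, 250_000, 500_000):
        seconds = ideal_parallel_seconds(companies, machines)
        print(f"{machines} PCs, {companies} companies: {seconds:.0f} s")
```

In this model 56,000 companies drop from 280,000 seconds on one PC to 5,600 seconds on 50 PCs. In an opportunistic pool the real figures would be higher, since machines join and leave as their owners use them.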
  • 30. New Results Achieved. We can think of using Internet users' machines when they are idle… or we can use the companies' machines, because the companies can use our Web-Spider for direct marketing… Make profit with your idle CPU cycles! You can save a lot of money and a lot of energy. [Chart: yearly energy cost in € (axis from 0 € to 1,200,000 €) for 1 machine, 50 machines and 500 machines. Series: "Energy cost for the Company (every year) (with owner machines)" and "How does the energy cost increase in a year? (using users' machines)".]
  • 31. From “local” business case to the big business case… The Googleplex is the corporate headquarters complex of Google, Inc., located at 1600 Amphitheatre Parkway in Mountain View, Santa Clara County, California, near San Jose. Google purchased some of Silicon Graphics' properties, including the Googleplex, for $319 million. In late 2006 and early 2007 the company installed a series of solar panels, capable of producing 1.6 megawatts of electricity. At the time, it was believed to be the largest corporate installation in the United States. About 30 percent of the Googleplex's electricity needs will be fulfilled by this project, with the remainder being purchased.