High-throughput computing and opportunistic computing for matchmaking processes and indexing processes
1. University of Calabria
Bachelor Thesis in Computer Engineering
High-throughput computing and opportunistic computing for matchmaking processes and indexing processes
Supervisor: Ing. Carlo Mastroianni
Candidate: Silvio Sangineto (Matriculation Number: 83879)
Academic Year 2007-2008
2. Contents
Introduction to the Thesis
Introduction to Distributed Systems
Introduction to the Grid, High-throughput Computing and opportunistic computing
Condor
Why Condor?
Introduction to Prototype Architecture
Centralized prototype architecture
Centralized Scorer
Results achieved
A possible solution: Distributed Scorer
Distributed Scorer
New Results achieved
From “local” business case to the big business case…
3. Introduction to the Thesis
Goal: creation of a distributed Web-Spider, with particular attention to efficiency, scalability, energy saving and costs.
Description:
The goal of this project is to recover the URLs of Italian companies' web sites. Currently there is no complete list of the Italian companies that have a web site!
This recovery is possible because we can use a customer database with general identifying information: VAT number, phone, e-mails, etc. This information can be matched against the web-site contents, so we can find the official web site of each company.
Why:
Knowing the official web site is very important because you can quickly learn:
• its contacts and e-mail addresses;
• updates and news previews;
• many descriptions of the company's activities;
• other information (e.g. its history).
4. Introduction to the Thesis
Boundary conditions for the thesis:
• It is difficult to estimate how many companies have a web site (coverage level);
• The web-site structures can have many non-standard parts (some web sites may not publish the VAT number, e-mail address, etc.);
• The updating of the database that contains the URLs must make it possible to catch both the web site of a new company and the new web site of an existing company;
• There are some privacy issues (e.g. e-mail addresses).
Relevant problems for the thesis:
Usually, when the Web-Spiders that exist on the Web (e.g. Google's) need to increase their computational power, the company buys more servers to provide it (the general solution)! We instead focus on:
• load balancing and efficient resource utilization;
• scalability;
• costs;
• energy saving.
5. Introduction to the Thesis
We want to find an answer to the relevant problems in the “local” business case, so that these solutions can then be applied to the “big” business case!
6. Introduction to Distributed Systems
Definition:
A distributed system consists of a collection of autonomous computers, connected through a network and
distribution middleware, which enables computers to coordinate their activities and to share the resources
of the system, so that users perceive the system as a single, integrated computing facility.
In our case we use a
distributed system to have
more computational power…
Advantages of a distributed system:
• Reliability;
• Sharing of resources;
• Aggregate computing power;
• Scalability.
7. Grid Computing, High-throughput
computing and opportunistic computing
Grid Computing: Grids are intrinsically distributed and heterogeneous but must be viewed by
the user (whether an individual or another computer) as a virtual environment with uniform
access to resources. Much of Grid software technology addresses the issues of resource
scheduling, quality of service, fault tolerance, decentralized control and security and so on, which
enable the Grid to be perceived as a single virtual platform by the user.
High-throughput computing:
The goal of a high-throughput computing environment is to provide large amounts of fault-tolerant computational power over prolonged periods of time by effectively utilizing all resources available to the network.
Opportunistic computing:
The goal of opportunistic computing is the ability to utilize resources whenever they are available, without requiring 100% availability.
The two goals are naturally coupled: high-throughput computing is most easily achieved through opportunistic means.
8. Condor
Modern processing environments that consist of large collections of workstations interconnected
by high capacity network raise the following challenging question: can we satisfy the needs of
users who need extra capacity without lowering the quality of service experienced by the owners of
under utilized workstations? . . . The Condor scheduling system is our answer to this question.
At the University of Wisconsin, Miron Livny combined his 1983 doctoral thesis on
cooperative processing with the powerful Crystal Multicomputer designed by
DeWitt, Finkel, and Solomon and the novel Remote UNIX software designed by
Litzkow. The result was Condor, a new system for distributed computing.
The goal of the Condor Project is to develop, implement, deploy, and evaluate
mechanisms and policies that support High Throughput Computing and
opportunistic computing on large collections of distributively owned computing
resources. Guided by both the technological and sociological challenges of such a
computing environment, the Condor Team has been building software tools that
enable scientists and engineers to increase their computing throughput. Condor is a middleware that allows users to join and use the distributed resources.
9. Condor
Condor is a specialized job and resource management system (RMS) for compute-intensive jobs. Like other full-featured systems, Condor provides a job
management mechanism, scheduling policy, priority scheme, resource monitoring, and
resource management. Users submit their jobs to Condor, and Condor subsequently
chooses when and where to run them based upon a policy, monitors their progress, and
ultimately informs the user upon completion.
Two very important mechanisms:
ClassAds: The ClassAd mechanism in Condor provides an extremely flexible and
expressive framework for matching resource requests (e.g. jobs) with resource offers
(e.g. machines)
RemoteSystemCalls: When running jobs on remote machines, Condor can often
preserve the local execution environment via remote system calls. Remote system calls is
one of Condor’s mobile sandbox mechanisms for redirecting all of a job's I/O-related
system calls back to the machine that submitted the job. Therefore, users do not need to
make data files available on remote workstations before Condor executes their programs
there, even in the absence of a shared file system.
10. Condor
How does Condor work?
This is an example [an agent (A) is shown executing a job on a resource (R) with the help of a matchmaker (M)]:
Step 1: The agent and the resource advertise themselves to the matchmaker.
Step 2: The matchmaker informs the two parties that they are potentially compatible.
Step 3: The agent contacts the resource and executes a job.
[Figure: the major processes in a Condor system.]
11. Condor
What happens when you have more Condor pools?
This is an example [an agent (A) is shown executing a job on a resource (R) via direct flocking]:
Step 1: The agent and the resource advertise themselves locally.
Step 2: The agent is unsatisfied, so it also advertises itself to Condor Pool B.
Step 3: The matchmaker (M) informs the two parties that they are potentially compatible.
Step 4: The agent contacts the resource and executes a job.
12. Condor
Condor Universe:
Condor offers several runtime environments (called universes) from which to choose. The Java universe was the best choice for our project (in this first version): it let us take advantage of portability in a heterogeneous system, and it was adequate for the “local” business case. A universe for Java programs was added to Condor in late 2001, in response to a growing community of scientific users that wished to perform simulations and other work in Java. Although such programs might run slower than native code, such losses are offset by faster development times and access to larger numbers of machines.
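As an illustration only (this is not the thesis's actual submit file: the class, jar and input-file names are invented), a minimal Condor submit description file for a Java-universe job could look like the following; the requirements and rank lines are the ClassAd expressions that the matchmaker evaluates against the attributes advertised by each machine:

# Hypothetical submit description file for one Scorer job (Java universe)
universe   = java
executable = Scorer.class           # compiled class containing main()
arguments  = Scorer chunk_001.csv   # first argument = name of the main class
jar_files  = scorer.jar             # application code shipped with the job

# ClassAd expressions evaluated by the matchmaker
requirements = (Arch == "INTEL" || Arch == "X86_64") && Memory >= 512
rank         = KFlops               # prefer faster machines

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = chunk_001.csv

output = scorer_001.out
error  = scorer_001.err
log    = scorer_001.log
queue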
13. Why Condor?
• We used Condor for the following reasons (among others):
1) efficient resource management (opportunistic computing and high-throughput computing, ClassAds, etc.);
2) it is a middleware for heterogeneous distributed systems (e.g. we can use different operating systems);
3) it is an open-source project and it is used as a batch system in many projects around the world;
4) flexibility.
14. Introduction to Centralized Prototype
Architecture
[Architecture diagram: the Crawler indexes the companies' Web-Sites and builds the Index; the Scorer queries the Index with the identifying information taken from the Customer Data-Base; the candidate matches (new companies, new web sites) are checked by the Validator and stored in the Data-Base, which also receives manual URLs and is kept current by the Updater.]
15. Introduction to Centralized Prototype
Architecture
Crawler:
The prototype Web-Spider must have a Crawler that builds an index of the companies' web sites (e.g. UbiCrawler). This Crawler can be rented, or we can build a new one on the basis of several existing products (Nutch, Heritrix, JSpider, etc.). In this business case we used the data extracted through UbiCrawler. For the indexing processes we used Managing Gigabytes for Java (MG4J).
Customer Data-Base:
This database contains the identifying information about the companies: VAT number, phone, e-mails, company name, sign, etc.
Scorer:
In this step several queries are executed and many matchmaking processes take place to find the right “match” between the identifying information and the companies' web sites. Each match is given a score.
16. Centralized Scorer
Class diagram: these are the most important classes, where we can see the principal processes of the Web-Spider (together with the indexing processes).
17. Centralized Scorer
Inside the Centralized Scorer we have the following activities:
• query over the phone and VAT number;
• query over the address;
• query over the other fields;
• check over the URL name;
• check over the page type;
• score computation.
All these activities are completed in about 5 seconds per company (on average), so to complete the analysis of one company you have to wait this time. If you have to analyse 56,000 companies, you have to wait about 56,000 × 5 s = 280,000 seconds (roughly 78 hours)!
There is a big problem: the number of companies can be very high!
18. Centralized Scorer
We can glance at the Java code that implements some of the functions: the AssociaDomini constructor. This class mainly implements the logic that “matches” the identifying information against the companies' web sites.
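As a purely illustrative sketch (the names DomainMatcher, CompanyRecord and scoreMatch are invented for this example and do not come from the thesis code; the weights are loosely inspired by the precision figures reported later), this kind of matching logic could look roughly like the following in Java:

// Hypothetical sketch: score how well a crawled page matches a company's
// identifying information (VAT number, phone, company name, sign).
public class DomainMatcher {

    /** Simple container for the identifying information of one company. */
    public static class CompanyRecord {
        public String vatNumber;
        public String phone;
        public String companyName;
        public String sign;
    }

    /**
     * Returns a score for the match between a company and the text of a
     * candidate web page: each identifying field found on the page adds a
     * weight reflecting how reliable that field is (weights are illustrative).
     */
    public static int scoreMatch(CompanyRecord company, String pageText) {
        String text = pageText.toLowerCase();
        int score = 0;
        if (company.vatNumber != null && text.contains(company.vatNumber)) {
            score += 55;   // VAT number: rarely published but highly reliable
        }
        if (company.phone != null && text.contains(company.phone)) {
            score += 25;   // phone number: good coverage and precision
        }
        if (company.companyName != null
                && text.contains(company.companyName.toLowerCase())) {
            score += 3;    // company name: wide coverage but low precision
        }
        if (company.sign != null && text.contains(company.sign.toLowerCase())) {
            score += 1;    // sign: low coverage and low precision
        }
        return score;
    }
}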
20. Centralized Scorer
3/3: How the method associa() records the results in a log file.
We preferred to use Hibernate because it is an open-source Java persistence framework that performs powerful object-relational mapping.
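For illustration only (the MatchResult entity and its fields are invented for this example, and a corresponding Hibernate mapping is assumed), the typical Hibernate pattern for persisting one result looks roughly like this:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.cfg.Configuration;

// Hypothetical example: persist one company/URL match with Hibernate,
// letting the object-relational mapping generate the SQL for us.
public class ResultWriter {

    /** Hypothetical entity holding one match and its score (mapping assumed). */
    public static class MatchResult {
        private Long id;
        private String vatNumber;
        private String url;
        private int score;
        // getters and setters omitted for brevity
    }

    private final SessionFactory sessionFactory =
            new Configuration().configure().buildSessionFactory();

    public void save(MatchResult result) {
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            session.save(result);   // mapped to an INSERT by Hibernate
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();          // keep the database consistent on failure
            throw e;
        } finally {
            session.close();
        }
    }
}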
21. Results Achieved
On a sample of 56,000 companies:

Coverage of the queries:
Query          Coverage (#)   Coverage (%)
Sign           2,747          4.43%
Phone          25,715         41.47%
VAT Number     4,369          7.05%
Company Name   27,487         44.33%

Precision of the queries:
Query          Precision (%)
Sign           1%
Phone          25%
VAT Number     55%
Company Name   3%

Phone and VAT Number: these types of query are very good both for coverage and for reliability.
Sign: low coverage.
Company Name: very good coverage but low precision.

How many companies can you cover with these queries? What precision can you achieve?
22. Results Achieved
For a sample of 56,000 companies, one personal computer works for about 77 hours (for this computation alone).
[Chart: trend of the total time in seconds as the sample grows from 500 to 56,000 companies.]
Personal computer used: desktop, Intel Dual Core 2.4 GHz, 2 GB of RAM and 1 TB hard disk.
Possible problems:
• the personal computer goes down;
• there are new companies (updating), or some web sites change, and in that case the computation must continue…
The matchmaking and indexing processes are repeated frequently over time!
For 1,000,000 companies to analyse, one personal computer would work for about 1,389 hours (nearly 58 days), and even that is an ideal case…
This is not a scalable solution!
23. A possible solution:
Distributed Scorer
We want to build a scalable solution for our Web-Spider.
There are some important constraints that we have to respect:
1) energy saving;
2) efficient resource management and efficient resource utilization;
3) cost cutting;
4) having more companies analysed over a long period of time (high throughput).
We can submit each set of queries to a different computer! (A sketch of how the work can be split into per-job chunks follows.)
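As a rough illustration (the file names and the chunk size of 500 companies are assumptions, not taken from the thesis code), the company list can be split into fixed-size chunks, one input file per Condor job:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split the list of companies into fixed-size chunks,
// writing one input file per Condor job (chunk_000.csv, chunk_001.csv, ...).
public class WorkPartitioner {

    public static int partition(String companiesFile, int chunkSize) throws IOException {
        // Read one company per line from the customer database export.
        List<String> companies = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(companiesFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                companies.add(line);
            }
        } finally {
            in.close();
        }

        // Write one input file per Condor job.
        int chunks = 0;
        for (int start = 0; start < companies.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, companies.size());
            PrintWriter out = new PrintWriter(String.format("chunk_%03d.csv", chunks));
            try {
                for (String company : companies.subList(start, end)) {
                    out.println(company);
                }
            } finally {
                out.close();
            }
            chunks++;
        }
        return chunks;   // number of jobs to submit to Condor
    }

    public static void main(String[] args) throws IOException {
        // e.g. 56,000 companies with chunkSize = 500 -> 112 jobs
        System.out.println("Created " + partition("companies.csv", 500) + " chunks");
    }
}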
24. Distributed Scorer
We built a Distributed Scorer using the Condor middleware.
This is a possible architecture on which to execute our Distributed Scorer.
[Figure: example of an architecture used by the National Institute of Nuclear Physics (INFN).]
25. Distributed Scorer
We built a wrapper class to prepare the work environment on Condor. This class realizes the logical connection between the application and Condor, and it is run on the server (the Central Manager). We used both vertical distribution and horizontal distribution.
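For illustration only (the real wrapper class is not reproduced here; the file-naming scheme is an assumption), a minimal wrapper that submits one pre-generated submit description file per chunk from the Central Manager could look like this:

import java.io.File;
import java.io.IOException;

// Hypothetical sketch of a Condor wrapper running on the Central Manager:
// it submits one pre-generated submit description file per chunk (like the
// Java-universe example shown earlier) and lets the matchmaker place the jobs.
public class CondorWrapper {

    public static void submitAll(int chunkCount) throws IOException, InterruptedException {
        for (int i = 0; i < chunkCount; i++) {
            String submitFile = String.format("scorer_%03d.sub", i);
            if (!new File(submitFile).exists()) {
                throw new IOException("Missing submit description file: " + submitFile);
            }
            // condor_submit hands the job over to the Condor scheduler.
            Process p = new ProcessBuilder("condor_submit", submitFile).start();
            if (p.waitFor() != 0) {
                throw new IOException("condor_submit failed for " + submitFile);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        submitAll(Integer.parseInt(args[0]));   // e.g. java CondorWrapper 112
    }
}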
27. Distributed Scorer
We can see some tests of our application on Condor.
Some examples of submit description files: these files are used by Condor for the matchmaking processes between the resources and the jobs.
This is our Condor pool during the tests.
28. Distributed Scorer
Our application submits the jobs, and we can check the status of our jobs (e.g. with condor_q)… If we have more jobs, we can also check the status of our resources (e.g. with condor_status)…
Now we have to check which results we achieved with this Distributed Scorer! Which is better?
29. New Results Achieved
(1) We have excellent work-load balancing and efficient resource utilization…
(2) We can see how it is possible to increase the number of computations carried out in a given period of time (using high-throughput computing). It works even better with a larger sample of companies (+ scalability!).
[Chart 1: number of computations completed in 10, 50 and 500 hours with 1 PC, 50 PCs and 500 PCs.]
[Chart 2: seconds needed to analyse 56,000, 250,000 and 500,000 companies with 1 PC, 50 PCs and 500 PCs (marginal gain).]
30. New Results Achieved
We can think of using Internet users' machines when they are in an idle state… or we can use the companies' machines, because they can use our Web-Spider for direct marketing…
Make profit with your idle CPU cycles!
You can save a lot of money and a lot of energy!
[Chart: how the energy cost grows in a year with 1, 50 and 500 machines, comparing company-owned machines with users' machines.]
31. From “local” business case to the
big business case…
The Googleplex is the corporate
headquarters complex of Google, Inc.,
located at 1600 Amphitheatre Parkway
in Mountain View, Santa Clara County,
California, near San Jose.
Google purchased some of Silicon
Graphics' properties, including the
Googleplex, for $319 million.
In late 2006 and early 2007 the company
installed a series of solar panels, capable
of producing 1.6 megawatts of
electricity. At the time, it was believed to
be the largest corporate installation in
the United States. About 30 percent of
the Googleplex's electricity needs will be
fulfilled by this project, with the
remainder being purchased.