Web Search 101
Finding Lesson Plans, Activities, Songs, Games, and
Conducting Serious Academic Research
MADE EASIER, FASTER AND MORE ACCURATE
Developed By
William Tweedie
October 2011 & 2012
Table of Contents
Preface....................................................................................................................... 4
Objectives .................................................................................................................. 5
Materials: ................................................................................................................... 5
Timing: ....................................................................................................................... 5
Procedure................................................................................................................... 6
Part 1 – The Surface Web, Search Engines and Directories...................................... 6
A. Activating Prior Knowledge..................................................................................... 6
B. Search Engine – An online (Internet) World Wide Web search program................7
D. Search Queries ..................................................................................................... 8
FRAMING YOUR SEARCH STRATEGY.................................................................... 8
ACTIVITY:.................................................................................................................. 9
E. Basic Boolean Search Operators (AND, OR, NOT).............................................. 10
F. Search Tips, Tricks and Techniques..................................................................... 10
G. Wrap-up of Part 1................................................................................................. 10
Part 2 – The Hidden Web......................................................................................... 10
The Internet, World Wide Web and the Hidden Web................................................ 11
Scratching the Surface and Digging Deep – Layers of the Web............................ 12
Education.............................................................................................................. 14
Three Types of Search Engines .............................................................................. 18
Crawler-based search engines ............................................................................. 18
Human-powered directories ................................................................................. 19
Hybrid search engines ......................................................................................... 20
Table of Search Engine Features ......................................................................... 20
How do Search Engines Work?............................................................................ 22
Table of Directory Features................................................................................... 23
Subject Directories (Contain Databases), and Portals ......................................... 24
How to Find Subject-Focused Directories for a Specific Topic, Discipline, or Field
.............................................................................................................................. 24
What Are "Meta-Search" Engines? How Do They Work? ..................................... 25
Are "Smarter" Meta-Searchers Still Smarter?....................................................... 25
Better Meta-Searchers.......................................................................................... 25
Meta-Search Engines for SERIOUS Deep Digging .............................................. 26
Search Basics: Constructing a Google Query .......................................................... 26
Where does the term Boolean originate from?...................................................... 27
Is Boolean Search Complicated?.......................................................................... 27
Boolean Search And / Or / Not.............................................................................. 27
Boolean Search Examples Boolean Connectors:.................................................. 28
Interactive Text Equivalent.................................................................................... 28
How the Search Engines Differ............................................................................. 30
Search Engine Syntax & Features Comparison Chart ......................................... 30
Some Search Tips, Tricks, & Techniques ............................................................ 33
Invisible or Deep Web: What it is, How to find it, and its inherent ambiguity.........34
Why isn't everything visible?................................................................................. 34
How to Find the Invisible Web .............................................................................. 35
The Ambiguity Inherent in the Invisible Web: ....................................................... 35
Want to learn more about the Invisible Web?........................................................ 35
10 Search Engines to Explore the Invisible Web................................................... 36
How do we get to this mother lode of information?................................................ 36
The Invisible Web Databases................................................................................... 41
Dictionaries, Translators, & Other Language & Reference Tools ............................. 44
Web directories ........................................................................................................ 48
Internet Gateways, Jumplists, & Specialized Link Collections................................... 48
Finding Jumplists & Gateways.............................................................................. 49
www.invisible-web.net.............................................................................................. 49
Saving pages with Microsoft Internet Explorer ..................................................... 50
Peer-to-Peer Computing ...................................................................................... 50
Education ............................................................................................................. 50
Subject-orientated search services....................................................................... 52
Additional information about search engines, their use, and how they find
resources.............................................................................................................. 52
Data services requiring registration ...................................................................... 52
Data services with unrestricted access................................................................. 54
Search Engines .................................................................................................... 55
Subject-orientated search services....................................................................... 56
Dictionaries and Thesauri .................................................................................... 57
Reference Works ................................................................................................. 58
General Tips for Searching the Web......................................................................... 60
Carefully Select Your Search Terms..................................................................... 60
Framing your search strategy............................................................................... 60
International Educational Research Links................................................................. 62
Education databases................................................................................................ 64
Teaching websites.................................................................................................... 64
Journals.................................................................................................................... 65
Newsletters............................................................................................................... 65
New Educational Technology Standards for Teachers and Students.......................65
NETS for Teachers 2008...................................................................................... 65
NETS for Students 2007....................................................................................... 67
Glossary ............................................................................................................... 69
A to Z Computer/Internet Terms............................................................................ 69
Appendix A............................................................................................................... 74
Preface
The Internet and its World Wide Web are growing, developing and adding new
features at an explosive rate. As you read this, new technologies are being
developed and implemented to make ‘surfing’ the Internet for useful information of
all types easier and more accurate, from the traditional document to Flash videos
and file types that were previously inaccessible. These types of pages
used to be invisible but can now be found in most search engine results:
• Pages in non-HTML formats (pdf, Word, Excel, PowerPoint), now converted
into HTML.
• Script-based pages, whose URLs contain a ? or other script coding.
• Pages generated dynamically by other types of database software (e.g.,
Active Server Pages, Cold Fusion). These can be indexed if there is a stable
URL somewhere that search engine crawlers can find.
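The last two bullets describe pages whose addresses carry script parameters. As a rough illustration only, the sketch below uses Python's standard urllib.parse module to flag such addresses. The function names, the list of extensions, and the example.com addresses are all hypothetical; real crawlers apply far more elaborate rules.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical list of file extensions that usually indicate server-side scripts.
SCRIPT_EXTENSIONS = (".asp", ".aspx", ".cfm", ".php", ".cgi")


def looks_dynamic(url: str) -> bool:
    """Return True if the URL has a query string or a script-style extension."""
    parts = urlparse(url)
    has_query = bool(parts.query)
    has_script_ext = parts.path.lower().endswith(SCRIPT_EXTENSIONS)
    return has_query or has_script_ext


def query_parameters(url: str) -> dict:
    """Return the parameters that follow the '?' as a dictionary of lists."""
    return parse_qs(urlparse(url).query)


if __name__ == "__main__":
    for address in (
        "http://example.com/lessons/songs.html",
        "http://example.com/catalog.asp?topic=grammar&level=2",
    ):
        label = "dynamic" if looks_dynamic(address) else "static"
        print(address, "->", label)
```

A stable address such as the second one can still be indexed, as the bullet notes, provided a crawler can find a link to it somewhere.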
The "visible web" is what you can find using general web search engines. It's also
what you see in almost all subject directories. The "invisible web" is what you
cannot find using these types of tools.
Search engines' crawlers and indexing programs have overcome many of the
technical barriers that made it impossible for them to find "invisible" web pages.
Computer robot programs, sometimes referred to as "crawlers", "spiders",
"knowledge-bots" or "knowbots", are used by search engines to roam the World Wide
Web via the Internet, visit sites and databases, and keep the search engine database
of web pages up to date. They obtain new pages, update known pages, and delete
obsolete ones. Their findings are then integrated into the "home" database. Most
large search engines operate several robots all the time. Even so, the Web is so
enormous that it can take six months for spiders to cover it, resulting in a certain
degree of "out-of-datedness" (link rot) in all the search engines.
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Glossary.html
Therefore, this reference book is truly just a starting point for the serious
researcher, whether in academia or as a consumer of goods and services.
Objectives
In this brief overview we will look at and explore the elements that make for effective
research on the Internet.
1. You will learn the Internet is composed of the “Surface Web” and the “Deep” or
“Hidden” Web.
2. You will learn how to access information on both in the most expedient way
through Search Engines, Meta-search engines and other Internet tools.
a. You will learn what Search Engines are and the various types
available.
b. You will learn what Subject Directories, Portals, and Databases are.
3. You will learn how to construct a search strategy.
4. You will learn the basics of Boolean parameters which narrow search results.
5. You will be provided special resources for academic research.
Materials:
This workshop needs to be conducted in a computer lab with very good Internet
access. Participants will follow specific areas of this reference book throughout the
workshop.
These areas can be changed according to the needs of the group. This reference
book is as comprehensive a guide as possible at the time of production.
Timing:
This workshop is designed to give a brief introduction to the complex world of the
‘Surface’ and ‘Hidden’ Webs with a focus on helping make searches more effective
and productive. Normal time allotted is 2 hours but it can be extended according to
time availability and the group’s level of expertise and interest. It is fully expected that
participants will regularly refer to this book and refine their search skills
independently.
DISCLAIMER: Changes on the Internet and in the Hidden Web occur at a rapid pace
so some of the search engines, sites, directories and databases may no longer be
available at the web addresses provided and some may no longer exist. Be prepared
to move quickly to the next point of interest. Broken links and inaccessible web-sites
can be researched at a later date.
Procedure
It is preferable to distribute this reference book well in advance of the workshop so
participants can familiarize themselves with the terms and content and explore a few
of the sites.
Part 1 – The Surface Web, Search Engines and Directories
A. Activating Prior Knowledge
ACTIVITY: PRIME TASK: Q & A
1. The Surface Web (WWW) – What is it composed of?
Write as many types of information or components of the World Wide Web as you
can.
Time: 10 minutes
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
2. How can you access this information?
Write as many ways as you can.
Time: 10 minutes
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
3. How many Search Engines can you name? What is your favorite search engine?
Do you use more than one?
Write your answers.
Time: 5 minutes
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
4. How often do you use a search engine in a day? Week? What do you search for?
How long do you spend per search? Do you get the results you need or want?
Write your answers.
Time: 5 minutes
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
B. Search Engine – An online (Internet) World Wide Web search program.
1. There are 3 types of Search Engine:
a). Crawler-based (e.g. Google) – these create their listings automatically through
special programs that crawl or “spider” the web. The crawler follows links in web
pages the engine already has in its collection of sources and retrieves the
information it finds. The engine’s index servers record the key words each page
contains, and its doc servers retrieve the entire document and create snippets that
describe the document and contain the key words that might be the subject of a
search query. – Very fast.
b). Human-Powered Directories (e.g. Open Directory Project) – these get their
information from visitor submissions, each of which includes a short description that
is the source of any key words in a search. – Also fast.
c). Hybrid Engines – these combine results from the first two types, though an engine
may favor one type over the other. – Depends on the engine.
Search engines rely on their own ‘cache’ of web pages they have harvested, but
when you click a result you are taken to the source’s latest page. If no other page
ever links to a page, it cannot be indexed.
The pages indexed are visible pages only. We’ll look at the Invisible web in Part 2.
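To make the link-following idea concrete, here is a minimal sketch of a crawler working on a tiny made-up “web” held in memory. The page names and the LINKS table are hypothetical, and no real engine is claimed to work exactly this way; the point is only that a page nobody links to is never reached.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the pages it links to.
LINKS = {
    "home": ["lessons", "games"],
    "lessons": ["songs", "home"],
    "games": ["songs"],
    "songs": [],
    "orphan": ["home"],  # links out, but no page links to it
}


def crawl(seed: str, links: dict) -> list:
    """Breadth-first crawl: follow links from the seed, visiting each page once."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


if __name__ == "__main__":
    found = crawl("home", LINKS)
    print("Indexed:", found)
    print("Never reached:", sorted(set(LINKS) - set(found)))
```

Starting from "home", the crawl reaches every page except "orphan", which stays out of the index even though it exists.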
2. How many search engines do you think there are? 80% of web pages in a major
search engine exist only on that engine; so, it is worth taking a look at some of the
others for a ‘second opinion’.
ACTIVITY:
Choose a topic and search for it on Google or www.DuckDuckGo.com. Then do the
same search on Exalead (www.exalead.com/search/). Compare the number of results
and the sources of these results.
Time: 20 minutes
C. Meta-search Engines – combine the results of many search engines.
(www.dogpile.com), (www.surfwax.com)
ACTIVITY:
Use the same search term as in the previous activity and compare the results again.
Time: 10 minutes
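One simple way to picture how results from several engines might be combined is sketched below. It is an assumption for illustration only: the round-robin merge, the function name merge_results, and the sample addresses are hypothetical, and actual meta-search engines use their own ranking methods.

```python
from itertools import zip_longest


def merge_results(*result_lists):
    """Interleave ranked result lists round-robin, dropping repeated addresses."""
    merged = []
    seen = set()
    # zip_longest pads shorter lists with None so no result is lost.
    for rank_group in zip_longest(*result_lists):
        for url in rank_group:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    engine_a = ["a.example/1", "shared.example/x", "a.example/2"]
    engine_b = ["shared.example/x", "b.example/1"]
    print(merge_results(engine_a, engine_b))
```

The address both engines returned appears only once in the merged list, while results unique to either engine are kept.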
D. Search Queries
FRAMING YOUR SEARCH STRATEGY
To get a successful search result, you must ask the right search question. Framing a
good question requires you to think strategically about exactly what you need.
"By taking the time to identify key phrases and visualize the ideal answer, you will be
more likely to recognize that answer when you find it online." (Nora Paul)
Her guidelines are based on the standard journalist approach of "who, what, when,
where, why and how" reporting and include these tips, among others:
Who:
• Who is the research about: a politician, a businessperson, a scientist, a
criminal?
• Who is key to the topic you are researching? Are there any recognized
experts or spokespersons you should know about?
What:
• What kind of information do you need: statistics, sources, background?
• What kind of research are you doing: an analysis, a background report, a
follow-up?
• What would the ideal answer look like?
When:
• When did the event being researched take place? This will help determine the
source to use, particularly, which information source has resources dating far
enough back.
• Do you know when you should stop searching?
Where:
• Where did the event you are researching take place?
• Where have you already looked for information?
• Where might there have been previous coverage: newspapers, broadcasts,
trade publications, court proceedings, discussions?
Why:
• Why do you need the research: seeking a source to interview, surveying a
broad topic, pinpointing a fact?
• Why must you have the research: to make a decision, to corroborate a
premise?
How:
• How much information do you need: a few good articles for background,
everything in existence on the topic, just the specific fact?
• How are you going to use the information: for an anecdote, for publication?
"Today," Schlein says, "so much data is available that, without a plan, you can easily
find yourself swimming in an ocean of information…A good, clear question will save
you hours of work." Find Paul's complete checklist and other good search
suggestions from Schlein in Find It Online (Tempe, AZ: Facts on Demand Press,
2004).
ACTIVITY:
Reframe the above criteria for research on an academic topic.
Time: 15 minutes
E. Basic Boolean Search Operators (AND, OR, NOT)
ACTIVITY:
Complete the 4 activities on the “Boolify” worksheets.
Time: 30 minutes
See Appendix A
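The three operators named in this section behave like operations on sets of matching documents. The sketch below is a minimal illustration under that assumption; the DOCUMENTS collection and the helper name containing are hypothetical, and each search engine has its own syntax for these operators.

```python
# Hypothetical mini-collection: document name -> set of words it contains.
DOCUMENTS = {
    "doc1": {"lesson", "plans", "grammar"},
    "doc2": {"grammar", "games"},
    "doc3": {"songs", "games"},
}


def containing(term):
    """Return the set of document names that contain the term."""
    return {name for name, words in DOCUMENTS.items() if term in words}


# AND narrows: only documents containing both terms.
grammar_and_games = containing("grammar") & containing("games")

# OR broadens: documents containing either term.
grammar_or_games = containing("grammar") | containing("games")

# NOT excludes: documents with the first term but without the second.
games_not_songs = containing("games") - containing("songs")

if __name__ == "__main__":
    print("grammar AND games:", sorted(grammar_and_games))
    print("grammar OR games:", sorted(grammar_or_games))
    print("games NOT songs:", sorted(games_not_songs))
```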
F. Search Tips, Tricks and Techniques
See “Some Search Tips, Tricks, & Techniques” below
Time: 5 minutes
G. Wrap-up of Part 1
Reflection and Feedback
Part 2 – The Hidden Web
Look at “10 Search Engines to Explore the Invisible Web” below
Experiment and explore some of the Web Portals, Directories and Databases
A. List the Categories you find in each
B. Try Boolean searching for a specific topic you currently are researching for a
paper or lesson
Time: 1 hour
You may make notes below:
The Internet, World Wide Web and the Hidden Web
The Internet is a network of computers connected together ('External net') to share
information with others through means of the World Wide Web (WWW).
World Wide Web (WWW) is part of the Internet where text and graphics are placed
together and where information can be easily accessed and shared with others to
form a Web Page along with links to different documents or other places (Hypertext
or Hyperlinks).
- From the Glossary Section at the end of this reference book
The World Wide Web is also known as the ‘Surface Web’ – available to anyone who
has a computer and internet connection.
Scratching the Surface and Digging Deep – Layers of the Web
"The Invisible Web"
By Chris Sherman
There's a big problem with most search engines, and it's one many people aren't
even aware of. The problem is that vast expanses of the Web are completely
invisible to general purpose search engines like AltaVista, HotBot and Google. Even
worse, this "Invisible Web" is in all likelihood growing significantly faster than the
visible Web you're familiar with.
So what is this Invisible Web and why aren't search engines indexing it? To answer
this question, it's important to first define the "visible" Web, and describe how search
engines compile their indexes.
The Web was created a little over twenty-two years ago by Tim Berners-Lee, a
researcher at the European Organization for Nuclear Research (CERN, a name
derived from the acronym for the French Conseil Européen pour la Recherche
Nucléaire), a high-energy physics laboratory in Switzerland.
Berners-Lee designed the Web to be platform-independent, so that researchers at
CERN could share materials residing on any type of computer system, avoiding
cumbersome and potentially costly conversion issues. To enable this cross-platform
capability, Berners-Lee created HTML, or HyperText Markup Language - essentially
a dramatically simplified version of SGML (Standard Generalized Markup Language).
HTML documents are simple: they consist of a "head" portion, with a title and
perhaps some additional meta-data describing the document, and a "body" portion,
the actual document itself. The simplicity of this format makes it easy for search
engines to retrieve HTML documents, index every word on every page, and store
them in huge databases that can be searched on demand.
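The idea of indexing every word on every page can be pictured with a small “inverted index”, sketched here as an illustration only. The sample pages and the function names build_index and search are hypothetical, and production search engines store far more than a word-to-page mapping.

```python
import re
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map each lowercase word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict, word: str) -> set:
    """Look up a single word; return an empty set if it was never indexed."""
    return set(index.get(word.lower(), set()))


# Hypothetical pages standing in for retrieved HTML documents.
PAGES = {
    "page1": "Lesson plans for teaching English songs",
    "page2": "Games and songs for the classroom",
}
INDEX = build_index(PAGES)

if __name__ == "__main__":
    print("songs:", sorted(search(INDEX, "songs")))
    print("games:", sorted(search(INDEX, "Games")))
    print("physics:", sorted(search(INDEX, "physics")))
```

Because the index is built once in advance, answering a query is a quick lookup rather than a fresh read of every page.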
What's less easy is the task of actually finding all the pages on the Web. Search
engines use automated programs called spiders or robots to "crawl" the Web and
retrieve pages. Spiders function much like a hyper-caffeinated Web browser - they
rely on links to take them from page to page.
Crawling is a resource-intensive operation. It also puts a certain amount of demand
on the host computers being crawled. For these reasons, search engines will often
limit the number of pages they retrieve and index from any given Web site. It's
tempting to think that these unretrieved pages are part of the Invisible Web, but they
aren't. They are visible and indexable, but the search engines have made a
conscious decision not to index them.
In recent months, much has been made of these overlooked pages. Many of the
major engines are making serious efforts to include them and make their indexes
more comprehensive. Unfortunately, the engines have also discovered through their
"deep crawls" that there's a tremendous amount of duplication and spam on the Web.
Current estimates put the Web at about 1.2 to 1.5 billion indexable pages. Both
Inktomi and AltaVista have claimed that they've spidered most of these documents,
but have been forced to cull their indexes to cope with duplicates and spam. Inktomi
puts the size of the distilled Web at about 500 million pages; AltaVista at about 350
million.
But these numbers don't include Web pages that can't be indexed, or information
that's available via the Web but isn't accessible by the search engines. This is the
stuff of the Invisible Web.
Why can't some pages be indexed? The most basic reason is that there are no links
pointing to a page that a search engine spider can follow. Or, a page may be made
up of data types that search engines don't index - graphics, CGI scripts, Macromedia
Flash or PDF files, for example.
But the biggest part of the Invisible Web is made up of information stored in
databases. When an indexing spider comes across a database, it's as if it has run
smack into the entrance of a massive library with securely bolted doors. Spiders can
record the library's address, but can tell you nothing about the books, magazines or
other documents it contains.
There are thousands - perhaps millions - of databases containing high-quality
information that are accessible via the Web. But in order to search them, you
typically must visit the Web site that provides an interface to the database. The
advantage to this direct approach is that you can use search tools that were
specifically designed to retrieve the best results from the database. The
disadvantage is that you need to find the database in the first place, a task the
search engines may or may not be able to help you with.
Another problem is that content in some databases isn't designed to be directly
searchable. Instead, Web developers are taking advantage of database technology
to offer customized content that's often assembled on the fly. Search engine results
pages are an example of this type of dynamically generated content - so are services
like My Excite and My Yahoo. As Web sites get more complex and users demand
more personalization, this trend toward dynamically generated content will
accelerate, making it even harder for search engines to create comprehensive Web
indexes.
In a nutshell, the Invisible Web is made up of unindexable content that search
engines either can't or won't index. It's a huge part of the Web, and it's growing.
Fortunately, there are several reasonably thorough guides to the Invisible Web.
Gary Price, Reference Librarian at the Gelman Library at George Washington
University, is considered one of the foremost authorities on online databases and
other invaluable search resources on the Invisible Web.
http://www.resourceshelf.com/
Price's List of Lists (LOL) was started around 1998 and maintained by Gary Price for
many years. The LOL grew, and Gary's commitment to other projects and speaking
engagements made the upkeep of the LOL impossible. In late 2000, Gary
approached Trip Wyckoff, of Specialissues.com, about taking over the upkeep and
expansion of the LOL. By 2002 the online database and structure to maintain and
organize the LOL was in place and in October 2002 the LOL was transferred to
www.Specialissues.com.
"By the way, do not mistake an interest in the Invisible Web as a slam on the general
search engines because it is NOT," says Price. "General search tools are still 100%
essential for accessing material on the Internet."
One of the largest gateways to the Invisible Web is the aptly named Invisibleweb.com
<http://www.invisibleweb.com> from Intelliseek.
"Invisible Web sources are critical because they provide users with specific, targeted
information, not just static text or HTML pages," says Sundar Kadayam, CTO and
Co-Founder, Intelliseek.
"InvisibleWeb.com is a Yahoo-like directory. It is a high quality, human edited and
indexed, collection of highly targeted databases that contain specific answers to
specific questions," says Kadayam.
Intelliseek also makes BullsEye, a desktop based metasearch engine that can also
access many of the sites included in InvisibleWeb.com. More information can be
found at <http://www.intelliseek.com/prod/bullseye.htm>.
"A good librarian would not start looking for a phone number (specialized, Invisible
Web info) by searching the Encyclopaedia Britannica (general knowledge resource),"
says Price. "Both professional and casual searchers should at least be aware that
they could be missing some information or wasting time finding what could be found
more easily if the right tool for the job is easily accessible. This is very similar to a
good reference librarian 'knowing' the major reference tools in his or her collection."
Chris Sherman is the Web Search Guide for About.com.
- Extracted from http://web.freepint.com/go/newsletter/64
Gary Price's List of Lists
Agriculture, Forestry, Fishing and Hunting, Petroleum & Mining, Utilities,
Construction, Manufacturing, Wholesale Trade, Retail Trade, Transportation and
Warehousing, Information, Finance & Insurance, Real Estate Rental & Leasing,
Professional, Scientific, and Technical Services, Business & Industry Management,
Administrative & Support Services, Education, Health Care and Social Assistance,
Arts, Entertainment and Recreation, Accommodation and Food Services, Repairs,
Religious, Civic, Professional, and Similar Organizations, Public Administration &
Public Works, Country/Region Specific, Executives…
- extracted from http://www.specialissues.com/lol/
Education
Magazine Article Year
American School & Top 10 Issue (biggest, best and most popular in education
2005
University Magazine facilities and business)
American School & Top 10 Issue (biggest, best and most popular in education
2003
University Magazine facilities and business)
14
15. American School & Top 100 School Districts and Colleges Facilities (ranked
2003
University Magazine by size of facilities)
American School & Top 10 Issue (biggest, best and most popular in education
2004
University Magazine facilities and business)
American School & Top 100 School Districts and Colleges Facilities (ranked
2004
University Magazine by size of facilities)
American School & Top 100 School Districts and Colleges Facilities (ranked
2002
University Magazine by size of facilities)
American School & Top 10 Issue (biggest, best and most popular in education
2006
University Magazine facilities construction, operations and management)
American School & Top 100 School Districts and Colleges Facilities (ranked
2006
University Magazine by size of facilities)
Business Week
(Global edition) Best Business Schools (ranking and review of the world's
2002
(formerly North leading business schools) (1986)
America edition)
Business Week
(Global edition) Best Executive Education/Business Schools (ranking and
2005
(formerly North review of the world's leading business schools) (1986)
America edition)
Business Week
(Global edition) Best Executive Education/Business Schools (ranking and
2004
(formerly North review of the world's leading business schools) (1986)
America edition)
Business Week
(Global edition)
Young Professionals: Best Undergrad B-Schools 2007
(formerly North
America edition)
Business Week
(Global edition)
Young Professionals: Best Undergrad B-Schools 2008
(formerly North
America edition)
MBA Report (annual look at master of business
administration education, we've decided to forgo our
Canadian Business traditional ranking of Canada's MBA programs and instead 2003
examine the ever-increasing variety of choices Canadian
schools are offering) (1991)
Chief Executive Annual Best Business Schools for Executive Education 2006
15
16. (2004)
Chronicle of Higher Education, The | Almanac of Higher Education (statistical/demographic databook on education covering four major topical areas: students, faculty and staff, resources, and institutions) (separate issue) | 2002
Expansion Management | Metro With the Best Public Education Systems | 2005
Foodservice Director | College Census (2001 performance report for 100 top self-op colleges) | 2002
Foodservice Director | School Census (performance report for top 100 school districts) | 2002
Forbes | Best Business Schools (ranked by return on investment) (2001, biennial) | 2008
Forbes | Best Business Schools (ranked by return on investment) (2001, biennial) | 2007
Fortune | Top 50 MBA Employers | 2007
Fortune (International Version: Asia, Europe, Latin America) | 20 Great Employers for New Grads | 2007
Fortune Small Business: FSB | 10 Cool Colleges for Entrepreneurs | 2006
Fortune Small Business: FSB | Best Colleges for Entrepreneurs | 2007
Maclean's | Canada's Best Schools | 2004
Maclean's | Annual University Ranking (1990) | 2004
Meat & Poultry | Scholastic Top 10 (top 10 universities ranked by the quality and variety of workshops, conferences and short courses available at universities throughout the U.S.) (2000) | 2004
Meat & Poultry | Top 10 Universities (top 10 universities ranked by the quality and variety of workshops, conferences and short courses available at universities throughout the U.S.) (2000) | 2007
National Law Journal | NLJ Law Schools Report | 2008
Progress Magazine (CA) (formerly Atlantic Progress Magazine) | The High School Report Card (the AIMS Ranking of High School Performance in Every District in Atlantic Canada and Maine) (2002) | 2009
Quirk's Marketing Research Review | University Degree Programs in Marketing Research | 2008
School Bus Fleet | Statistics & Top Rankings | 2003
School Bus Fleet | Top 50 Contractor Fleets | 2002
School Bus Fleet | Top 100 School District Fleets | 2002
School Planning & Management | Leading the Way: America's Fastest Growing Districts | 2007
Technology Review (formerly MIT Technology Review) | University Research Scorecard (ranking and analysis of intellectual property and research revenues and spin-offs, includes profiles of hot start-ups) | 2002
U.S. News and World Report | Best Graduate Schools Guide | 2002
U.S. News and World Report | America's Best Colleges Guide | 2002
U.S. News and World Report | Colleges (1,400+ schools) | 2002
U.S. News and World Report | Community Colleges (1,200+ schools) | 2002
U.S. News and World Report | Corporate E-learning vendors (600+ providers) | 2002
U.S. News and World Report | E-learning courses and degrees (1,000+ institutions) | 2002
U.S. News and World Report | Graduate Schools (1,000+ programs) | 2002
U.S. News and World Report | Scholarships (600,000+ awards) | 2002
U.S. News and World Report | Best Graduate Schools | 2005
U.S. News and World Report | Best Colleges | 2004
Virginia Business | Special Report: Business Schools Directory | 2006
Virginia Business | Private Schools Directory | 2006
Virginia Business | Special Report: Community Colleges Directory | 2006
Virginia Business | Education: Engineering/IT Schools Directory | 2006
Three Types of Search Engines
The term "search engine" is often used generically to describe crawler-based search
engines, human-powered directories, and hybrid search engines. These three types
gather their listings in different ways.
Crawler-based search engines
Crawler-based search engines, such as Google (http://www.google.com), create their
listings automatically. They "crawl" or "spider" the web, then people search through
what they have found. If web pages are changed, crawler-based search engines
eventually find these changes, and that can affect how those pages are listed. Page
titles, body copy and other elements all play a role.
A typical web query takes less than half a second to process, yet it
involves a number of different steps that must be completed before results can be
delivered to the person seeking information. The following graphic (Figure 1) illustrates
this life span (from http://www.google.com/corporate/tech.html):
1. The web server sends the query to the index servers. The content inside the
index servers is similar to the index in the back of a book: it tells which pages
contain the words that match the query.
2. The query travels to the doc servers, which actually retrieve the stored
documents. Snippets are generated to describe each search result.
3. The search results are returned to the user in a fraction of a second.
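The three steps can be sketched in a few lines of Python. This is only a toy illustration, not how Google is actually built: the documents, the `build_index` and `search` functions, and the 30-character snippet length are all assumptions made for the example. The dictionary of word-to-document sets plays the role of the index servers, and the dictionary of stored texts plays the role of the doc servers.

```python
from collections import defaultdict

# Hypothetical stored documents: doc id -> text (stands in for the "doc servers").
DOCS = {
    1: "Vancouver florists deliver fresh flowers daily",
    2: "Welders in Vancouver offer mobile repair",
    3: "Florists in Hungary grow tulips",
}

def build_index(docs):
    """Map each lowercase word to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(query, index, docs, snippet_len=30):
    """Return (doc_id, snippet) pairs for documents containing every query word."""
    words = query.lower().split()
    if not words:
        return []
    # Step 1: ask the index which documents contain each word, then intersect.
    matches = set.intersection(*(index.get(w, set()) for w in words))
    # Step 2: fetch the stored text and cut a short snippet for each match.
    # Step 3: return the list to the caller.
    return [(doc_id, docs[doc_id][:snippet_len]) for doc_id in sorted(matches)]

INDEX = build_index(DOCS)
results = search("vancouver florists", INDEX, DOCS)
print(results)
```

Only document 1 contains both words, so it is the only result returned for this query.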
Human-powered directories
A human-powered directory, such as the Open Directory Project
(http://www.dmoz.org/about.html), depends on humans for its listings. (Yahoo!, which
used to be a directory, now gets its information from crawlers.) A directory
gets its listings from site owners, who submit a short description of the entire
site, or from editors who write descriptions of the sites they review. A
search looks for matches only in these descriptions. Changing web pages,
therefore, has no effect on how they are listed. Techniques that are useful for
improving a listing with a search engine have nothing to do with improving a listing in
a directory. The only exception is that a good site, with good content, might be more
likely to get reviewed for free than a poor site.
Hybrid search engines
Today, it is extremely common for crawler-based and human-powered results to be
combined when conducting a search. Usually, a hybrid search engine will favor one
type of listing over the other. For example, MSN Search
(http://www.imagine-msn.com/search/tour/moreprecise.aspx) is more likely to present
human-powered listings from LookSmart (http://search.looksmart.com/). However, it also
presents crawler-based results, especially for more obscure queries.
Recommended Search Engines
UC Berkeley - Teaching Library Internet Workshops
Google is currently the most used search engine. It has one of the largest databases
of Web pages, including many other types of web documents: blog posts, wiki pages,
group discussion threads, and document formats (e.g., PDFs, Word or Excel
documents, PowerPoints). Despite the presence of all these formats, Google's
popularity ranking often places worthwhile pages near the top of search results.
Google alone is not always sufficient, however. Not everything on the Web is fully
searchable in Google. Overlap studies show that more than 80% of the pages in a
major search engine's database exist only in that database. For this reason, getting a
"second opinion" can be worth your time. For this purpose, we recommend Yahoo!
Search or Exalead. We do not recommend using meta-search engines as your
primary search tool.
Table of Search Engine Features
Some common techniques will work in any search engine. However, in this very
competitive industry, search engines also strive to offer unique features. When in
doubt, look for "help", "FAQ", or "about" links.
Search Engine
  Google: www.google.com
  Yahoo! Search: search.yahoo.com
  Exalead: www.exalead.com/search/
Links to help
  Google: Google help
  Yahoo! Search: Yahoo! help
  Exalead: Exalead help and FAQ
Size, type
  Google: IMMENSE. Size not disclosed in any way that allows comparison. Probably the biggest.
  Yahoo! Search: HUGE. Claims over 20 billion total "web objects."
  Exalead: LARGE. Claims to have over 8 billion searchable pages.
Noteworthy features
  Google: PageRank™ system includes hundreds of factors, emphasizing pages most heavily linked from other pages. Many additional databases including Book Search, Scholar (journal articles), Blog Search, Patents, Images, etc.
  Yahoo! Search: Shortcuts give quick access to dictionary, synonyms, patents, traffic, stocks, encyclopedia, and more.
  Exalead: Truncation lets you search by the first few letters of a word. Proximity search lets you find terms NEAR each other or NEXT to each other. Thumbnail page previews. Extensive options for refining and limiting your search.
Phrase searching
  Google: Enclose phrase in "double quotes".
  Yahoo! Search: Enclose phrase in "double quotes".
  Exalead: Enclose phrase in "double quotes".
Boolean logic
  Google: Partial. AND assumed between words. Capitalize OR. ( ) accepted but not required. In Advanced Search, partial Boolean available in boxes.
  Yahoo! Search: Accepts AND, OR, NOT or AND NOT. Must be capitalized. ( ) accepted but not required.
  Exalead: Partial. AND assumed between words. Capitalize OR. ( ) accepted. See Web Search Syntax for more options.
+Requires/-Excludes
  Google: - excludes. + retrieves "stop words" (e.g., +in)
  Yahoo! Search: - excludes. + will allow you to search common words: "+in truth"
  Exalead: - excludes. + retrieves "stop words" (e.g., +in)
Sub-Searching
  Google: The search box at the top of the results page shows your current search. Modify this (e.g., add more terms at the end.)
  Yahoo! Search: The search box at the top of the results page shows your current search. Modify this (e.g., add more terms at the end.)
  Exalead: The search box at the top of the results page shows your current search. Modify this (e.g., add more terms at the end.)
Results Ranking
  Google: Based on page popularity measured in links to it from other pages: high rank if a lot of other pages link to it. Fuzzy AND also invoked. Matching and ranking based on "cached" version of pages that may not be the most recent version.
  Yahoo! Search: Automatic Fuzzy AND.
  Exalead: Popularity ranking emphasizes pages most heavily linked from other pages.
Field limiting
  Google: link: site: intitle: inurl: Offers U.S.Gov't Search and other special searches. Patent search.
  Yahoo! Search: link: site: intitle: inurl: url: hostname: (Explanation of these distinctions.)
  Exalead: intitle: inurl: site: after:[time period] before:[time period] (For details, click on "Advanced search")
Truncation, Stemming
  Google: No truncation within words. Automatically stems some words. Search variant endings and synonyms separately, separating with OR (capitalized): airline OR airlines. Use * or _ as wildcards substituting for initials or words: sickle * anemia, george _ bush
  Yahoo! Search: Neither. Search with OR as in Google.
  Exalead: Use *. Example: messag*
Language
  Google: Yes. Major Romanized and non-Romanized languages in Advanced Search.
  Yahoo! Search: Yes. Major Romanized and non-Romanized languages.
  Exalead: Extensive language and geographic options. Use "Advanced Search".
Translation
  Google: Yes, in "Translate this page" link following some pages. To and sometimes from English and major European languages and Chinese, Japanese, Korean. Uses its own translation software with user feedback.
  Yahoo! Search: Available as a separate service.
  Exalead: Yes, in "Translate this page" link following some pages.
How do Search Engines Work?
Search engines do not really search the World Wide Web directly. Each one
searches a database of web pages that it has harvested and cached. When you use
a search engine, you are always searching a somewhat stale copy of the real web
page. When you click on links provided in a search engine's search results, you
retrieve the current version of the page.
Search engine databases are selected and built by computer robot programs called
spiders. These "crawl" the web, finding pages for potential inclusion by following the
links in the pages they already have in their database. They cannot use imagination
or enter terms in search boxes that they find on the web.
If a web page is never linked from any other page, search engine spiders cannot find
it. The only way a brand new page can get into a search engine is for other pages to
link to it, or for a human to submit its URL for inclusion. All major search engines offer
ways to do this.
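To make the point about unlinked pages concrete, here is a minimal sketch of a spider following links. The link graph, the page names, and the `crawl` function are hypothetical; a real spider fetches pages over the network, whereas this one walks an in-memory dictionary.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["story"],
    "story": [],
    "orphan": ["home"],  # links out, but no other page links to it
}

def crawl(seeds, links):
    """Follow links breadth-first from the seed pages; return pages in discovery order."""
    found = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        found.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return found

found = crawl(["home"], LINKS)
missed = sorted(set(LINKS) - set(found))

# Submitting the URL by hand amounts to adding it to the seed list.
found_after_submit = crawl(["home", "orphan"], LINKS)
print(found, missed, found_after_submit)
```

Starting from "home", the spider never reaches "orphan" because nothing links to it. Once the page is submitted as a seed, it is found like any other.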
After spiders find pages, they pass them on to another computer program for
"indexing." This program identifies the text, links, and other content in the page and
stores it in the search engine database's files. The database can then be searched
by keyword, and by whatever more advanced approaches are offered, and the page will
be found if your search matches its content.
Many web pages are excluded from most search engines by policy. The contents of
most of the searchable databases mounted on the web, such as library catalogs and
article databases, are excluded because search engine spiders cannot access them.
All this material is referred to as the "Invisible Web" -- what you don't see in search
engine results.
Recommended Subject Directories
UC Berkeley - Teaching Library Internet Workshops
- extracted from
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/SubjDirectories.html
Recommended General Subject Directories:
Table of Directory Features
Web Directories
  ipl2: www.ipl.org
  Infomine: infomine.ucr.edu
  About.com: www.about.com
  Yahoo!: dir.yahoo.com
Size, type
  ipl2: Over 40,000. Highest quality sites only. Useful, reliable annotations. Formed by a merger of the Librarians' Internet Index and the Internet Public Library.
  Infomine: Over 125,000. Useful, reliable annotations. Compiled by academic librarians from the University of California and elsewhere.
  About.com: Over 2 million. Generally good annotations done by "Guides" with various levels of expertise.
  Yahoo!: About 4 million. Very short descriptions. Often useful, especially for popular and commercial topics.
Phrase searching
  ipl2: No.
  Infomine: Yes. Use " ". |term term| requires exact match.
  About.com: Yes. Use " "
  Yahoo!: Yes. Use " "
Boolean logic
  ipl2: OR implied between words. Also accepts AND and NOT. Nesting with ( ) does not work.
  Infomine: AND implied between words. Also accepts OR, NOT, and ( ).
  About.com: No.
  Yahoo!: Yes, as in Yahoo! Search web search engine.
Truncation
  ipl2: No.
  Infomine: Use *. Also stems. Can turn stemming off. Use " " or | | to search exact terms.
  About.com: Use *. Not accepted consistently.
  Yahoo!: No.
Field searching
  ipl2: No.
  Infomine: Limit to Author, Title, Subject, Keyword, Description, and more.
  About.com: No.
  Yahoo!: As in Yahoo! Search web search engine.
Subject Directories (Contain Databases), and Portals
How to Find Subject-Focused Directories for a Specific Topic, Discipline, or
Field
There are thousands of specialized directories on practically every subject. If you
want an overview, or if you feel you've searched long enough, try to find one. Often
they are done by experts -- self-proclaimed or heavily credentialed. Here are some
ways to find them:
Use any of the Subject Directories above to find more specific directories. Here are
some tips:
• In ipl2 or Infomine, look for your subject as you would for any other purpose,
and keep your eyes open for sites that look like directories. Read through the
descriptions. Sometimes these resources are identified as "Directories,"
"Virtual Libraries," or "Gateway Pages."
• In About.com (a portal, that is, a site that links to many other sites
organized by its own directory structure) or the Yahoo! directory, try adding the terms
web directories to your subject keyword term:
EXAMPLES:
civil war web directories
weddings web directories
• In About.com, search by topic and look for pages that are described as "101"
or "guides" or a "directory." About.com is written by "Guides" who,
themselves, often are experts in the sections they manage. Sometimes they
write excellent overviews of a topic.
Meta-Search Engines
UC Berkeley - Teaching Library Internet Workshops
What Are "Meta-Search" Engines? How Do They Work?
In a meta-search engine, you submit keywords in its search box, and it transmits your
search simultaneously to several individual search engines and their databases of
web pages. Within a few seconds, you get back results from all the search engines
queried. Meta-search engines do not own a database of Web pages; they send your
search terms to the databases maintained by search engine companies.
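The idea can be sketched as follows. Everything here is hypothetical: `engine_a` and `engine_b` stand in for real search engines and return canned results, and the round-robin merge is just one simple way to combine ranked lists. The sketch also queries the engines one after another, whereas a real meta-searcher sends the requests simultaneously.

```python
def engine_a(query):
    """Hypothetical search engine returning a ranked list of URLs."""
    return ["a.example/1", "shared.example/x", "a.example/2"]

def engine_b(query):
    """A second hypothetical engine with its own database and ranking."""
    return ["shared.example/x", "b.example/1"]

def meta_search(query, engines):
    """Send the query to every engine and interleave the results, dropping duplicates."""
    result_lists = [engine(query) for engine in engines]
    longest = max((len(r) for r in result_lists), default=0)
    merged, seen = [], set()
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

merged = meta_search("civil war", [engine_a, engine_b])
print(merged)
```

Note that the meta-searcher holds no pages of its own: every result comes from one of the engines it queried.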
Are "Smarter" Meta-Searchers Still Smarter?
"Smarter" meta-searcher technology includes clustering and linguistic analysis that
attempts to show you themes within results, and some fancy textual analysis and
display that can help you dig deeply into a set of results. However, neither of these
technologies is any better than the quality of the search engine databases they
obtain results from.
Few meta-searchers allow you to delve into the largest, most useful search engine
databases. They tend to return results from smaller and/or free search engines and
miscellaneous free directories, often small and highly commercial.
Although we respect the potential of textual analysis and clustering technologies, we
recommend directly searching individual search engines to get the most precise
results, and using meta-searchers if you want to explore more broadly.
The meta-search tools listed here are "use at your own risk." We are not
endorsing or recommending them.
Better Meta-Searchers
(What's Searched is as of the date at the bottom of the page. They change often.)
Yippy, yippy.com (formerly Clusty)
  What's Searched: Searches Bing, Ask, Open Directory, and Yahoo (as of 6/15/10).
  Complex Search Ability: Accepts Boolean operators AND, OR, NOT, and limiting by "filetype:" and "site:".
  Results Display: Results accompanied with subdivisions based on words in search results, intended to give the major themes. Click on these to search within results on each theme.
Dogpile, www.dogpile.com
  What's Searched: Searches Google, Yahoo, Bing, and Ask.com (as of 6/15/10). Sites that have purchased ranking and inclusion are mixed into the results. Watch for "Sponsored:".
Meta-Search Engines for SERIOUS Deep Digging
SurfWax, www.surfwax.com
  What's Searched: A better than average set of search engines. Can mix with educational, US Govt tools, and news sources, or many other categories.
  Complex Search Ability: Accepts " ", +/-. Default is AND between words. I recommend fairly simple searches, allowing SurfWax's SiteSnaps and other features to help you dig deeply into results.
  Results Display: Click on source link to view complete search results there. Click on the icon to view helpful "SiteSnap™" extracted from most sites in frame on right. Many additional features for probing within a site.
Copernic Agent, www.copernic.com
  What's Searched: Select from list of search engines by clicking on Advanced, then "Modify search engine settings".
  Complex Search Ability: ALL, ANY, Phrase, and more. Also Boolean searching within results under "Find in results" > "Advanced Find" (powerful!).
  Results Display: Must be downloaded and installed, but Basic version is free of charge. Table comparing versions.
Search Basics: Constructing a Google Query
Search engines work by providing you with a screen form containing one or more
fields into which you type your search term (a combination of words and/or phrases).
Single words are quick and easy, but produce much too general a result. With Google,
for example, looking for florists yields 24 million hits (search results). If we narrow
the search to florists in Vancouver (i.e. type florists Vancouver), we come up with
1.7 million results. Narrow further by making your search term a phrase. To do this,
enclose the words in double quotation marks, as in "Vancouver florists". In Google,
this example produces just 27,000 hits, because Google is making a match for the
exact string of characters we typed.
Some search engines provide radio buttons that allow you to specify whether the
search must match Any or All of the terms you type. Most default to All, returning
pages that contain every word used in your search. Choose Any to retrieve pages
that contain one or more of your search words. This AND versus OR distinction is
called Boolean logic, and it's the key to controlling the search engines. To specify an
OR in Google, you must type the word OR between words. In our Vancouver florists
scenario, for example, typing florists OR vancouver results in 85 million hits
because it returns all pages containing either the word florists or the word Vancouver.
Thus, you might get florists in Hungary and welders in Vancouver! By combining
ANDs, ORs, and phrases, you can begin to build truly powerful queries. Learn these
techniques and many more powerful search strategies in our popular Internet research
course.
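The query-building rules described above (all terms by default, double quotes for a phrase, a capitalized OR for alternatives) can be sketched as a small helper. The function names `build_query` and `search_url` are invented for this example, and the base address is an assumption about the search page's URL pattern. The sketch only assembles the text and the link; it does not contact any search engine.

```python
from urllib.parse import urlencode

def build_query(all_terms=(), phrase=None, any_terms=()):
    """Assemble a query string: plain terms, a quoted phrase, and OR-joined alternatives."""
    parts = list(all_terms)
    if phrase:
        parts.append(f'"{phrase}"')           # double quotes force an exact phrase
    if any_terms:
        parts.append(" OR ".join(any_terms))  # OR must be typed in capitals
    return " ".join(parts)

def search_url(query, base="https://www.google.com/search"):
    """Encode the query as the q parameter of a search URL (base is an assumption)."""
    return f"{base}?{urlencode({'q': query})}"

plain = build_query(all_terms=["florists", "Vancouver"])
narrow = build_query(phrase="Vancouver florists")
broad = build_query(any_terms=["florists", "vancouver"])
url = search_url(narrow)
print(plain, narrow, broad, url, sep="\n")
```

The three queries correspond to the florists examples: the plain pair of words, the quoted phrase that narrows the search, and the OR form that broadens it.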
Where does the term Boolean come from?
Boolean searching is built on a method of symbolic logic developed by George
Boole, a 19th century English mathematician. Most online databases and search
engines support Boolean searches. Boolean search techniques can be used to carry
out effective searches, cutting out many unrelated documents.
Is Boolean Search Complicated?
Using Boolean Logic to broaden and/or narrow your search is not as complicated as
it sounds; in fact, you might already be doing it. Boolean logic is just the term used to
describe certain logical operations that are used to combine search terms in many
search engine databases and directories on the Net. It's not rocket science, but it
sure sounds fancy (try throwing this phrase out in common conversation!).
Basic Boolean Search Operators - AND
Using AND narrows a search by combining terms; it will retrieve documents that use
both the search terms you specify, as in this example:
• Portland AND Oregon
Basic Boolean Search Operators - OR
Using OR broadens a search to include results that contain either of the words you
type in. OR is a good tool to use when there are several common spellings or
synonyms of a word, as in this example:
• liberal OR democrat
Basic Boolean Search Operators - NOT
Using NOT will narrow a search by excluding certain search terms. NOT retrieves
documents that contain one, but not the other, of the search terms you enter, as in
this example:
• Oregon NOT travel
Keep in mind that not all search engines and directories support Boolean terms.
However, most do, and you can easily find out if the one you want to use supports
this technique by consulting the FAQs (Frequently Asked Questions) on a search
engine or directory's home page.
Boolean Search And / Or / Not
This is an algebraic concept, but don't let that scare you away. Boolean connectors
are all about sets. There are three little words that are used as Boolean connectors:
• and
• or
• not
Think of each keyword as having a "set" of results that are connected with it. These
sets can be combined to produce a different "set" of results. You can also exclude
certain "sets" from your results by using a Boolean connector.
AND is a connector that requires both words to be present in each record in the
results. Use AND to narrow your search.
Search Term Hits
Television 999 hits
Violence 876 hits
Television and violence 123 hits
The words 'television' and 'violence' will both be present in each record.
OR is a connector that allows either word to be present in each record in the results.
Use OR to expand your search.
Search Term Hits
Adolescents 97 hits
Teenagers 75 hits
Adolescents or teenagers 172 hits
Either 'adolescents' or 'teenagers' (or both) will be present in each record.
NOT is a connector that requires the first word be present in each record in the
results, but only if the record does not contain the second word.
Search Term Hits
High school 423 hits
Elementary 652 hits
High school not Elementary 275 hits
Each record contains the words 'high school', but not the word 'elementary'.
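Because Boolean connectors are set operations, the three tables can be reproduced with Python sets. The record numbers below are synthetic, chosen only so that the set sizes match the hit counts shown. In particular, the 172 hits for OR equal 97 + 75 only if no record contains both words, so the sketch assumes the two sets do not overlap.

```python
# Synthetic record ids; only the sizes and overlaps are meant to match the tables.
television = set(range(0, 999))        # 999 records
violence = set(range(876, 1752))       # 876 records, 123 shared with television

adolescents = set(range(0, 97))        # 97 records
teenagers = set(range(97, 172))        # 75 records, assumed not to overlap

high_school = set(range(0, 423))       # 423 records
elementary = set(range(275, 927))      # 652 records, 148 shared with high_school

and_hits = television & violence       # AND: records in both sets
or_hits = adolescents | teenagers      # OR: records in either set (or both)
not_hits = high_school - elementary    # NOT: records in the first set only

print(len(and_hits), len(or_hits), len(not_hits))
```

AND keeps the intersection, OR keeps the union, and NOT removes from the first set every record that also appears in the second.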
Boolean Search Examples Boolean Connectors:
Interactive Text Equivalent
This Boolean demonstration provides a simple example of how Boolean connectors
can help focus your search as precisely as possible.
THE SCENARIO
Your research topic: television violence
You do a separate search for each keyword and get back the following results:
Television = 999
Violence = 876
That's a lot to wade through. Select 'AND,' 'OR,' or 'NOT' to see how that Boolean
connector will affect this search.
AND
You use 'AND' to connect terms or phrases.
We have two words 'television' and 'violence.' To connect them we use the Boolean
connector 'AND'. Compare the results of the search options below:
SEARCH #1: television
Result: A circle balloons until it fills about half the play area. As it gets bigger,
the word 'television' appears. When it has finished, the result count shows up:
'= 999 results'.
SEARCH #2: violence
Result: A circle balloons until it fills about half the play area. As it gets bigger,
the word 'violence' appears. When it has finished, the result count shows up:
'= 876 results'.
SEARCH #3: television AND violence
Result: The two circles balloon until they fill the play area as in the searches above.
As they get bigger, the words 'television' and 'violence' appear. When they have
finished, the result counts show up as above. In addition, the area where the two
circles overlap is a different color and reads as follows:
AND = 123 results
OR
You use 'OR' to search for multiple terms or phrases.
You've decided to focus on how violence on television affects a specific age group:
teenagers. But in your searches you've encountered another term that's
frequently used: 'adolescents.'
So, in order to get information that uses either term, you'd use the OR connector.
SEARCH: teenager OR adolescent
Result: Both circles balloon until they fill the play area as above. As they get bigger,
the words 'teenager' and 'adolescent' appear. When they have finished,
the result counts show up as above.
Next, 'OR' appears between them, and the two circles move towards one another.
The text 'teenager = 75 results' and 'adolescent = 97 results' stays where it is. As the
circles merge (and change into a new color), the 'OR' disappears behind them. When
the merging has finished, the following text appears in the middle of the new circle:
Teenager OR Adolescent
75 + 97 = 172 results
The 'teenager = 75 results' and 'adolescent = 97 results' labels should now be outside the
circle to the left and right.
NOT
You use 'NOT' to exclude terms or phrases.
In one of your searches you use "high school" as a keyword phrase. You notice that
you get many results which cover both high school and elementary school. The main
emphasis of your research, as you've followed the process, has turned towards how
television violence affects students in high school.
So, in order to eliminate unwanted results you use the NOT connector.
SEARCH: high school
The circle to the left balloons. As it gets bigger, the words 'high school'
appear. When it has finished, the result count shows up as follows: High school =
423 results.
SEARCH: elementary
The circle to the right balloons. As it gets bigger, the word 'elementary'
appears. When it has finished, the result count shows up as follows: Elementary =
652 results.
SEARCH: high school NOT elementary
Both circles balloon until they fill the play area as above. When they have finished,
the results appear as above, but where the circles overlap it reads: NOT =
148 exclusions.
Next, the 'elementary' circle and the NOT overlap move away from the 'high school'
circle. The NOT area looks like a bite taken out of the 'high school' circle.
When the 'elementary' circle and the NOT bite stop, the results in the 'high school'
circle change to:
High school NOT elementary 423 - 148 exclusions = 275
By excluding every record that mentions 'high school' in combination with 'elementary',
you get 275 results in which only high school is mentioned.
How the Search Engines Differ
The Web puts a variety of powerful search engines at your disposal, including
Altavista, Google, All The Web, Teoma, Wisenut, and many more. Which is best?
These tools vary in ease of use not to mention features. Your choice of search
engine should be driven by the research challenge you face. Some search engines
are better than others for particular purposes. See below for brief descriptions of
today's major players, their respective strengths and weaknesses, and their
affiliations:
Search Engine Syntax & Features Comparison Chart
An understanding of the syntax differences among search engines is essential to
mastery of these tools and the ability to force them to return the precise results you
want. Many of these sites appear to operate similarly, at least on the surface. Yet
they can differ substantially in how they understand queries and allow you to filter
results, as well as how they rank the hits returned. Consult our search basics page
for information on syntax and operators, then experiment with the search engines in
the chart provided. To click through to the various search engines, use the HTML
chart below. We have also provided a PDF version of the chart for printing.
Altavista
  Boolean: + - ( ); AND, OR, AND NOT, NEAR ( )
  Default: Phrase, then AND (Simple Srch)
  Phrase: ""
  Wildcards: Yes. * stands for 1-5 characters; must type first 3 characters
  Case sensitive: No
  Prefixes: anchor, applet, domain, host, image, like, link, text, title, url
  Family filter: Yes. Password protected.
Google
  Boolean: OR; -; + to include stop words
  Default: AND
  Phrase: ""
  Wildcards: Whole word wildcard (*)
  Case sensitive: No
  Prefixes: filetype, daterange, cache, link, related, info, spell, stocks, site, intitle, allintitle, inurl, allinurl
  Family filter: Yes
All The Web
  Boolean: AND, OR, ANDNOT, ( ), +, -; ( ) means OR
  Default: AND
  Phrase: ""
  Wildcards: No
  Case sensitive: No
  Prefixes: site, url, link, title, language, filesize, filetype
  Family filter: Yes
Wisenut
  Boolean: +, -
  Default: AND
  Phrase: ""
  Wildcards: No
  Case sensitive: No
  Prefixes: language
  Family filter: Yes
Teoma
  Boolean: -, OR; + to include stop words
  Default: AND
  Phrase: ""
  Wildcards: No
  Case sensitive: No
  Prefixes: intitle, inurl, site, inlink, lang, afterdate, beforedate, between date
  Family filter: No
Google: Google is the world's most popular search engine. Claiming to search 3.3
billion pages (that's practically the entire Web!), this search engine remains
undisputed king in terms of size. Google produces highly relevant results, using link
popularity for ranking. Google's original claim to fame was its speed, although its
clean, uncluttered interface has also won fans. Google defaults to AND when
processing queries containing two or more words (returning pages that match all
words specified). If you want either word (as in alternate spellings of color), you must
actually force Google to see your search this way, by specifying the Boolean OR
operator, as in color OR colour. Google supports exact phrase searching plus the
ability to exclude words (use the minus sign) and to constrain by domain and other
criteria. Alliances: Google has taken over the Deja newsgroup archive. It powers
hundreds of other search engines and the web search feature of directories like
Yahoo. Google's Web directory is provided by DMOZ.
Altavista: Still the champ in terms of raw search power, Altavista was recently
purchased by Overture, the Net's major pay-per-click search company. Altavista's
index is respectable, at 1 billion pages. It defaults to OR, ordering search results
according to number, location and proximity of search term occurrences. Use
Altavista when you need to construct complex queries containing nested
combinations of AND and OR. Altavista supports the quasi-Boolean operators (+, -)
and the formal Boolean operators (AND, OR, AND NOT, NEAR). This search engine
allows you to constrain your search by domain, location within page, date, and
numerous other criteria. Drawbacks include notoriously buggy hit counts and an
interface that could stand some usability improvements. Alliances: Altavista, too,
powers hundreds of other sites. Its web directory is provided by DMOZ.
All The Web: At first glance, All The Web looks much like Google, providing the
clean look and user-friendliness of the industry leader. All The Web defaults to AND,
with a convenient tick box that allows you to specify a phrase. Its index rivals
Google's, at 3.2 billion documents. It does not recognize formal Boolean arguments,
although it supports quasi-Boolean operators (+, -) and the ability to constrain by
domain, location within page, and several other criteria. Alliances: All The Web was
also recently taken over by Overture.
Wisenut: Known for its clean screen and speedy performance, Wisenut set out to
rival Google. A "clustering" search engine, Wisenut groups results into categories it
calls "WiseGuide." Small plus and minus signs allow you to collapse and expand
these categories. Like Google, Altavista, and other major players, Wisenut is a
spider-based search engine that crawls, links and indexes page contents. Wisenut
claims to have an index of 1.5 billion pages. Wisenut defaults to AND, and supports
phrase searching and the + and - operators, though it offers no advanced search
features as yet. Alliances: Wisenut is owned by Looksmart.
Teoma: Like Wisenut, Teoma set out to emulate Google's clean screen and fast
performance. It too defaults to AND. Teoma's index is a respectable 1.5 billion
pages. Like Google, Teoma evaluates page popularity, using complex relevance and
link popularity algorithms to rank results. Teoma clusters search results at the top of
the screen and displays a list of what it calls "Expert Link Collections" at bottom right.
These listings point to sites Teoma considers authoritative link collections relevant to
the subject of your search. Sometimes called jumplists, link collections can be among
the Web's hidden treasures. Teoma is one of the few search engines to identify
Teoma search engine is also useful in locating jumplists, which it calls "expert
link collections."
• Sign up for our popular Internet research course to find out more. Among
the many topics covered, you'll learn some little-known but potent Google
techniques for ferreting out the Net's most stubbornly elusive information!
Finding Information on the Internet: A Tutorial
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
Invisible or Deep Web: What it is, How to find it, and its inherent ambiguity
What is the "Invisible Web", a.k.a. the "Deep Web"?
Why isn't everything visible?
There are still some hurdles search engine crawlers cannot leap. Here are some
examples of material that remains hidden from general search engines:
• The Contents of Searchable Databases. When you search in a library
catalog, article database, statistical database, etc., the results are generated
"on the fly" in answer to your search. Because the crawler programs cannot
type or think, they cannot enter passwords on a login screen or keywords in a
search box. Thus, these databases must be searched separately.
o A special case: Google Scholar is part of the public or visible web. It
contains citations to journal articles and other publications, with links
to publishers or other sources where one can try to access the full text
of the items. This is convenient, but results in Google Scholar are only
a small fraction of all the scholarly publications that exist online. Much
more - including most of the full text - is available through article
databases that are part of the invisible web. The UC Berkeley Library
subscribes to over 200 of these, accessible to our students, faculty,
staff, and on-campus visitors through our Find Articles page.
• Excluded Pages. Search engine companies exclude some types of pages by
policy, to avoid cluttering their databases with unwanted content.
o Dynamically generated pages of little value beyond single use.
Think of the billions of possible web pages generated by searches for
books in library catalogs, public-record databases, etc. Each of these
is created in response to a specific need. Search engines do not want
all these pages in their web databases, since they generally are not of
broad interest.
o Pages deliberately excluded by their owners. A web page creator
who does not want his/her page showing up in search engines can
insert special "meta tags" that will not display on the screen, but will
cause most search engines' crawlers to avoid the page.
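As a rough illustration of the owner-exclusion case, the sketch below checks a page for a robots meta tag that asks crawlers not to index it. It assumes the widely used `<meta name="robots" content="noindex">` convention; the class and function names are invented for this example, and real crawlers also consult other signals (such as a site's robots.txt file) that are not modelled here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" ...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def may_index(html):
    """Return False if the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives and "none" not in parser.directives


if __name__ == "__main__":
    hidden = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(may_index(hidden))  # the page owner has opted out of indexing
```

A page carrying such a tag is still reachable by anyone who has its address; it simply does not show up in the results of search engines that honour the tag.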
35. How to Find the Invisible Web
Simply think "databases" and keep your eyes open. You can find searchable
databases containing invisible web pages in the course of routine searching in most
general web directories. Of particular value in academic research are:
• ipl2
• Infomine
Use Google and other search engines to locate searchable databases by searching a
subject term and the word "database". If the database uses the word database in its
own pages, you are likely to find it in Google. The word "database" is also useful in
searching a topic in the Google Directory or the Yahoo! directory, because they
sometimes use the term to describe searchable databases in their listings.
Examples:
plane crash database
languages database
toxic chemicals database
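The "subject term plus the word database" pattern is easy to automate when you want to try it for several topics. The sketch below builds the query text and a search URL; the helper names are hypothetical, and the base address with its `q` parameter is an assumption about the search engine's URL format that you may need to adjust for the engine you use.

```python
from urllib.parse import urlencode


def database_query(subject):
    """Append the word 'database' to a subject term."""
    return f"{subject.strip()} database"


def search_url(subject, base="https://www.google.com/search"):
    """Build a search URL for the query (base URL and 'q' are assumptions)."""
    return f"{base}?{urlencode({'q': database_query(subject)})}"


if __name__ == "__main__":
    for topic in ["plane crash", "languages", "toxic chemicals"]:
        print(database_query(topic), "->", search_url(topic))
```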
Remember that the Invisible Web exists. In addition to what you find in search
engine results (including Google Scholar) and most web directories, there are other
gold mines you have to search directly. This includes all of the licensed article,
magazine, reference, news archives, and other research resources that libraries and
some industries buy for those authorized to use them.
As part of your web search strategy, spend a little time looking for databases in your
field or topic of study or research. The contents of these may not be freely available:
libraries and corporations buy the rights for their authorized users to view the
contents. If they appear free, it's because you are somehow authorized to search and
read the contents (library card holder, company employee, etc.).
The Ambiguity Inherent in the Invisible Web:
It is very difficult to predict what sites or kinds of sites or portions of sites will or won't
be part of the Invisible Web. There are several factors involved:
o Which sites replicate some of their content in static pages (hybrid of
visible and invisible in some combination)?
o Which replicate it all (visible in search engines if you construct a
search matching terms in the page)?
o Which databases replicate none of their dynamically generated pages
in links and must be searched directly (totally invisible)?
o Search engines can change their policies on what they exclude and
include.
Want to learn more about the Invisible Web?
• The Wikipedia "Deep Web" article provides a fairly up-to-date summary, with
links to other resources.
36. 10 Search Engines to Explore the Invisible Web
by Saikat Basu March 14, 2010
The Invisible Web refers to the part of the WWW that’s not indexed by the search
engines. Most of us think that search powerhouses like Google and Bing are like
the Great Oracle... they see everything. Unfortunately, they can’t, because they aren’t
divine at all; they are just web spiders that index pages by following one hyperlink
after the other.
But there are some places where a spider cannot enter. Take library databases
which need a password for access. Or even pages that belong to private networks of
organizations. Dynamically generated web pages in response to a query are often
left un-indexed by search engine spiders.
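Why link-following leaves some pages unseen can be shown with a toy model. The sketch below is not a real crawler: the "site" is a made-up dictionary mapping each page to the pages it links to, and the crawl is a simple breadth-first walk from a seed page. Pages that nothing links to, such as a result generated only after someone types a query into a form, are never reached.

```python
from collections import deque


def crawl(links, seed):
    """Return every page reachable from the seed by following hyperlinks."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site used only for illustration.
site = {
    "home": ["about", "catalog-search-form"],
    "about": ["home"],
    "catalog-search-form": [],            # results appear only after a query is typed
    "catalog-result?q=whales": ["home"],  # generated on the fly; no page links to it
    "intranet-report": [],                # private page; no public page links to it
}

if __name__ == "__main__":
    indexed = crawl(site, "home")
    print(sorted(indexed))
    print(sorted(set(site) - indexed))  # the "invisible" part of this toy site
```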
Search engine technology has progressed by leaps and bounds. Today, we have
real time search and the capability to index Flash based and PDF content. Even
then, there remain large swathes of the web which a general search engine cannot
penetrate. The terms Deep Net, Deep Web, and Invisible Web linger on.
To get a more precise idea of the nature of this ‘Dark Continent’ involving the
invisible and web search engines, read what Wikipedia has to say about the Deep
Web. The figures are attention grabbers – the size of the open web is 167 terabytes.
The Invisible Web is estimated at 91,000 terabytes. Check this out – the Library of
Congress, in 1997, was figured to have close to 3,000 terabytes!
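To put those figures in proportion, here is a quick calculation that takes the numbers quoted above at face value (they are the article's estimates, not measurements made here). The variable names are only for illustration.

```python
# Figures as quoted in the article, in terabytes.
open_web_tb = 167
invisible_web_tb = 91_000
library_of_congress_tb = 3_000

# How many times larger is the Invisible Web than the open web?
ratio_to_open_web = invisible_web_tb / open_web_tb

# How many times larger is it than the quoted Library of Congress figure?
ratio_to_library = invisible_web_tb / library_of_congress_tb

if __name__ == "__main__":
    print(f"Invisible Web vs. open web: about {ratio_to_open_web:.0f} times larger")
    print(f"Invisible Web vs. Library of Congress: about {ratio_to_library:.0f} times larger")
```

On these estimates the Invisible Web is roughly 545 times the size of the open web and about 30 times the quoted Library of Congress figure.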
How do we get to this mother lode of information?
That’s what this post is all about. Let’s get to know a few resources which will be our
deep diving vessel for the Invisible Web. Some of these are invisible web search
engines with specifically indexed information.
Infomine
37. Infomine has been built by a pool of libraries in the United States. Some of them are
University of California, Wake Forest University, California State University, and the
University of Detroit. Infomine ‘mines’ information from databases, electronic
journals, electronic books, bulletin boards, mailing lists, online library card catalogs,
articles, directories of researchers, and many other resources.
You can search by subject category and further tweak your search using the search
options. Infomine is not only a standalone search engine for the Deep Web but also a
staging point for a lot of other reference information. Check out its Other Search
Tools and General Reference links at the bottom.
The WWW Virtual Library
This is considered to be the oldest catalog on the web and was started by
Tim Berners-Lee, the creator of the web. So, isn’t it strange that it finds a place in the
list of Invisible Web resources? Maybe, but the WWW Virtual Library lists quite a lot
of relevant resources on quite a lot of subjects. You can go vertically into the
categories or use the search bar. The screenshot shows the alphabetical
arrangement of subjects covered at the site.
Intute
38. Intute is UK centric, but it has some of the most esteemed universities of the region
providing the resources for study and research. You can browse by subject or do a
keyword search for academic topics ranging from agriculture to veterinary medicine. The
online service has subject specialists who review and index other websites that cater
to the topics for study and research.
Intute also provides over 60 free online tutorials for learning effective internet
research skills. The tutorials are step-by-step guides arranged around specific
subjects.
Complete Planet
Complete Planet calls itself the ‘front door to the Deep Web’. This free and
well-designed directory resource makes it easy to access the mass of dynamic databases
that are cloaked from a general purpose search. The databases indexed by
Complete Planet number around 70,000 and range from Agriculture to Weather. Also
thrown in are databases like Food & Drink and Military.
For a really effective Deep Web search, try out the Advanced Search options where
among other things, you can set a date range.
Infoplease
39. Infoplease is an information portal with a host of features. Using the site, you can tap
into a good number of encyclopedias, almanacs, an atlas, and biographies.
Infoplease also has a few nice offshoots like Factmonster.com for kids and Biosearch,
a search engine just for biographies.
DeepPeep
DeepPeep aims to enter the Invisible Web through forms that query databases and
web services for information. Typed queries open up dynamic but short-lived results
which cannot be indexed by normal search engines. By indexing databases,
DeepPeep hopes to track 45,000 forms across 7 domains.
The domains covered by DeepPeep (Beta) are Auto, Airfare, Biology, Book, Hotel,
Job, and Rental. Being a beta service, there are occasional glitches as some results
don’t load in the browser.
IncyWincy
IncyWincy is an Invisible Web search engine and it behaves as a meta-search
engine by tapping into other search engines and filtering the results. It searches the
web, directory, forms, and images. With a free registration, you can track search
results with alerts.
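The meta-search idea, tapping several engines and filtering what comes back, can be sketched as a merge of ranked result lists. This is a simplified illustration and not IncyWincy's actual method: the function name, the round-robin interleaving, the duplicate check, and the blocked-domain filter are all assumptions made for the example.

```python
from itertools import zip_longest


def merge_results(result_lists, blocked_domains=()):
    """Interleave ranked result lists, dropping duplicates and blocked domains."""
    merged, seen = [], set()
    # Take the first result from each engine, then the second from each, and so on.
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is None:  # this engine returned fewer results than the others
                continue
            key = url.rstrip("/").lower()  # treat trailing-slash variants as one page
            if key in seen or any(domain in key for domain in blocked_domains):
                continue
            seen.add(key)
            merged.append(url)
    return merged


if __name__ == "__main__":
    engine_a = ["http://a.example/1", "http://b.example/2"]
    engine_b = ["http://b.example/2/", "http://c.example/3", "http://spam.example/x"]
    print(merge_results([engine_a, engine_b], blocked_domains=("spam.example",)))
```

Round-robin interleaving is only one way to combine rankings; a real meta-search engine may weight engines differently or re-score the combined results.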
DeepWebTech
40. DeepWebTech gives you five search engines (and browser plugins) for specific
topics. The search engines cover science, medicine, and business. Using these topic
specific search engines, you can query the underlying databases in the Deep Web.
Scirus
Scirus has a pure scientific focus. It is a far reaching research engine that can scour
journals, scientists’ homepages, courseware, pre-print server material, patents and
institutional intranets.
TechXtra
41. TechXtra concentrates on engineering, mathematics and computing. It gives you
industry news, job announcements, technical reports, technical data, full text eprints,
teaching and learning resources along with articles and relevant website information.
Just like general web search, searching the Invisible Web is also about looking for
the needle in the haystack. Only here, the haystack is much bigger. The Invisible
Web is definitely not for the casual searcher. It is deep but not dark, because if you
know what you are searching for, enlightenment is a few keywords away.
Do you venture into the Invisible Web? Which is your preferred search tool?
The Invisible Web Databases

Which database might have the information I need?
    Turbo10 | Search user-selected deep Web resources
    Resource Discovery Network | Keyword search
    Complete Planet | Deep Web directory
    Digital Librarian and Librarians Guide to the Internet | Uncover databases

News and magazines
    Google News Search | 30 day news archive (for US, UK, others)
    AltaVista News | Includes New York Times
    1st Headlines | Breaking news in categories (US & World; Business; Health; Lifestyles; Sports; Technology; Weather)
    New York Times, Washington Post, Seattle Times, San Francisco Chronicle | Full-text newspaper archive search (14 or 30 day trials available)
    HeadlineSpot | Search news directory by media, region, subject, opinion
    Directory of Open Access Journals (DOAJ) | Search or browse by subject for peer-reviewed, scientific and scholarly titles
    HeadlineSpot: Magazines | Search magazine directory by subject

Public Radio webcasts
    PublicRadioFan.com | Search database of program listings

History
    Guide to History on the Web | Database of more than 5,000 US and world history sites

Biography
    Galileo Project, Thomas A. Edison Papers | Individuals
    Biography.com | 25,000 people
    Biographical Dictionary | 28,000 short identification information

Countries
    Nations Online Project, Thomas A. Edison Papers | Alphabetical index to government Web pages
    Portals to the World | From the Library of Congress
    World Fact Book | From the CIA
    Infonation | U.N. member nations
    Country Profiles | From the BBC

Data
    Finding and Using Statistical Data

Books (full text)
    Online Books Page | Free e-books

Outstanding literature
    Literature, Math and Science Literature | CA Dept. of Ed. recommended literature for K-12
    HAISLN | Recommended reading lists
    YALSA (ALA) | Outstanding Books for the College Bound

Photographs
    Digital Library Photos | 80,000 images of California and natural world
    Time Life Pictures | Historical and current (Getty Images)

Fine Arts
    National Gallery of Art | Search 17,000 images (check "images only")
    ImageBase | Search 85,000 images in the Fine Arts Museums of SF
    Artcyclopedia | Fine arts search engine
    Contemporary Art | Search by medium and theme

Cross-disciplinary
    Literature, Arts and Medicine Database | Browse or search annotated bibliography of prose, poetry, film, video and art -- comprehensive (adult and young adult fiction) resource for medical humanities

Education
    ERIC | Education journals and other resources; Check "full-text," limit by publication type in advanced search

K-12 curriculum projects
    Blue Web'n | PacBell project
    American Memory | Lessons using primary