1. WebSearch Academy
Internet Librarian International
Surfacing the Deep Web
Arthur Weiss
Email: a.weiss@aware.co.uk / Twitter: @awareci
www.marketing-intelligence.co.uk
14 October 2013
© AWARE 2013
Tel: +44 20 8954 9121 • Fax: +44 20 8954 2102 • Web: www.marketing-intelligence.co.uk
2. Not everything can be found with Google…
The ‘Invisible Web’ or ‘Deep Web’ consists of web pages and documents which are not indexed by conventional search engines, or are poorly or incompletely indexed.
3. 5 Types of “Invisibility”
• Not search engine optimised, so pages fail to appear in “simple” searches
• Not indexed by search engines
• Excluded from search index
• Subscription or proprietary content
• Encrypted or non-indexable content
4. Know your tool kit
Standard Google or multiple approaches & tools
5. What do I need to find?
What sort of needle? What sort of haystack?
http://www.morguefile.com/archive/display/21091
6. Why will the information be available?
Where will it be held? (Who will know it?)
Can I obtain it legally and ethically from this source, and if so, how?
If not, are there other sources or ways of obtaining the information?
After obtaining the information, are any checks needed to verify it?
What is the information’s relationship to other information?
7. Not everything is online or can be found!
• Try to find:
Original TV coverage of the storming of the Bastille¹
A newspaper interview with Christopher Columbus, following his return from discovering America
A recording of Abraham Lincoln delivering the Gettysburg Address
A photo of Jesus in his crib (question from a 9-year-old: “Why didn’t anybody take photos with their phones?”)
¹ With thanks to Karen Blakeman of RBA Information (rba.co.uk) for these examples
8. “Forty-two! Is that all you’ve got to show for seven and a half million years’ work?”
“I checked it very thoroughly and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”
Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”
If your search approach is wrong, it doesn’t matter which tool you use, or how you use it. Your results will be poor or wrong.
9. Before starting to search, consider sources for the subject / topic of interest…
Why is information likely to be available?
Consider also file formats, and location of search terms
What search tool / approach is most likely to access or index the information’s location (container)?
Are there unique terms or jargon that lead to a specialist tool?
e.g. lung cancer (consumer) versus pulmonary carcinoma (medical)
Are there societies, organisations, people, or groups that may have information? (Who/where else could have information?)
Would any of the relevant pages be in another language?
“cheap hotel in Dubai” OR “فندق اقتصادي في دبي”
10. Before starting to search: consider search terms for the topic or subject of interest
Are there any synonyms or variant spellings?
Tyre or tire; aluminium or aluminum
Candy or sweet
Basle or Basel
Are there any other words likely to be in documents on the topic?
Are any keywords part of a common phrase?
Are any keywords likely to be in irrelevant documents that should be excluded from searches?
How might the information be written?
“I work for Xcompany” to search for employees of Xcompany
“X is better than” for comparisons
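The checklist above (variant spellings, common phrases, words to exclude) can be turned into a query string mechanically. The sketch below is only an illustration: the function names are hypothetical, the excluded word "bicycle" is an invented example, and it assumes the engine accepts OR in capitals, quotation marks for phrases and a leading minus for exclusions, which should be checked against the help pages of the search tool being used.

```python
def or_group(terms):
    """Join variant spellings or synonyms into a single OR group."""
    # Multi-word alternatives are quoted so they are searched as phrases.
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def build_query(term_groups, phrases=(), exclude=()):
    """Combine OR groups, exact phrases and excluded words into one query."""
    parts = [or_group(group) for group in term_groups]
    parts += [f'"{phrase}"' for phrase in phrases]
    parts += [f"-{word}" for word in exclude]
    return " ".join(parts)


# Variant spellings from the slide; "bicycle" is a hypothetical exclusion.
query = build_query(
    term_groups=[["tyre", "tire"], ["aluminium", "aluminum"]],
    phrases=["is better than"],
    exclude=["bicycle"],
)
print(query)
```

Keeping each group of variants separate makes it easy to add or drop a spelling without rewriting the whole query.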
11. Research Planning
Information Requirements
Break down into individual questions that, when answered, will provide the required knowledge
Don’t start searching without knowing what you are looking for, and why
12. An example research plan
Copy & fill in a sheet for each key information question / topic
Research Topic:
Research Questions (break down topic into answerable questions):
Sources | Search Approach / Parameters | Type of information expected | Comments / Possible problems
LinkedIn | Job title, current employer, etc. | People profiles | May not be accurate or up to date
Google Scholar | Author name, topic, date, etc. | Citations, academic research papers | Doesn’t cover everything
National Statistics | Site search engine | Census & demographic data | May be old or incomplete
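For anyone who prefers to keep the plan sheet in a script rather than on paper, one possible representation is sketched below. The class and field names are assumptions made for illustration, and the topic and question are placeholders; only the three source rows come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class SourcePlan:
    source: str     # where to look
    approach: str   # search approach / parameters
    expected: str   # type of information expected
    problems: str   # comments / possible problems


@dataclass
class ResearchPlan:
    topic: str
    questions: list = field(default_factory=list)
    sources: list = field(default_factory=list)

    def as_sheet(self):
        """Render the plan as plain text, one line per question and source."""
        lines = [f"Research topic: {self.topic}"]
        lines += [f"Question {i}: {q}" for i, q in enumerate(self.questions, 1)]
        for s in self.sources:
            lines.append(f"{s.source} | {s.approach} | {s.expected} | {s.problems}")
        return "\n".join(lines)


plan = ResearchPlan(
    topic="(placeholder topic)",
    questions=["(placeholder question)"],
    sources=[
        SourcePlan("LinkedIn", "Job title, current employer, etc.",
                   "People profiles", "May not be accurate or up to date"),
        SourcePlan("Google Scholar", "Author name, topic, date, etc.",
                   "Citations, academic research papers", "Doesn't cover everything"),
        SourcePlan("National Statistics", "Site search engine",
                   "Census & demographic data", "May be old or incomplete"),
    ],
)
print(plan.as_sheet())
```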
13. Types of “Invisibility”
14. Advanced Searching
• Use advanced search operators and options, e.g. filetype: / intitle: / inurl: / .. (numeric range) and * (wildcard)
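As a rough illustration of how these operators combine, the sketch below assembles a query string from optional parts. The function name, its parameters and the example search terms are hypothetical; the operator spellings are those listed on the slide, and whether a given engine still honours each one should be checked before relying on it.

```python
def advanced_query(terms, filetype=None, intitle=None, inurl=None, number_range=None):
    """Append the chosen advanced operators to a base search string."""
    parts = [terms]
    if filetype:
        parts.append(f"filetype:{filetype}")   # restrict to a file format
    if intitle:
        parts.append(f"intitle:{intitle}")     # word must appear in the page title
    if inurl:
        parts.append(f"inurl:{inurl}")         # word must appear in the URL
    if number_range:
        low, high = number_range
        parts.append(f"{low}..{high}")         # numeric range, e.g. years
    return " ".join(parts)


# "*" inside the quoted phrase stands in for any word (wildcard).
q = advanced_query('"coffee * statistics"', filetype="pdf",
                   intitle="report", number_range=(2010, 2013))
print(q)
```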
15. Search Engines – not just Google
16. Types of “Invisibility”
17. Specialist Search / Deep Web Search
18. Search for Information “Containers”
• Knowing a reason for the information to be available can lead to an information source
Who else would want this information?
Search for topic + “Database”
e.g. “coffee database” (screenshot: first two results)
19. Case Examples – Economics by Country
20. Case Examples – Trade Statistics
21. Case Examples – Economic Indicators
22. Case Examples – Genealogy
23. Types of “Invisibility”
24. Proprietary sites / Blocked from Index
• Register for password-protected sites
• Use site search or site map, if available
• If a robots.txt file exists, the paths it asks search engines not to crawl may point to the hidden pages, e.g. nytimes.com/robots.txt
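A robots.txt file is plain text, so the paths a site has asked crawlers to skip can be listed with a few lines of code. The sketch below works on a made-up sample (the paths and the example.com address are invented for illustration) rather than fetching a live file; the function name is hypothetical. A path appearing in the list does not mean the page is open to visitors, and the earlier questions about legal and ethical access still apply.

```python
def disallowed_paths(robots_txt):
    """Return the Disallow paths found in a robots.txt body, without duplicates."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        value = value.strip()
        if key.strip().lower() == "disallow" and value and value not in paths:
            paths.append(value)
    return paths


# Hypothetical robots.txt content, not taken from any real site.
SAMPLE = """\
User-agent: *
Disallow: /private/   # hypothetical path
Disallow: /search
Allow: /public/
Sitemap: https://example.com/sitemap.xml
"""

print(disallowed_paths(SAMPLE))
```

To apply it to a real site, download that site's /robots.txt first and pass the text to the function.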
25. Types of “Invisibility”
26. Content that can’t / won’t be indexed
• Non-textual information e.g. multimedia / audiovisual
Bing has search operators that can find RSS feeds (hasfeed:) and pages that link to specific types of file (e.g. mp3 files – contains:mp3)
Search for related textual information e.g. descriptions, or sources (e.g. artwork or film titles)
• Encrypted information / .onion sites
The Tor Project (torproject.org) and the Tor Browser
Access encrypted sites via proxy servers
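The two Bing operators mentioned above could be wrapped in small helpers like the ones below. This is a sketch only: the function names and the example topics are hypothetical, the operator syntax follows my understanding of Bing's advanced-operator help, and support for these operators may change over time.

```python
def media_query(topic, extension):
    """Query for pages about a topic that link to files of the given type."""
    return f"{topic} contains:{extension}"


def feed_query(term, site=None):
    """Query for pages offering RSS/Atom feeds for a term, optionally on one site."""
    query = f"hasfeed:{term}"
    return f"site:{site} {query}" if site else query


print(media_query("jazz", "mp3"))                      # hypothetical topic
print(feed_query("football", site="www.nytimes.com"))  # hypothetical term
```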
27. Searching TOR
• On regular Google: fake passport site:onion.to
28. TOR / .Onion Sites
29. Any Questions?
Arthur Weiss is the managing director of AWARE, a UK-based consultancy specialising in marketing & competitive intelligence analysis.
Contact Details:
Web site: www.marketing-intelligence.co.uk
E-mail: a.weiss@aware.co.uk
Twitter: @awareci
Telephone: +44 20 8954 9121
Fax: +44 20 8954 2102