Searching the Visible and Invisible Web
Presentation by Laura Gordon-Murnane, MLS
| Expectations for Searching the Web |
The Growing Importance of Using the Web to Find Information
A recent report published by the Pew Internet & American Life
Project, "Counting on the Internet," concluded that the
Internet has become a mainstream information tool. Its
popularity and dependability have raised all Americans'
expectations about the information and services available
online. The survey revealed that when Americans are looking for
health care information, government information, news, and
commerce, about two-thirds say that they expect to find such
information on the web. Americans expect to find the information
on the Internet; how successful they are in finding what they
need is the challenge.
------
How Do People Find the Information They Want?
In July of 2002, the Pew Internet Project released a memo
stating that search engines have become an indispensable utility
for Internet users. More than eight in ten American Internet
users have gone to search engines to find information on the
Web. More than one in four U.S. Internet users--about 33 million
adults--present queries on search engines on a typical day.
Topics can range from the ridiculous ("How many times does my
name come up on Google?") to the sublime ("Where was Buddha
born?") to the heartbreaking ("My mom has breast cancer--I need
information fast"). But the strategies are similar for all these
questions--type keywords into a search engine and go from there.
------
Are They Successful in Finding the Information They Need?
"According to a recent study from iProspect,
three-quarters of Internet users use search engines.
However, 16 percent of Internet users only look at the
first few search results, while 32 percent will read
through to the bottom of the first page."
"The study also indicates that 52.1 percent of
Internet users choose the same search engine or directory
when searching for information, while 35 percent
alternate among a number of favorite search
properties."
"Only 7.5 percent of Internet users said they
refined their search with additional keywords in cases
where they were unable to achieve satisfactory
results."
Most search engine users only look at the first
few results and hope that the information they need
is contained in those first few results.
------
Search Engine Loyalty and the Success of the Searches
The iProspect survey also revealed user allegiance,
with more than half (52.1 percent) usually using the same
search engine or directory, and almost 35 percent using
several different ones interchangeably. Only 13 percent
said that they use different search engines for different
types of searches.
Interestingly, fewer than half (45.9 percent) felt that
their searches were successful almost all of the
time, but when they are unsuccessful, 27.2 percent of the
respondents switch to another search engine, rather than
refining the search with more words (7.5 percent).
One-third of the survey participants indicated a success
rate three-quarters of the time, and 13.3 percent found
what they were looking for half of the time.
------
Conclusions to be Drawn
1) The expectation is that the information is on the
internet.
2) Users assume that search engines will find the
information they want.
3) Fewer than half of users (45.9 percent) feel that their
searches succeed almost all of the time.
4) Another 13.3 percent find what they are looking for only
half of the time.
These are not great searching results!
| Limitations of Search Engines |
One of the great benefits of the Internet is that anyone
with a connected computer can get useful information without much
effort. You just type a topic into Google and up comes a long
list of Web pages that might tell you about it.
But as brilliant as Google is, this process has
several limitations. First of all, in most cases Google doesn't
actually provide you an answer, just a list of links to Web pages
where information might be found. So getting the exact
information you want requires more steps: You have to browse
through the links Google offers, pick out one that looks good,
then go to it and look for the relevant material.
Second, you're doing all this in a general,
undifferentiated piece of software called a Web browser that
isn't designed to help you drill down into information.
Third, neither the browser nor Google gives you
a good sense of the credibility of the sources that turn up, just
their popularity.
| Definitions of The Visible and Invisible Web |
Definition 1: Visible Web
The Visible Web consists of material found on the web that general search engines (Google, AltaVista, AlltheWeb, etc.) can find and make searchable and easily accessible.
Definition 2: Invisible Web
The Invisible Web consists of material found "on the web" that general search tools (Google, AltaVista, Teoma, etc.) cannot, will not, or do not crawl/index/make searchable and easily accessible.
| On the Web vs. Via the Web |
Information "On the Web"
Information "Via the Web"
Variations To Be Aware Of (Particularly from End User Perspective)
| How Do "on the Web" Search Engines Work? |
Search Engines Consist of Three Parts
1. The Web Crawler
2. The Indexer
3. The Query Processor
Myth: All Search Engines are Alike
1. All search engines are not the same
2. Comprehensiveness, currency, and coverage differ
3. Interface, syntax, and capabilities are all unique
4. Different algorithms lead to different results
5. If you don't find what you need using commercial search engines, that could indicate that you need to consider an invisible web tool
Learn more than one search engine
| Web Crawlers (Spiders) |
The web crawler or spider finds and retrieves web pages.
The web crawler identifies pages in 2 ways:
a. "Add URL" form
b. Harvesting hypertext links embedded on the page
Points to consider
1) Harvesting of links creates a very large pool of pages to visit.
The crawler must determine if the page has already been visited
and if it is already in the search engine's index. If
the URL is already in the index, the crawler has to determine if
the information is still current or if the information in the
search engine's index is out of date and needs to be
updated.
2) Crawling the web is resource-intensive.
3) Do not assume that every search engine will crawl and index a
site's entire set of pages.
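The link-harvesting step can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular engine works: the three-page TOY_WEB, the example.test addresses, and the fetch function are all made up for the example, and no real network access takes place.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkHarvester(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page retrieved.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that returns the HTML for a URL, or None if it cannot be retrieved.
    """
    pages = {}
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        harvester = LinkHarvester()
        harvester.feed(html)
        for href in harvester.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:       # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical toy web: orphan.html exists but nothing links to it.
TOY_WEB = {
    "http://example.test/": '<a href="a.html">A</a>',
    "http://example.test/a.html": '<a href="/">home</a>',
    "http://example.test/orphan.html": "<p>Nobody links here.</p>",
}

crawled = crawl(["http://example.test/"], TOY_WEB.get)
print(sorted(crawled))
```

The orphan page is never retrieved because no harvested link points to it, which is the "disconnected page" situation: a page that is on the web but invisible to a link-following spider.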
| Search Engine Indexes |
1. The indexer indexes every word on every page and stores it in a huge database.
2. The search engine stores the full text of the pages, which allows it to offer more than just simple single keyword matching.
3. It can offer proximity searching that will match multi-word phrases, sentences, and bigger sections of text.
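As a rough sketch of why storing word positions makes phrase matching possible, the following Python builds a tiny positional index. The two sample pages and the function names are invented for illustration; real engines use far more elaborate structures.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to {page_id: [positions]} across all pages."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(page_id, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of pages where the phrase's words appear consecutively."""
    words = tokenize(phrase)
    if not words:
        return []
    hits = []
    for page_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every later word must sit exactly `offset` positions after the first.
            if all(start + offset in index.get(word, {}).get(page_id, [])
                   for offset, word in enumerate(words)):
                hits.append(page_id)
                break
    return sorted(hits)


# Hypothetical sample pages.
PAGES = {
    "p1": "The invisible web is large.",
    "p2": "The web is invisible to some crawlers.",
}

index = build_index(PAGES)
keyword_hits = sorted(index["web"])                  # single keyword matching
phrase_hits = phrase_search(index, "invisible web")  # phrase matching
print(keyword_hits, phrase_hits)
```

Both pages contain the keyword "web", but only the first contains the phrase "invisible web" with the words side by side.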
| The Query Processor |
1. Most Complex Piece of the Search Engine
2. Query Processor has three parts: the search form, the engine that processes the request, and the results page.
3. Search Form and Results page are similar for all web search engines.
4. Key Difference between search engines: the way relevance is calculated
a. Statistical Analysis of Text
b. Link Analysis
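To make the two relevance approaches concrete, here is a deliberately simplified Python sketch. The scoring functions are stand-ins chosen for the example (word frequency for statistical analysis, a count of inbound links for link analysis); the pages, links, and query are hypothetical, and no engine's real formula is implied.

```python
def term_frequency_score(query_words, page_words):
    """Statistical analysis: fraction of the page's words that match the query."""
    if not page_words:
        return 0.0
    matches = sum(1 for word in page_words if word in query_words)
    return matches / len(page_words)


def inbound_link_counts(links):
    """Link analysis: count how many other pages link to each page.

    `links` maps a page to the set of pages it links to.
    """
    counts = {}
    for source, targets in links.items():
        for target in targets:
            if target != source:  # ignore self-links
                counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical pages (already split into words) and link graph.
PAGES = {
    "a": "web search web search".split(),
    "b": "a guide to web search".split(),
}
LINKS = {"a": {"b"}, "b": {"a"}, "c": {"b"}}
QUERY = {"web"}

inbound = inbound_link_counts(LINKS)
by_text = sorted(PAGES, key=lambda p: term_frequency_score(QUERY, PAGES[p]), reverse=True)
by_links = sorted(PAGES, key=lambda p: inbound.get(p, 0), reverse=True)
print(by_text, by_links)
```

The same two pages come back in opposite orders depending on the method, which is one reason different engines return different results for the same query.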
| Myth: Search Engine Indexes are Current |
Remember: search engines search their index, not
the current web.
Crawling the web is resource-intensive, so the search
engine has to determine how frequently it will recrawl a page.
Many search engines have increased their recrawl rates.
However, the Web is too large for any one search engine to provide
comprehensive coverage: too many new pages, too expensive
to recrawl every new page. This is for the Visible Web.
What about the Invisible Web? Why can't crawlers
find Invisible Web pages?
| Why Search Engines Can't Find Pages |
| Technical Issues: Invisible Web |
A) Opaque Web
1) Disconnected Pages: no links to find the page
2) Page not Submitted to an Engine/Engines (this is a secondary way engines learn of new content)
3) Depth of Crawl: the crawler does not crawl the entire site. (tip: use the search engine at the source for better search results)
4) Maximum Number of Viewable Results
5) Size Limitation: each engine decides how much of the page it will crawl
a. Google only indexes the first 101k of the page
b. AllTheWeb indexes the entire page
c. Altavista indexes the first 110k of the page
6) Frequency of Crawl: content on the page is not current
a. Each engine is different
b. Many pages have a minimum of 30-45 days after discovery
7) Recrawl Rates: each engine is different
B) Private Web
1) Robots Exclusion Protocol ("Don't Crawl and Index My Content")
2) Noindex Meta Tag: specifies a page or pages the crawler is not supposed to index
3) Firewall is in place: only those authorized to gain access are allowed in
4) Password protected pages
C) Proprietary Web
1) Password Protection
2) Firewall: only those authorized to gain access are allowed in
3) Use legacy database systems available long before the Web existed
D) Invisible Web
1) Non-html Text: search engines are designed to index html text; audio, video, and images are hard for search engines to understand (Altavista and HotBot can do limited searching for non-text files)
2) Multiple Formats: not every format is crawled by every search engine. pdf (postscript), flash, shockwave, executable programs, and compressed files are technically indexable but until recently were ignored by search engines. Indexing these formats is resource-intensive. Google, AllTheWeb, and Altavista are now indexing pdf files, and MSN Search has included word, excel, and powerpoint formats. (Take a look at ResearchIndex (NEC Research Institute): it indexes pdf files and also creates a citation index, making it easy to locate related documents.)
3) Codes and Frames are difficult for web search engines
a. Frames are not accessible
b. Javascript pop-up windows: spiders don't follow the javascript commands
4) Registration Forms: the form cannot be completed by the spider, blocking access to the information
a. Relational Databases
1. Why do web content developers use databases? Flexibility, easily maintained
2. Web front ends provide access to proprietary systems that are now open; these databases are available via the web
5) Dynamically Generated Pages: the search engine refuses to crawl any material past the "?"; the spider sees the "?" and stops
http://us.imdb.com/Tsearch?title=igby+goes+down&restrict=Movies+only&GO.x=21&GO.y=8
a. Spider Traps
b. Sites generated dynamically from a database (.cfm, .asp, .cgi)
6) Different pages crawled by different engines (no overlap). No two engines are alike (Lycos and AllTheWeb use the same database).
7) Real-Time Content: spiders do not crawl/recrawl in real time (too much information, and no real good reason to spider this type of information, such as stock quotes, weather, or airline flight arrival/departure information). Trip.com (track flights)
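Three of the technical barriers above (the Robots Exclusion Protocol, the "?" in dynamically generated URLs, and the page size limit) can be imitated with Python's standard library. This is a sketch under stated assumptions: the robots.txt rules and the example.test address are invented, the "stop at the ?" rule is the simplified behavior described above, and "101k" is read here as 101 * 1024 bytes.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumption: "the first 101k of the page" interpreted as 101 * 1024 bytes.
INDEX_LIMIT_BYTES = 101 * 1024

# Hypothetical robots.txt rules for the example.
ROBOTS_LINES = ["User-agent: *", "Disallow: /private/"]


def why_skipped(url, robots_lines, user_agent="*"):
    """Return the reason a simple spider would skip `url`, or None if it would fetch it."""
    robots = RobotFileParser()
    robots.parse(robots_lines)
    if not robots.can_fetch(user_agent, url):
        return "blocked by robots.txt"
    if urlsplit(url).query:  # anything after the "?" marks a dynamic page
        return "dynamic URL (contains '?')"
    return None


def indexed_portion(page_bytes, limit=INDEX_LIMIT_BYTES):
    """Return only the part of the page that falls inside the size limit."""
    return page_bytes[:limit]


private_reason = why_skipped("http://example.test/private/report.html", ROBOTS_LINES)
dynamic_reason = why_skipped(
    "http://us.imdb.com/Tsearch?title=igby+goes+down&restrict=Movies+only&GO.x=21&GO.y=8",
    ROBOTS_LINES,
)
plain_reason = why_skipped("http://example.test/index.html", ROBOTS_LINES)
print(private_reason, dynamic_reason, plain_reason)
```

Anything past the limit in indexed_portion is simply never seen by the indexer, so words that appear only near the end of a very long page cannot be found by searching.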
| Social/Cultural Reasons Information on the Web is Invisible |
1) Many People Only Use a Single Search Tool: learn more than one search tool
2) Internet searchers only look at the first few results
| Why Use Invisible Web Sources? |
1) Specialized Content Focus: more comprehensive, subject-specific results
2) Specialized Search Interface: more control over search input and output
3) Increased precision and recall
4) Invisible Web Resources: highest level of authority
5) Answer may not be available anywhere else.
| When to Use Invisible Web Resources? |
1) When you are familiar with a subject.
2) When you are familiar with a specific search tool.
3) When you are looking for a precise answer.
4) When you want authoritative, exhaustive results.
5) When timeliness of content is an issue.
A recent article in the Chronicle of Higher Education: "New Allies in the Fight Against Research by Googling: Faculty members and librarians slowly start to work together on courseware" (March 21, 2003). Introduce students to authoritative sources, not just Google.
· Google is not the only research tool available
· Subject Specific Databases
· Highly authoritative research tools
· Information is already organized for optimal results
| Invisible Web Resources and Specialized Search Tools |
Architecture/Art/Museums
Metropolitan Museum of Art Online Collection
The National Gallery of Art (Washington D.C.)
Search the entire collection by artist name
or title of work. Images are available for many items.
Search form: http://www.nga.gov/search/search.htm
Gateway to Information
This Directory is provided as a means to
access nearly 3,000 libraries and library departments with
specialized holdings in art, architecture, and archaeology
throughout the world. Data recorded for each institution includes
address, telephone and tele-facsimile numbers, hours of
operation, annual closings, and listings of professional
personnel. It also includes electronic mail addresses of
individual librarians and direct web links to institutional home
pages. Provided by the IFLA (International Federation of Library
Associations and Institutions) Section of Art Libraries.
Search form: See Main Page
| Other Examples of Invisible Web Resources |
Public Company Filings
EdgarIQ (Public Company Filings)
One of many interfaces to the SEC EDGAR material. EdgarIQ
provides free real-time online access and full-text search of
the EDGAR system.
Search form: See Main Page
Other Sources: FreeEdgar
Other Sources: Securities and Exchange Commission Edgar
Telephone Numbers
Anywho.Com (Telephone Directory)
One of many phone directory databases on the
Internet. Anywho.com offers residential and business
listings.
Customized Maps and Driving Directions
This service uses data from the US Census
and a supplementary list of cities around the world to find the
latitude and longitude of two places, and then calculates the
distance between them.
Search form: See Main Page
Clinical Trials
The U.S. National Institutes of Health,
through its National Library of Medicine, has developed current
information about clinical research studies.
Search form
Entertainment
Internet Movie Database
Search Form: http://us.imdb.com/search
Patents
U.S. Patent Databases (U.S. Patent and Trademark Office)
Numerous searching options including full
text and bibliographic databases. Full text of all
Search form: See Main Page
Related resource 1: (Australia) Australian Patent Databases
Related resource 2: (Canada) Patent Database
Related resource 3: (U.K.) Patent Search
Library Catalogs
Library of Congress Online Catalog
The Library of Congress Online Catalog ( http://catalog.loc.gov/) is a
database of approximately 12 million records representing books,
serials, computer files, manuscripts, cartographic materials,
music, sound recordings, and visual materials in the Library's
collections. The Online Catalog also provides references, notes,
circulation status, and information about materials still in the
acquisitions stage.
Search form: http://catalog.loc.gov/
Related resource 1: Library of Congress Archival Finding Aids
| Examples of Specialized Search Tools |
Video/Audio Search
PBS NewsHour with Jim Lehrer - Keyword Search, Watch Video/Listen to Audio
National Public Radio Audio Archives
Speechbot - search for radio programs by keyword, across many
radio programs (uses speech recognition software to create a
transcript of the program and then builds an index of the words
spoken during the program).
The FeedRoom - Real-Time
TV News Text Transcripts
CapitolHearings.org - Listen/Watch Senate Hearings (Live)
Web Archives
The WayBack Machine
The Wayback Machine, a service from the Internet Archive and
Alexa Internet, allows people to access and use archived versions
of stored websites. Visitors to the Wayback Machine can type in
an URL, select a date, and then begin surfing on an archived
version of the web. The Wayback Machine is built so that it can
be used and referenced by anybody and everybody.
Cached Pages
Google
Google News
Daypop
Incy Wincy (A small web engine, many pages cached in November, 2002)
Yuntis (An experimental engine from State University of New York, Stony Brook)
Gigablast: another search engine that caches pages
9/11/01 Television Archive: Collection of Television News Broadcasts following 9/11
| Specialized Web Search Resources by Major Web Search Engines |
AllTheWeb
AlltheWeb News
Comment: The AlltheWeb News spider crawls 3,000 news sites (both
national and international) continuously. The database is
separate from the Web index.
AlltheWeb Multimedia Catalogs
ATW searches millions of videos, images, and sound files.
Pictures
Includes advanced format options
File Formats: (jpg, gif, bmp), Type (color, b&w, line art),
Background (transparent and non-transparent)
Videos
Limits include: Format (Real, QuickTime, AVI) and Stream/Download
Audio Files
MP3 Files
No Advanced Searching Features Available at this time.
------------------------------------------------------------------------------
Google
Google Catalogs (BETA)
Advanced Interface: http://catalogs.google.com/advanced_catalog_search
Google has made it easy to find information published in
mail-order catalogs that were not previously available online.
Search the full text of over 4,500 mail-order catalogs from US
companies. Google scans each catalog page into an image file and
uses optical character recognition (OCR) software to find the
keywords embedded in the scanned image files.
Google News
Advanced interface: Not Available
Google News presents information culled from approximately 4,500
news sources worldwide and automatically arranged to present the
most relevant news first. Topics are updated continuously
throughout the day, so you will see new stories each time you
check the page. Google has developed an automated grouping
process for Google News that pulls together related headlines and
photos from thousands of sources worldwide -- enabling you to see
how different news organizations are reporting the same story.
You pick the item that interests you, then go directly to the
site which published the account you wish to read.
Google News is highly unusual in that it
offers a news service compiled solely by computer algorithms
without human intervention.
You can trace the history of a developing
issue by clicking the "sort by date" function on the
page containing all reports on a given topic. This will arrange
the stories in chronological order, with the most recent report
placed first.
Uncle Sam
Limits your search to .gov, .mil sites, and some
state material.
Google's Image Search consists of more than 425 million images
indexed and available for viewing.
Google analyzes the text on the page adjacent to the image,
the image caption and dozens of other factors to determine the
image content. Google also uses sophisticated algorithms to
remove duplicates and ensure that the highest quality images are
presented first in your results.
Be careful about copyright issues. Google does not
provide copyright permission.
-----------------------------------------------------------------------------------------------------------------------
Altavista
Altavista News
AltaVista gathers news from 3,000 worldwide sources. They receive
news feeds from Moreover Technologies and news sites like the New
York Times or Forbes, and other news sources. News stories are
updated continuously. AltaVista provides news search
functionality in
AltaVista Multimedia Search
Access to approximately 118 million images, videos, and sound
files. Again, stay on top of all copyright issues before
using this material. Most of the material found in these
databases is not directly accessible via the primary AV
interface. Some of AV's advanced syntax will work with the
multimedia engines. In addition, AltaVista has a few paying
partners who provide a direct feed into the database. For
example, a search of the video database will find new video
content from MSNBC.
AltaVista Images
Advanced interface
Limits include: Type (color, b&w, banners)
AltaVista Audio
Advanced interface
Limits include: Format (mp3, wav, etc.) and Stream/Download,
Duration (Less or Greater than 1 Minute)
AltaVista Video
Advanced interface
Limits include: Format (Avi, Quicktime, MPEG, etc.) and
Stream/Download, Duration (Less or Greater than 1 Minute)
Comment: If you want video content from events in the news, use
the Advanced Video Search interface and limit your search to only
MSNBC material.
Teoma
Instead of ranking results based on the sites with the most links
leading to them, Teoma analyzes the Web as it is organically
organized (in naturally-occurring communities that are about
or related to the same subject) to determine which sites are
most relevant. Teoma is the only search technology that can
locate communities on the Web within their specific subject
areas, as they actually exist. And this allows us to finely tune
our search process, providing more precise results.
To determine the authority (and thus the overall quality
and relevance) of a site's content, Teoma uses Subject-Specific
Popularity(SM). Subject-Specific Popularity ranks a
site based on the number of same-subject pages that reference it,
not just general popularity.
Example: new source review
Refine
Teoma organizes sites into naturally occurring communities that
are about the subject of each search query. These communities are
presented under the heading "Refine" on the
Teoma.com results page. This tool allows a user to further focus
his or her specific search.
Results
Next, after identifying these communities, Teoma employs a
technique called Subject-Specific Popularity(SM).
Subject-Specific Popularity analyzes the relationship of sites
within a community, ranking a site based on the number of
same-subject pages that reference it, among hundreds of other
criteria.
Resources
Finally, by dividing the Web into local subject communities,
Teoma is able to find and identify expert resources about a
particular subject. These sites feature lists of other
authoritative sites and links relating to the search topic.
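The idea of counting only same-subject references can be illustrated with a toy Python sketch. To be clear, this is not Teoma's algorithm, which is proprietary and uses many other criteria: the subject labels, page names, and link graph below are all hypothetical, and the sketch assumes each page has already been assigned to a subject community.

```python
def general_popularity(links):
    """Count inbound links to each page from any other page."""
    scores = {}
    for source, targets in links.items():
        for target in targets:
            if target != source:
                scores[target] = scores.get(target, 0) + 1
    return scores


def subject_specific_popularity(links, subjects, subject):
    """Count inbound links to each page, but only from pages labeled with `subject`."""
    scores = {}
    for source, targets in links.items():
        if subjects.get(source) != subject:
            continue  # references from outside the community do not count
        for target in targets:
            if target != source:
                scores[target] = scores.get(target, 0) + 1
    return scores


# Hypothetical pages, subject labels, and links.
SUBJECTS = {
    "bike1": "cycling", "bike2": "cycling",
    "news1": "news", "news2": "news", "news3": "news",
}
LINKS = {
    "bike1": {"trek"}, "bike2": {"trek"},
    "news1": {"portal"}, "news2": {"portal"}, "news3": {"portal"},
}

overall = general_popularity(LINKS)
cycling = subject_specific_popularity(LINKS, SUBJECTS, "cycling")
print(overall, cycling)
```

Here "portal" wins on general popularity, but within the cycling community only "trek" is referenced at all, so it ranks first for a cycling query.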
| Keeping Current: Resources to Monitor |
Gary Price's ResourceShelf.com
Librarians' Index to the Internet
InfoMine
Marylaine Block's Neat New Stuff I Found on the Web This Week
Free Pint
Internet Resources Newsletter (Monthly)
LLRX and LLRX Buzz
Scout Report & Scout Report Archive
Cool Tools and Tips
Must Have
Google Toolbar | UltraBar Tool
--------------------------------------------------------------------------------------------------
Google Search Tips
Linked pages: Find external pages that point to a URL
Search Syntax: link:http://www.website.com
Example: link:http://www.nytimes.com
(who's linking to whom?)
Restrict your search to a specific site
Search Syntax: query site:www.domain.com
Example: Library of Congress
Similar pages: Find pages that are related to a result
(web pages with similar or close contents and topics)
Search Syntax: related:www.website.com
Example: TrekBikes.com
related:www.trekbikes.com
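For anyone who builds these searches in a script or a bookmark, the three operators can be assembled with small helper functions. This is a minimal Python sketch: the helper names are invented, it assumes the site: operator takes a bare host name, and it only builds and URL-encodes the query text without contacting any search engine.

```python
from urllib.parse import quote_plus


def link_query(url):
    """Query for pages that link to `url`."""
    return f"link:{url}"


def site_query(terms, host):
    """Query for `terms`, restricted to pages on `host`."""
    return f"{terms} site:{host}"


def related_query(host):
    """Query for pages similar to `host`."""
    return f"related:{host}"


def encode_for_url(query):
    """Encode a query so it can be placed in a URL's query string."""
    return quote_plus(query)


linked = link_query("http://www.nytimes.com")
restricted = site_query("invisible web", "www.domain.com")
similar = related_query("www.trekbikes.com")
print(linked)
print(restricted)
print(similar)
print(encode_for_url(restricted))
```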
| The Need for Specialized Tools & Knowledge of the Invisible Web |
· The Web and General Web Databases will Continue to Grow Larger
· Existing and New Specialized Databases will be Released and Made Available
· To improve the chances of finding information, these specialized databases will increase in importance.
· In Many Cases, Invisible and Specialized Tools Have Interfaces "Customized" for the Specific Data in the Database, for example the General Accounting Office advanced search.
· Ability to Sort, View Data in Ways Specific to the Data Set: use the search tool that is designed to search the database or material, and you will have much greater control over the search and, hopefully, better results
· Bigger Databases Translate to More Recall, Lower Precision (more pages, not necessarily results that are on target)
· Focused Databases, Smaller Universe of Materials to Search Through
· Greater Ability To "Work With the Data" (Sort, Limit, etc.)
· The Authority of the Author is Increasingly Important; You Know Where It's Coming From
· In Many Cases, You Don't Start Searching for a Phone Number in an Encyclopedia
· The Right Tool for the Job: Encyclopedia, Phone Numbers
· Think Resources, "Learn" Them Like You Learn Traditional Reference Tools
· LexisNexis, Dialog, Offer Many Databases Depending on the Information You're Looking to Access
· Deciding Where and What to Search is a Skill That Info Professionals Have
· Even Larger Databases to Search Through (the databases grow larger)
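The recall and precision trade-off mentioned above can be worked through with a small Python calculation. The document ids and result lists are invented for the example; precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against the known relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical: four documents actually answer the question.
RELEVANT = {"d1", "d2", "d3", "d4"}

# A big general database returns all four, buried among 16 off-target pages.
general_results = ["d1", "d2", "d3", "d4"] + [f"x{i}" for i in range(16)]

# A focused database returns three of the four and nothing else.
focused_results = ["d1", "d2", "d3"]

general_scores = precision_recall(general_results, RELEVANT)
focused_scores = precision_recall(focused_results, RELEVANT)
print(general_scores, focused_scores)
```

In this made-up case the general database has perfect recall but only one result in five is on target, while the focused database misses one document but every result it returns is useful.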
| Conclusions |
· Be Aware of the Limitations of General Engines
· No Single Engine Indexes the Entire Web
· Use More than a Single Engine
· Even if it "Might" Be Accessible in a General Engine, Would a Focused Engine Get it To You More Quickly?
Internet Archive Example (10 Billion Pages Archived)
ImagesCanada: Typical Search in Google? Where is the ImagesCanada Database?
· The Challenge is Learning the Many Different Resources Available (both Visible and Invisible) and Being Able to Access Them Quickly
· Web Collection Development will be More Important: Building a Collection, Knowing What's Available (both visible and invisible web tools)
· Think of Specialized and Invisible Web Tools Like You Think of your Reference Collection
· Future? Federated Searching, Broadcast Searching: searching multiple databases at one time
For Our End Users, Products Like MuseGlobal Offer Great Promise
Can Handle Any Database, Merge Results into a Single List, Remove Duplicates
Customized for Each Library, Collection Development is Important
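The core of federated (broadcast) searching, sending one query to several databases, merging the results into a single list, and removing duplicates, can be sketched in Python as follows. This is an illustration only and does not describe how MuseGlobal or any other product works: the two search functions, their results, and the example.test addresses are hypothetical stand-ins for real database connectors.

```python
def federated_search(query, sources):
    """Send `query` to every source and merge the results into one list.

    `sources` maps a source name to a search function that returns a list
    of (url, title) tuples. Duplicate URLs are kept only once.
    """
    merged = []
    seen = set()
    for name, search in sources.items():
        for url, title in search(query):
            # Crude normalization (an assumption of this sketch): ignore
            # letter case and a trailing slash when comparing URLs.
            key = url.rstrip("/").lower()
            if key in seen:
                continue
            seen.add(key)
            merged.append({"url": url, "title": title, "source": name})
    return merged


# Hypothetical connectors standing in for two separate databases.
def search_catalog(query):
    return [("http://example.test/a", "Record A"),
            ("http://example.test/b", "Record B")]


def search_archive(query):
    return [("http://example.test/a/", "Record A (archive copy)"),
            ("http://example.test/c", "Record C")]


results = federated_search(
    "invisible web",
    {"catalog": search_catalog, "archive": search_archive},
)
for item in results:
    print(item["source"], item["url"], item["title"])
```

The archive's copy of Record A is dropped as a duplicate, so the user sees one merged list of three records drawn from both databases.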