Searching the Visible and Invisible Web

Presentation by Laura Gordon-Murnane, MLS
lgmurnane@yahoo.com
ARLIS Conference,
Baltimore, MD
March 23, 2003

This Page is Available at:
http://tinyurl.com/7ws4

 

Expectations for Searching the Web 

 

The Growing Importance of Using the Web to Find Information

A recent report published by the Pew Internet & American Life Project, “Counting on the Internet,” concluded that “the Internet has become a mainstream information tool.  Its popularity and dependability have raised all Americans’ expectations about the information and services available online.”

 

The survey revealed that about two-thirds of all Americans say they expect to find information about health care, government, news, and commerce on the web.

 

Americans expect to find the information on the Internet. The challenge is how successful they are in finding what they need.

 

------

How do People Find the Information they Want? 

In July 2002, the Pew Internet Project released a memo stating that search engines have become an indispensable utility for Internet users looking for information online.

 

“Search engines have become an indispensable utility for Internet users. More than eight in ten American Internet users have gone to search engines to find information on the Web. More than one in four U.S. Internet users--about 33 million adults--present queries on search engines on a typical day. Topics can range from the ridiculous ("How many times does my name come up on Google?") to the sublime ("Where was Buddha born?") to the heartbreaking ("My mom has breast cancer--I need information fast"). But the strategies are similar for all these questions--type keywords into a search engine and go from there.”

“In fact, the Pew Internet Project has found that search engines are the most popular way to locate a variety of types of information online--including health information, government information and religious information.”

“In all, 85% of American Internet users have ever used an online search engine to find information on the Web and 29% of Internet users rely on a search engine on a typical day. Only the act of sending or reading email outranks search-engine queries as an online activity--some 52% of Internet users check e-mail on a typical day.”

 

------

Are they Successful in Finding the Information they Need?

"According to a recent study from iProspect, three-quarters of Internet users use search engines. However, 16 percent of Internet users only look at the first few search results, while 32 percent will read through to the bottom of the first page."

"The study also indicates that 52.1 percent of Internet users choose the same search engine or directory when searching for information, while 35 percent alternate among a number of favorite search properties."

"Only 7.5 percent of Internet users said they refined their search with additional keywords in cases where they were unable to achieve satisfactory results."

Most search engine users only look at the first few results and hope that the information they need is contained in those first few results. 

------

Search Engine Loyalty and the Success of the Searches

The iProspect survey also revealed user allegiance, with more than half (52.1 percent) usually using the same search engine or directory, and almost 35 percent using several different ones interchangeably. Only 13 percent said that they use different search engines for different types of searches.

Interestingly, fewer than half (45.9 percent) felt that their searches were successful almost all of the time. When a search is unsuccessful, 27.2 percent of the respondents switch to another search engine rather than refining the search with more words (7.5 percent). One-third of the survey participants indicated a success rate three-quarters of the time, and 13.3 percent found what they were looking for half of the time.

------

Conclusions to be Drawn

1) The expectation is that the information is on the Internet.

2) The expectation is that search engines will find the information you want.

3) Fewer than half of searchers (45.9 percent) feel that their searches succeed almost all of the time.

4) Another 13.3 percent find what they are looking for only half of the time.

These are not great searching results!

 

Limitations of Search Engines


“One of the great benefits of the Internet is that anyone with a connected computer can get useful information without much effort. You just type a topic into Google and up comes a long list of Web pages that might tell you about it.

But as brilliant as Google is, this process has several limitations. First of all, in most cases Google doesn't actually provide you an answer, just a list of links to Web pages where information might be found. So getting the exact information you want requires more steps: You have to browse through the links Google offers, pick out one that looks good, then go to it and look for the relevant material.

Second, you're doing all this in a general, undifferentiated piece of software called a Web browser that isn't designed to help you drill down into information.

Third, neither the browser nor Google gives you a good sense of the credibility of the sources that turn up, just their popularity.” Walter Mossberg, Wall Street Journal, March 6, 2003.

 

 

Definitions of the Visible and Invisible Web

 

Definition 1: Visible Web


The Visible Web consists of material found “on the web” that general search engines (Google, AltaVista, AlltheWeb, etc.) can find, make searchable, and make easily accessible.

Definition 2: Invisible Web

The Invisible Web consists of material found "on the web" that general search tools (Google, AltaVista, Teoma, etc.) cannot, will not, or do not crawl, index, or make searchable and easily accessible.

 

“On the Web” vs. “Via the Web”

 

Information "On the Web"
Anyone with server access can place just about anything "on it"
Very little bibliographic control
No language control
Quality of information varies
Cost is low or free
Examples:  Yahoo, House of Representatives
 

Information "Via the Web"
Using an Internet connection and browser to access traditional/commercial databases and resources
Databases not directly searchable via web search tools
Information is highly structured and well indexed
The quality of the information is uniformly high, often professional resources
Invisible Web materials can be free, low cost, or expensive
Proprietary-Cost Can Vary, Often Expensive
In Many Situations This Is Not Realized By The End User
Types of materials
Full-Text of Peer Reviewed Journals
Indexed
Abstracts
Examples: Dialog, Lexis-Nexis, Factiva

Variations To Be Aware Of (Particularly from End User Perspective)
Search Free, No Subscription Cost, Pay-Per-Article

 

How Do “on the Web” Search Engines Work?

Search Engines Consist of Three Parts

1. The Web Crawler

2. The Indexer

3. The Query Processor

Myth – All Search Engines are Alike

1. Not all search engines are the same

2. They differ in comprehensiveness, currency, and coverage

3. Interface, syntax, capabilities – all unique

4. Different algorithms – lead to different results

5. If you don’t find what you need using commercial search engines, that could indicate that you need to consider an Invisible Web tool

Learn more than one Search Engine

Web Crawlers  (Spiders) 

The web crawler, or spider, finds and retrieves web pages.

The web crawler identifies pages in two ways:

a.   Pages submitted through the search engine’s “Add URL” form

b.  Hypertext links harvested from the pages it visits

Points to consider

1)  Harvesting of links creates a very large pool of pages to visit.  The crawler must determine whether a page has already been visited and whether it is already in the search engine’s index.  If the URL is already in the index, the crawler has to determine whether the indexed copy is still current or is out of date and needs to be updated.

2)  Crawling the web is resource-intensive.

3)  Do not assume that every search engine will crawl and index a site’s entire set of pages.
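
The link-harvesting step can be illustrated with a short Python sketch. It is a toy under stated assumptions, not how any particular search engine works: the page content, the example.com and example.org addresses, and the names LinkHarvester, harvest, visited, and frontier are all made up for illustration. It shows the two ideas from the points above: links are pulled out of a page, and a link is queued only if it has not been seen before.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkHarvester(HTMLParser):
    """Collects the href of every <a> tag found on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def harvest(page_url, html, visited, frontier):
    """Add unseen links from one page to the list of pages still to visit."""
    parser = LinkHarvester(page_url)
    parser.feed(html)
    for link in parser.links:
        if link not in visited and link not in frontier:
            frontier.append(link)
    visited.add(page_url)


# Hypothetical page content, used only for illustration.
sample_html = (
    '<a href="/about.html">About</a> '
    '<a href="http://example.org/">Other site</a> '
    '<a href="/about.html#staff">Staff</a>'
)
visited = set()
frontier = []
harvest("http://example.com/index.html", sample_html, visited, frontier)
print(frontier)
```

The third link points to the same page as the first (only the fragment differs), so it is queued once. A real crawler would also need the recrawl decision described in point 1, which this sketch leaves out.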

 

Search Engine Indexes

1. The indexer indexes every word on every page and stores the result in a huge database.

2. The search engine stores the full text of the pages, which allows it to offer more than simple single-keyword matching.

3. It can offer proximity searching that matches multi-word phrases, sentences, and bigger sections of text.
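
To make the idea of a word index concrete, here is a minimal Python sketch of an index that records where each word occurs, which is what makes phrase matching possible. The three sample "pages" and the function names build_index and phrase_search are assumptions made for illustration; real search engine indexes are far larger and more sophisticated.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to {page_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word][page_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return the pages where the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return []
    matches = []
    for page_id, starts in index[words[0]].items():
        for start in starts:
            # Each following word must sit exactly one position further along.
            if all(start + offset in index[word].get(page_id, [])
                   for offset, word in enumerate(words)):
                matches.append(page_id)
                break
    return sorted(matches)


# Hypothetical page texts, used only for illustration.
pages = {
    "page1": "modern art museum in baltimore",
    "page2": "museum of modern art",
    "page3": "art of the modern museum",
}
index = build_index(pages)
print(phrase_search(index, "modern art"))
```

All three pages contain both words, but only page1 and page2 contain them side by side, so only those two match the phrase.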

The Query Processor

1. Most complex piece of the search engine

2. The Query Processor has three parts – the search form, the engine that processes the request, and the results page.

3. The search form and results page are similar for all web search engines.

4. Key difference between search engines – the way relevance is calculated

a.   Statistical Analysis of Text

b.  Link Analysis
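
The two approaches to relevance can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the formula of any real search engine: text_score treats "statistical analysis of text" as the share of a page's words that match the query, and inbound_link_counts treats "link analysis" as a count of pages linking in. The function names and the sample data are assumptions.

```python
def text_score(query, page_text):
    """Statistical analysis of text: share of the page's words that match query terms."""
    words = page_text.lower().split()
    terms = set(query.lower().split())
    if not words:
        return 0.0
    return sum(1 for word in words if word in terms) / len(words)


def inbound_link_counts(links):
    """Link analysis: count how many other pages link to each page."""
    counts = {}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical data: one page text, and a map of which page links to which.
score = text_score("art", "art libraries and art museums")
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
counts = inbound_link_counts(links)
print(score, counts)
```

Two engines that weigh these signals differently will rank the same pages in a different order, which is one reason the same query returns different results in different engines.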

Myth – Search Engine Indexes are Current 

Remember – Search engines search their index – not the current web. 

Crawling the web is resource intensive – the search engine has to determine how frequently it will recrawl the page.

Many search engines have increased their recrawl rates. 

However, the Web is too large for any one search engine to provide comprehensive coverage.  There are too many new pages, and it is too expensive to recrawl every page.  And this is just the Visible Web.

What about the Invisible Web?  Why can’t crawlers find Invisible Web pages?

Why Search Engines Can’t Find Pages

Technical Issues – Invisible Web

A) Opaque Web

1)   Disconnected Pages – no links to find the page

2)   Page not Submitted to an Engine/Engines (this is a secondary way engines learn of new content)

3)   Depth of Crawl – the crawler does not crawl the entire site. (tip: use the search engine at the source for better search results)

4)   Maximum Number of Viewable Results

5)   Size Limitation – Each engine decides how much of the page it will crawl

a.   Google only indexes the first 101k of the page

b.  AllTheWeb – indexes the entire page

c.   AltaVista – indexes the first 110k of the page

6)   Frequency of Crawl – Content on the page is not current

a.   Each engine is different

b.  Many pages have a minimum of 30-45 Days after Discovery

7)   Recrawl Rates – Each Engine is Different

B) Private Web

1)   Robots Exclusion Protocol  (Don't Crawl and Index My Content)

2)   Noindex Meta Tag – marks a specific page or pages that the crawler is not supposed to index.

3)   Firewall is in place – Only those authorized to gain access are allowed in

4)   Password protected pages
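
A small Python sketch shows how a well-behaved crawler might honor the Robots Exclusion Protocol. It uses the standard-library urllib.robotparser module; the robots.txt content, the example.com addresses, and the crawler name "ExampleBot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a crawler calling itself "ExampleBot" may fetch each page.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
print(allowed, blocked)
```

Anything under /private/ is off limits to the crawler, so those pages never reach the search engine's index even though they are on the public web.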

C) Proprietary Web

1)   Password Protection

2)   Firewall – Only those authorized to gain access are allowed in

3)   Use of legacy database systems – available long before the Web existed

D) Invisible Web

1)   Non-HTML Content – Search engines are designed to index HTML text. Audio, video, and images are hard for search engines to understand (AltaVista and HotBot can do limited searching for non-text files).

2)   Multiple Formats – Not every format is crawled by every search engine. PDF, PostScript, Flash, Shockwave, executable programs, and compressed files are technically indexable but until recently were ignored by search engines. Indexing these formats is resource-intensive. Google, AllTheWeb, and AltaVista are now indexing PDF files, and MSN Search has included Word, Excel, and PowerPoint formats.  (Take a look at ResearchIndex (NEC Research Institute), which indexes PDF files and also creates a citation index that makes it easy to locate related documents.)

3)   Code and frames are difficult for web search engines

a.   Frames are not accessible

b.  JavaScript pop-up windows – spiders don’t follow the JavaScript commands

4)   Registration Forms – the spider cannot complete the form, which blocks access to the information

a.   Relational Databases

1. Why do web content developers use databases? Flexibility, easily maintained

2. Web front ends provide access to proprietary systems that are now open; these databases are available “via the web”

5)   Dynamically Generated Pages – the search engine refuses to crawl any material past the “?” in a URL; the spider sees the “?” and stops. Example: http://us.imdb.com/Tsearch?title=igby+goes+down&restrict=Movies+only&GO.x=21&GO.y=8

a.   Spider Traps

b.  Sites generated dynamically from a database (.cfm, .asp, .cgi)

6)   Different pages crawled by different engines (no overlap) – No two engines are alike (Lycos and AllTheWeb use the same database).

7)   Real-Time Content – spiders do not crawl or recrawl in real time: there is too much information and no really good reason to spider this type of information (stock quotes, weather, airline flight arrival/departure information). Example: Trip.com (track flights)
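
The "?" rule for dynamically generated pages (item 5 above) can be illustrated with Python's standard urllib.parse module. This is only a sketch of the idea; the function name is_dynamic is an assumption, and real crawlers apply their own, more varied rules.

```python
from urllib.parse import urlsplit, parse_qs


def is_dynamic(url):
    """Return True when the URL carries a query string (the part after "?")."""
    return bool(urlsplit(url).query)


url = "http://us.imdb.com/Tsearch?title=igby+goes+down&restrict=Movies+only&GO.x=21&GO.y=8"
parts = urlsplit(url)

print(is_dynamic(url))        # does the URL contain a "?" with parameters?
print(parts.path)             # the part before the "?"
print(parse_qs(parts.query))  # the parameters a cautious spider never sees
```

Everything after the "?" is a request to a database rather than a fixed page, which is why a spider that stops at the "?" never sees the results that a person filling in the search form would get.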

 

 

Social/Cultural Reasons Information on the Web is Invisible 

1)  Many People Only Use a Single Search Tool – learn more than one search tool

2)  Internet searchers only look at the first few results

 

Why Use Invisible Web Sources?  

 

1) Specialized Content Focus – more comprehensive results, subject specific

2) Specialized Search Interface – more control over search input and output

3) Increased precision and recall

4) Invisible Web Resources – highest level of authority

5) Answer may not be available anywhere else.

When to Use Invisible Web Resources?

1) When you are familiar with a subject.

2) When you are familiar with a specific search tool.

3) When you are looking for a precise answer.

4) When you want authoritative, exhaustive results.

5) When timeliness of content is an issue.

Recent article in the Chronicle of Higher Education: “New Allies in the Fight Against Research by Googling: Faculty members and librarians slowly start to work together on courseware” (March 21, 2003).  Introduce students to authoritative sources – not just Google.

 

· Google is not the only research tool available

· Subject Specific Databases

· Highly authoritative research tools

· Information is already organized for optimal results

 

 

Invisible Web Resources and Specialized Search Tools  

 

 

Architecture/Art/Museums

 

Metropolitan Museum of Art Online Collection

The Metropolitan Museum's online collection currently includes the entire Department of European Paintings and fifty highlights from each of the Museum's seventeen other curatorial departments, as well as fifty each from the Museum's libraries and from the database of the Antonio Ratti Textile Center. (As digitization of images proceeds, more objects will be made available online.)
Search form: http://www.metmuseum.org/collections/search.asp

National Gallery of Art (Washington D.C.)

Search the entire collection by artist name or title of work. Images are available for many items.
Search form: http://www.nga.gov/search/search.htm

 

Gateway to Information

Art Library Directory (IFLA)

This Directory is provided as a means to access nearly 3,000 libraries and library departments with specialized holdings in art, architecture, and archaeology throughout the world. Data recorded for each institution includes address, telephone and tele-facsimile numbers, hours of operation, annual closings, and listings of professional personnel. It also includes electronic mail addresses of individual librarians and direct web links to institutional home pages. Provided by the IFLA (International Federation of Library Associations and Institutions) Section of Art Libraries.
Search form: See Main Page

 

Other Examples of Invisible Web Resources

 

Public Company Filings

EdgarIQ (Public Company Filings)

One of many interfaces to this SEC EDGAR material. EdgarIQ provides free real-time online access and full-text search of the EDGAR system.

Search form: See Main Page
Other Sources: FreeEdgar
Other Sources: Securities and Exchange Commission Edgar

 

Telephone Numbers

Anywho.Com (Telephone Directory)

One of many phone directory databases on the Internet. Anywho.com provides residential and business listings.

 

Customized Maps and Driving Directions

How Far Is It?

This service uses data from the US Census and a supplementary list of cities around the world to find the latitude and longitude of two places, and then calculates the distance between them.
Search form: See Main Page
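
The kind of calculation such a service performs can be sketched with the haversine (great-circle) formula. This is an assumption about the general technique, not a description of how How Far Is It? is actually implemented. The function name great_circle_km, the Earth radius of 6371 km, and the rounded coordinates for Baltimore and Washington, D.C. are illustrative.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Distance in km between two latitude/longitude points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate coordinates, rounded for illustration only.
baltimore = (39.29, -76.61)
washington = (38.91, -77.04)
distance = great_circle_km(*baltimore, *washington)
print(round(distance), "km")
```

The answer depends on values stored in a database (the coordinates of each place) and is computed on request, so there is no fixed page for a spider to crawl.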

 

Clinical Trials

ClinicalTrials.gov

The U.S. National Institutes of Health, through its National Library of Medicine, has developed current information about clinical research studies.
Search form

 

Entertainment

Internet Movie Database
Search Form: http://us.imdb.com/search

 

Patents

U.S. Patent Databases (U.S. Patent and Trademark Office)

Numerous searching options including full text and bibliographic databases. Full text of all US patents issued since January 1, 1976, and full-page images of each page of every US patent issued since 1790.
Search form: See Main Page
Related resource 1: (Australia) Australian Patent Databases
Related resource 2: (Canada) Patent Database
Related resource 3: (U.K.) Patent Search

 

Library Catalogs

Library of Congress Online Catalog

The Library of Congress Online Catalog (http://catalog.loc.gov/) is a database of approximately 12 million records representing books, serials, computer files, manuscripts, cartographic materials, music, sound recordings, and visual materials in the Library's collections. The Online Catalog also provides references, notes, circulation status, and information about materials still in the acquisitions stage.
Search form: http://catalog.loc.gov/
Related resource 1: Library of Congress Archival Finding Aids

 

 

Examples of Specialized Search Tools

Video/Audio Search
PBS NewsHour with Jim Lehrer - Keyword Search, Watch Video/Listen to Audio

National Public Radio – Audio Archives
Speechbot - searches across radio programs by keyword (uses speech recognition software to create a transcript of each program and then builds an index of the words spoken during the program).
The FeedRoom - Real-Time TV News Text Transcripts
CapitolHearings.org - Listen/Watch Senate Hearings (Live)

Web Archives
The WayBack Machine 

The Wayback Machine, a service from the Internet Archive and Alexa Internet, allows people to access and use archived versions of stored websites. Visitors to the Wayback Machine can type in a URL, select a date, and then begin surfing on an archived version of the web. The Wayback Machine is built so that it can be used and referenced by anybody and everybody.

Cached Pages

Google
Google News
Daypop
Incy Wincy (A small web engine, many pages cached in November, 2002)
Yuntis (An experimental engine from State University of New York, Stonybrook)

Gigablast – another search engine that caches pages

 

9/11/01 Television Archive – Collection of Television News Broadcasts following 9/11

 

Specialized Web Search Resources by Major Web Search Engines

AllTheWeb

AlltheWeb News
Comment: The AlltheWeb News spider crawls 3,000 news sites (both national and international) continuously.  The database is separate from the Web index.

AlltheWeb Multimedia Catalogs
ATW searches millions of videos, images, and sound files.

Pictures
Includes advanced format options
File Formats: (jpg, gif, bmp), Type (color, b&w, line art), Background (transparent and non-transparent)

Videos:
Limits include: Format (Real, QuickTime, AVI) and Stream/Download

Audio Files
MP3 Files
No Advanced Searching Features Available at this time.

 

------------------------------------------------------------------------------

Google

Google Catalogs (BETA)

Simple Search

Advanced Interface: http://catalogs.google.com/advanced_catalog_search

Google has made it easy to find information published in mail-order catalogs that were not previously available online.  Search the full text of over 4,500 mail-order catalogs from US companies. Google scans each catalog page to create an image file, and optical character recognition (OCR) software then finds the keywords in the scanned images.

Google News
 
Advanced interface: Not Available
Google News presents information culled from approximately 4,500 news sources worldwide and automatically arranged to present the most relevant news first. Topics are updated continuously throughout the day, so you will see new stories each time you check the page. Google has developed an automated grouping process for Google News that pulls together related headlines and photos from thousands of sources worldwide -- enabling you to see how different news organizations are reporting the same story. You pick the item that interests you, then go directly to the site which published the account you wish to read.

Google News is highly unusual in that it offers a news service compiled solely by computer algorithms without human intervention.

You can trace the history of a developing issue by clicking the "sort by date" function on the page containing all reports on a given topic. This will arrange the stories in chronological order, with the most recent report placed first.

Uncle Sam
 

Limits your search to .gov, .mil sites, and some state material. 

Google Images

Google's Image Search consists of more than 425 million images indexed and available for viewing.

Google analyzes the text on the page adjacent to the image, the image caption and dozens of other factors to determine the image content. Google also uses sophisticated algorithms to remove duplicates and ensure that the highest quality images are presented first in your results.

Be careful about copyright issues.  Google does not provide copyright permission.

 

-----------------------------------------------------------------------------------------------------------------------

Altavista

Altavista News
AltaVista gathers news from 3,000 worldwide sources. They receive news feeds from Moreover Technologies and news sites like the New York Times or Forbes, and other news sources. News stories are updated continuously. AltaVista provides news search functionality in Australia, Canada, Germany, India, Ireland, New Zealand, the United Kingdom and the United States.

AltaVista Multimedia Search
Access to approximately 118 million images, videos, and sound files. Again, stay on top of all copyright issues before using this material. Most of the material found in these databases is not directly accessible via the primary AV interface. Some of AV's advanced syntax will work with the multimedia engines. In addition, AltaVista has a few paying partners who provide a direct feed into the database. For example, a search of the video database will find new video content from MSNBC. 

AltaVista Images 
 
Advanced interface
Limits include: Type (color, b&w, banners)

AltaVista Audio
 
Advanced interface
Limits include: Format (mp3, wav, etc.) and Stream/Download, Duration (Less or Greater than 1 Minute)

AltaVista Video
Advanced interface
Limits include: Format (Avi, Quicktime, MPEG, etc.) and Stream/Download, Duration (Less or Greater than 1 Minute)
Comment: If you want video content from events in the news, use the Advanced Video Search interface and limit your search to only MSNBC material. 

 

Teoma
Instead of ranking results based on the sites with the most links leading to them, Teoma analyzes the Web as it is organically organized—in naturally-occurring communities that are about or related to the same subject—to determine which sites are most relevant. Teoma is the only search technology that can locate communities on the Web within their specific subject areas, as they actually exist. And this allows us to finely tune our search process, providing more precise results.

To determine the authority, and thus the overall quality and relevance, of a site's content, Teoma uses Subject-Specific Popularity (SM). Subject-Specific Popularity ranks a site based on the number of same-subject pages that reference it, not just general popularity.

Example: new source review

Refine
Teoma organizes sites into naturally occurring communities that are about the subject of each search query. These communities are presented under the heading "Refine" on the Teoma.com results page. This tool allows a user to further focus his or her specific search.

Results
Next, after identifying these communities, Teoma employs a technique called Subject-Specific Popularity (SM). Subject-Specific Popularity analyzes the relationship of sites within a community, ranking a site based on the number of same-subject pages that reference it, among hundreds of other criteria.

Resources
Finally, by dividing the Web into local subject communities, Teoma is able to find and identify expert resources about a particular subject. These sites feature lists of other authoritative sites and links relating to the search topic.

 

Keeping Current: Resources to Monitor

Gary Price’s – ResourceShelf.com
Librarians' Index to the Internet
InfoMine 
Marylaine Block’s – Neat New Stuff I found on the Web this Week
Free Pint
Internet Resources Newsletter (Monthly)

The Virtual Chase

LLRX and LLRX Buzz
Scout Report & Scout Report Archive

Searchenginewatch.com

Search Engine Showdown
 

Cool Tools and Tips

Must Have
Google Toolbar  | UltraBar Tool

--------------------------------------------------------------------------------------------------

Google Search Tips

Linked page: Find external pages that point to a URL
Search Syntax:  link:http://www.website.com
Example: link:http://www.nytimes.com 
(who’s linking to whom?)

Restrict your search to a specific site
Search Syntax:
query site:www.domain.com
Example: Library of Congress

Similar pages : Find pages that are related to a result (web pages with similar or close contents and topics)
Search Syntax: related:www.website.com
Example: TrekBikes.com related:www.trekbikes.com
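
The three operators above can be assembled programmatically, as in this hedged Python sketch. The helper names (link_query, site_query, related_query, search_url) are made up for illustration, the search term "catalog" and domain "loc.gov" are example values, and the site: operator is given a bare domain here as an assumption about the usual form.

```python
from urllib.parse import urlencode


def link_query(url):
    """Find pages that link to the given URL."""
    return f"link:{url}"


def site_query(terms, domain):
    """Restrict a keyword search to one site."""
    return f"{terms} site:{domain}"


def related_query(domain):
    """Find pages similar to the given site."""
    return f"related:{domain}"


def search_url(query):
    """Build a Google search address with the query safely encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


print(link_query("http://www.nytimes.com"))
print(related_query("www.trekbikes.com"))
print(search_url(site_query("catalog", "loc.gov")))
```

The same strings can simply be typed into the Google search box; the helpers only show how the pieces fit together.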

 

 

The Need for Specialized Tools & Knowledge of the Invisible Web

· The Web and General Web Databases will Continue to Grow Larger

· Existing and New Specialized Databases will be released and made available.

· To improve the chances of finding information, these specialized databases will increase in importance.

· In Many Cases, Invisible Web and Specialized Tools Have Interfaces "Customized" for the Specific Data in the Database – for example, the General Accounting Office advanced search.

· Ability to Sort and View Data in Ways Specific to the Data Set – use the search tool that is designed to search the database or material; you will have much greater control over the search and, hopefully, better results

· Bigger Databases Translate to More Recall, Lower Precision (more pages, not necessarily results that are on target)

· Focused Databases Mean a Smaller Universe of Materials to Search Through

· Greater Ability To "Work With the Data" (Sort, Limit, etc.)

· The Authority of the Author is Increasingly Important; You Know Where It's Coming From

· In Many Cases, You Don't Start Searching for a Phone Number in an Encyclopedia

· The Right Tool for the Job – Encyclopedia, Phone Numbers

· Think Resources; "Learn" Them Like You Learn Traditional Reference Tools

· LexisNexis and Dialog Offer Many Databases, Depending on the Information You're Looking to Access

· Deciding Where and What to Search is a Skill That Info Professionals Have

· Even Larger Databases to Search Through (the databases grow larger)
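
The recall and precision trade-off mentioned in the list above can be worked through with a small Python example. Precision is the share of retrieved results that are relevant; recall is the share of all relevant documents that were retrieved. The document names and the two result sets below are invented purely to illustrate the arithmetic.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one set of search results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents actually answer the question.
relevant = {"d1", "d2", "d3", "d4"}

# A big general database returns all four, buried among 16 off-target pages.
big_results = ["d1", "d2", "d3", "d4"] + [f"x{i}" for i in range(16)]

# A focused database returns four results, three of them on target.
focused_results = ["d1", "d2", "d3", "y1"]

print(precision_recall(big_results, relevant))
print(precision_recall(focused_results, relevant))
```

In this made-up case the big database has perfect recall but only one result in five is useful, while the focused database misses one document but three of its four results are on target.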

 

 

Conclusions

· Be Aware of the Limitations of General Engines

· No Single Engine Indexes the Entire Web

· Use More than a Single Engine

· Even if it "Might" Be Accessible in a General Engine, Would a Focused Engine Get it To You More Quickly?
Internet Archive Example (10 Billion Pages Archived)
ImagesCanada: In a Typical Google Search, Where is the ImagesCanada Database?

· The Challenge is Learning the Many Different Resources Available (both Visible and Invisible) and Being Able to Access Them Quickly

· Web Collection Development will be More Important
Building a Collection, Knowing What's Available – both visible and invisible web tools

· Think of Specialized and Invisible Web Tools Like You Think of your Reference Collection

· Future? Federated Searching, Broadcast Searching – searching multiple databases at one time
For Our End Users, Products Like MuseGlobal Offer Great Promise
Can Handle Any Database, Merge Results into a Single List, Remove Duplicates
Customized for Each Library, Collection Development is Important
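
The "merge results into a single list, remove duplicates" step of federated searching can be sketched in Python. This is a toy illustration of the general idea, not a description of how MuseGlobal or any other product works. The function name merge_results, the source names, the records, and the rule that two results are duplicates when their URLs match are all assumptions.

```python
def merge_results(result_lists):
    """Merge ranked lists from several databases into one list without duplicates."""
    merged = []
    seen = set()
    longest = max((len(results) for results in result_lists.values()), default=0)
    # Interleave: take the first hit from each source, then the second, and so on.
    for rank in range(longest):
        for source, results in result_lists.items():
            if rank < len(results):
                record = results[rank]
                # Treat URLs that differ only in case or a trailing slash as the same.
                key = record["url"].lower().rstrip("/")
                if key not in seen:
                    seen.add(key)
                    merged.append({**record, "source": source})
    return merged


# Hypothetical results from two imaginary databases.
results = {
    "catalog_a": [
        {"title": "Baltimore Museum of Art", "url": "http://example.org/bma"},
        {"title": "Art Libraries", "url": "http://example.org/arlis"},
    ],
    "catalog_b": [
        {"title": "BMA", "url": "http://example.org/bma/"},
        {"title": "Patents", "url": "http://example.org/patents"},
    ],
}
merged = merge_results(results)
for record in merged:
    print(record["source"], record["title"])
```

Four results come in and three go out, because both databases returned the same museum page. Real products face harder matching problems, since the same item rarely has an identical address in two databases.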