Instructional Workshops ~ Fall Quarter 2000

   

C O N T E N T S :

 
Introduction
 

Search Engines and Internet Guides

 
General Academic Internet Guides
 
Subject Specific Internet Guides
 
Evaluating Sites for Quality
 
Glossary of Terms
 
Instructor Information
 
 

Introduction:

The Internet is a vast resource -- and it is growing at an ever-increasing rate. For a researcher, sifting through the overwhelming number of websites in search of scholarly information can be a challenging experience. This guide focuses on how to locate information efficiently and evaluate that information for quality of content.

Why the web is difficult to search:

Many factors make it difficult for researchers to find quality information on the web. Perhaps the biggest is the Internet's size and rate of growth. According to recent studies, the Internet is growing so quickly that the large commercial search engines cannot keep up. Consequently, no single search engine is currently able to index the entire Internet.

Estimated Size:
According to a Cyveillance study released 10 July 2000, the Internet contains over "2 billion unique, publicly accessible pages."

Rate of Growth:
Cyveillance estimates that the Internet will top 4 billion pages within the first few months of 2001. An NEC Research Institute study estimated the size of the Internet at 800 million pages in February 1999. That would be a fivefold increase (a rise of 400%) in just 2 years!
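The growth comparison can be checked with a few lines of arithmetic. The sketch below uses only the two page counts quoted above; the function names are illustrative, and it assumes the early-2001 projection is treated as the end point of the comparison.

```python
# Page-count estimates quoted in this guide.
PAGES_FEB_1999 = 800_000_000      # NEC Research Institute estimate, February 1999
PAGES_EARLY_2001 = 4_000_000_000  # Cyveillance projection for early 2001


def growth_factor(old, new):
    """How many times larger the new figure is than the old one."""
    return new / old


def percent_increase(old, new):
    """The rise from old to new, as a percentage of the old figure."""
    return (new - old) / old * 100


if __name__ == "__main__":
    factor = growth_factor(PAGES_FEB_1999, PAGES_EARLY_2001)
    increase = percent_increase(PAGES_FEB_1999, PAGES_EARLY_2001)
    print(f"Growth factor: {factor:.0f}x")
    print(f"Percent increase: {increase:.0f}%")
```

Note the difference between the two measures: a web that becomes five times as large has grown by 400%, not 500%.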

Search Engine Coverage:
In the same NEC study, in February 1999, no one search engine covered more than 16% of the web. That is roughly half the coverage of December 1997, when no one search engine was able to index more than 33% of the web.

As of 9 January 2001, Google (one of the largest commercial Internet search engines) claims to have a database of 1,326,920,000 pages. Measured against the roughly 4 billion pages that Cyveillance projects for early 2001, the Google database contains only about one third of the publicly indexable web.
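A coverage figure like this depends heavily on which size estimate is used as the denominator. The sketch below compares Google's claimed database size against the two Cyveillance figures quoted in this guide. Which estimate a given coverage claim relies on is an assumption here, and the function name is illustrative.

```python
# Figures quoted in this guide.
GOOGLE_PAGES = 1_326_920_000  # Google's claimed database size, 9 January 2001

WEB_SIZE_ESTIMATES = {
    "Cyveillance study, July 2000": 2_000_000_000,
    "Cyveillance projection, early 2001": 4_000_000_000,
}


def coverage_percent(indexed_pages, total_pages):
    """Share of the estimated web that a search engine has indexed."""
    return indexed_pages / total_pages * 100


if __name__ == "__main__":
    for label, total in WEB_SIZE_ESTIMATES.items():
        share = coverage_percent(GOOGLE_PAGES, total)
        print(f"{label}: {share:.1f}% covered")
```

Against the 4 billion page projection the result is close to one third; against the July 2000 figure of 2 billion pages it is about two thirds.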

The "Hidden" Web:
A large part of the Internet is not accessible to the commercial search engines. These unindexable pages are collectively known as the "hidden" or "invisible" web. Most of this information is contained within web-accessible databases (such as the NU Library catalog) -- when queried, these databases generate web pages "on the fly," which cannot be indexed. Other types of information that cannot be indexed include PDF files, CGI scripts, and Macromedia Flash.
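To illustrate what "on the fly" means, here is a minimal sketch of a database-backed results page. The catalog records, function names, and HTML layout are all hypothetical and do not describe how any real library catalog works. The point is that the page exists only as the answer to a particular query, so a search engine that follows links between stored pages has nothing to find.

```python
from html import escape

# Hypothetical catalog records; a real catalog would be stored in a database.
CATALOG = [
    {"title": "Intro to Web Searching", "author": "Author A"},
    {"title": "Cataloging Basics", "author": "Author B"},
    {"title": "Web Page Design", "author": "Author C"},
]


def search_catalog(query):
    """Return the records whose title contains the query (case-insensitive)."""
    needle = query.lower()
    return [record for record in CATALOG if needle in record["title"].lower()]


def render_results_page(query):
    """Build an HTML page for one query. Nothing is saved to disk."""
    items = "".join(
        f"<li>{escape(record['title'])} ({escape(record['author'])})</li>"
        for record in search_catalog(query)
    )
    return (
        f"<html><body><h1>Results for {escape(query)}</h1>"
        f"<ul>{items}</ul></body></html>"
    )


if __name__ == "__main__":
    # The page is created at the moment someone asks for it.
    print(render_results_page("web"))
```

Every distinct query produces a different page, so there is no fixed set of documents for a search engine to collect in advance.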

Indexing Bias:
Another impediment to searchability is how web pages are indexed. Indexing biases often yield results that are irrelevant to the researcher of scholarly information. Search engines are more likely to index sites that:

  • ... have more links to them from other sites (i.e., are popular).
  • ... are U.S. sites.
  • ... are commercial sites rather than educational sites.
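The first of these biases can be pictured as a simple count of inbound links. The sketch below is an illustration only, not the method of any particular search engine; the site names and link data are made up. It shows how a site that many others link to rises to the top, while a site with no inbound links may never be noticed.

```python
from collections import Counter

# Hypothetical link data: each site maps to the sites it links to.
LINKS = {
    "shop.example.com": ["portal.example.com"],
    "portal.example.com": ["shop.example.com", "news.example.com"],
    "news.example.com": ["shop.example.com", "portal.example.com"],
    "dept.example.edu": ["shop.example.com", "news.example.com"],
}


def inbound_link_counts(links):
    """Count how many other sites link to each site."""
    counts = Counter({site: 0 for site in links})
    for source, targets in links.items():
        for target in set(targets):  # count each linking site only once
            if target != source:     # ignore links from a site to itself
                counts[target] += 1
    return counts


if __name__ == "__main__":
    # Sites with the most inbound links are listed first.
    for site, count in inbound_link_counts(LINKS).most_common():
        print(f"{site}: {count} inbound link(s)")
```

In this made-up data the educational site links out to others but receives no links itself, so a crawler that discovers pages by following links would have no path to it.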

Information Distribution:
Based on NEC Research Institute data:

  • Commercial content: 83%
  • Scientific and educational content: 6%
  • Pornographic content: 1.5%
   
   

Sources for Internet Statistics:

Lawrence, Steve and C. Lee Giles. "Accessibility of Information on the Web." Nature. Vol. 400. 8 July 1999. Pages 107-109.

Lawrence, Steve and C. Lee Giles. "Accessibility and Distribution of Information on the Web." From the NEC Research Institute website: www.wwwmetrics.com

Moore, Alvin and Brian H. Murray. "Sizing the Internet." From the Cyveillance website: www.cyveillance.com/newsroom/pressr/000710.asp

For comparisons of the major Internet search engines, especially their relative and total sizes, see Search Engine Showdown: www.searchengineshowdown.com/stats/sizeest.shtml

Another way of measuring the size of the Internet is by estimating the number of hosts -- one website that currently tracks this number is Telcordia Technologies NetSizer: www.netsizer.com

More information on the Hidden or Invisible web can be found at:
www.freepint.co.uk/issues/080600.htm#feature

For Internet trends and analysis see the NUA Internet Surveys:
www.nua.com/surveys