The Invisible Web

Invisible web? No, we are not talking about the latest spidey invention. The invisible web, a term coined by Mike Bergman, is that part of the WWW (World Wide Web) that has not been indexed by popular search engines like Google, Yahoo and Bing. Simply put, there are pages and websites which exist on the internet but do not show up on search engines! Yes, Google-nuts out there, you got it right. Google simply CANNOT 'see' the invisible web. How many of us have revered Google as the great Oracle, the all-seeing eye which scoops up every byte of information on the web? The truth, however, stands in striking contrast with the common perception. The popular search engines can search only a fraction of the data available on the internet, a fraction termed the “searchable web” or “surface web”.

Here are the cold, hard facts. Google indexes about 8 billion pages on the web. The surface web comprises about 250 billion pages, which means Google manages to index only about 3 percent of it. Find your faith in the Google-God wavering? Well then, brace for impact! The 'Invisible Web' or 'Deep Web', veiled from the eye, is estimated to be about 500 times larger than the surface web and is an ever-expanding repository of information. As per Wikipedia estimates, the surface web holds about 167 terabytes of data, which pales in comparison to the humongous 91,000 terabytes of the invisible web. The web is like an ocean of information: search engines touch only its surface, while the multitude of information lies hidden in the chasms, unruffled and untapped.

What cloaks the invisible web

The power of divination through which search engines find pages is not so divine after all. They merely use robot 'spiders' which crawl the web, indexing information and jumping from one hyperlink to the next, thus covering ever more web pages. Although these crawlers index a vast number of relevant pages, there are places a search-engine spider cannot reach. Think of a web page which is not linked from any other page on the web, i.e., a page that has no backlinks or inlinks. Traditional crawling cannot discover such pages, which makes them invisible. There are also many private web pages which require registration or a password for data retrieval; search spiders can reach such doors but cannot enter, as they lack the required key. Further, some web-page creators do not want their pages crawling with search spiders, so they add 'meta-tags' to their pages that tell crawlers to stay away. And there are technical hurdles these crawlers cannot leap: scripted content, pages served over the Gopher and FTP protocols (Google crawls over HTTP), dynamic pages which are generated in response to queries or accessed through forms, and contextual web pages, to name a few. In a nutshell, there remain huge chunks of information that these spiders cannot wrap their software-y arms around (despite having eight of them), and thus the term Invisible Web or Cloaked Web lingers.
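To make the spider's plight concrete, here is a minimal, illustrative sketch of a crawler written in Python using only the standard library. It follows hyperlinks outward from a seed page and skips any page carrying a 'noindex' robots meta-tag. The seed URL is just a placeholder, and real crawlers are vastly more sophisticated; this is a sketch of the idea, not production code.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class SpiderParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []        # hyperlinks found on the page
        self.noindex = False   # True if the page opts out of indexing

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        # <meta name="robots" content="noindex"> tells spiders to skip the page
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "noindex" in (attrs.get("content") or "").lower():
                self.noindex = True

def crawl(seed, limit=20):
    index, queue, seen = [], [seed], {seed}
    while queue and len(index) < limit:
        url = queue.pop(0)
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # login walls and non-HTTP protocols stop the spider here
        parser = SpiderParser()
        parser.feed(page)
        if not parser.noindex:
            index.append(url)          # only opted-in pages enter the index
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:   # a page nothing links to is never enqueued
                seen.add(absolute)
                queue.append(absolute)
    return index

print(crawl("https://example.com/"))   # placeholder seed URL

Each blind spot described above maps to a line here: a page with no inbound links never enters the queue, a login wall raises an exception, and a 'noindex' meta-tag keeps the page out of the index.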

Goldmine of info or just useless junk?

The Invisible Web is not only gigantic in size but also surpasses the surface web in the quality of its content. According to expert claims, it contains 550 billion individual documents which cater to informational, research and market demands. Its content is more focused and of higher quality than that of the surface web. Moreover, the information found in the Invisible Web is largely free from commercial motives. Non-profit organisations and research entities, which do not enjoy the same levels of advertisement as commercial ventures, are often sidelined in traditional search, so much of the most original and authoritative information remains cloaked in the Deep Web. The Invisible Web is an untapped gold mine of information, and the methods to unveil it unfold in the very next lines.

Seek and ye shall find

Is it really possible to delve deep into this ocean? Can we expand our horizons beyond the traditional search engines? Can we 'SEE' what Google cannot?
Yes, we can. There are ways of peering into the darkest corners of the web. The best approach for serious, professional data extraction is to make liberal use of directories and databases. Directories contain large collections of links that enable browsing by subject area, and their entries are manually evaluated and annotated in a systematic manner, which ensures quality over quantity. Databases should be used in a manner complementary to the search engines. Also, when you are looking for dynamically changing information, it is good to use search engines that can extract data from the invisible web; such engines trawl the spaces of the deep web and fish out valuable data.
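To see why this hidden data needs special tools, here is a hedged sketch in Python of querying a form-fronted database. The endpoint and field names are hypothetical stand-ins, not a real service; the point is that the results page only comes into existence after a form is submitted, which is precisely what a link-following spider never does.

from urllib.parse import urlencode
from urllib.request import urlopen

def query_database(term):
    # A link-following crawler never composes this request body, so the
    # page this call returns is never seen by a traditional engine.
    form_data = urlencode({"q": term, "category": "journals"}).encode()
    with urlopen("https://example.org/db/search", data=form_data, timeout=5) as resp:
        return resp.read().decode("utf-8", "replace")

print(query_database("invisible web")[:200])   # hypothetical endpoint

Deep-web search engines like the ones listed below automate exactly this kind of form submission against thousands of databases.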

Some of the useful ones are listed below:

· DeepPeep: It aims to extract data from databases by querying the web forms that front them. Auto, Airfare, Biology, Book, Hotel, Job, and Rental are the basic domains it covers.

· Scirus: It is used strictly for science-oriented results and indexes over 450 million science-related pages. It has been successful in indexing a large number of journals, scientists' homepages, patents, scholarly reports and articles.

· CompletePlanet: CompletePlanet calls itself “the front door to the deep web”. It gives you access to over 70,000 databases which are searchable under various categories, and it is updated frequently.

· Gigablast: It is an up-and-coming search engine and indexes over 200 billion pages (Google indexes only about 8 billion pages). It can also index non-HTML files like Excel files, Word files and PDF documents.

· The WWW Virtual Library: This is one of the oldest catalogs on the web and lists a lot of information under various categories.

· Infomine: Infomine comprises a pool of libraries in the United States; the University of California, University of Detroit, Wake Forest University and California State University are some of the prominent ones. Searchers can browse by the category they are interested in.

· IncyWincy: It is a meta-search engine which takes the results of other search engines and filters them to produce its own. It searches the web, directories, forms, and images. (A rough sketch of this fan-out-and-merge idea follows the list.)

· LexiBot: It makes multiple queries and effectively searches the invisible web.

· DeepWebTech: It has five search engines which cover the fields of business, medicine and science. It searches the underlying databases of the invisible web for data.

· Searchability: It lists various subject-specific search engines.
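As promised above, here is a rough sketch of the meta-search idea behind a tool like IncyWincy: fan a single query out to several engines and merge the de-duplicated results. The two endpoints and their 'q' parameter are hypothetical placeholders; real engines each expose their own interfaces and result markup.

import re
from urllib.parse import urlencode
from urllib.request import urlopen

ENGINES = [
    "https://engine-one.example/search",   # hypothetical endpoints
    "https://engine-two.example/search",
]

def meta_search(term, per_engine=10):
    seen, merged = set(), []
    for base in ENGINES:
        url = base + "?" + urlencode({"q": term})
        try:
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # an unreachable engine simply contributes nothing
        # Crude link extraction; a real meta-engine would parse each
        # engine's result markup properly.
        for link in re.findall(r'href="(https?://[^"]+)"', page)[:per_engine]:
            if link not in seen:           # filter duplicates across engines
                seen.add(link)
                merged.append(link)
    return merged

print(meta_search("deep web directories"))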

Here's hoping these open up whole new libraries of information for you!