Invisible web? No, we are not talking about the latest spidey invention. The Invisible Web, a term coined by Mike Bergman, is the part of the WWW (World Wide Web) that has not been indexed by popular search engines like Google, Yahoo and Bing. Simply put, there are pages and websites which exist on the internet but do not show up on search engines! Yes, Google-nuts out there, you got it right. Google simply CANNOT 'see' the invisible web.

How many of us have revered Google as the great Oracle or the all-seeing eye that scoops up every byte of information on the web? The truth, however, is in striking contrast with the common perception. The popular search engines can only search a fraction of the data available on the internet. This fraction is termed the "searchable web" or "surface web".

Here are the cold, hard facts. Google indexes about 8 billion pages on the web. The surface web comprises about 250 billion pages, which means Google is able to index only about 3 percent of the surface web. Find your faith in the Google-God wavering? Well then, brace for impact! The 'Invisible Web' or 'Deep Web', which is veiled from the eye, is estimated to be about 500 times larger than the surface web and is an ever-expanding repository of information. As per Wikipedia estimates, the surface web consists of 167 terabytes, which pales in comparison to the invisible web's humongous 91,000 terabytes. The web is like an ocean of information: search engines touch only its surface, while a multitude of information lies hidden in the chasms, unruffled and untapped.
The power of divination through which search engines find pages is not so divine after all. They merely use robot 'spiders' which crawl the web, indexing information and jumping from one hyperlink to the next, thus covering various web pages. Although these crawlers are able to index a large number of relevant web pages, there are places which are not accessible to search-engine spiders. Think of a webpage which is not linked to any other page on the web, i.e., a page that has no backlinks or inlinks. Traditional search methods cannot index such pages, making them invisible. There are many private webpages on the internet which require registration or a password for data retrieval. Search spiders can reach such doors, but they cannot enter as they do not have the required key or password. Also, some webpage creators do not want their pages crawling with search spiders, so they add 'meta-tags' to their pages that cause crawlers to avoid them. There are many technical hurdles which these crawlers cannot leap: scripted content, pages using the Gopher and FTP protocols (Google uses the HTTP protocol), dynamic pages which are returned in response to or accessed through forms, and contextual webpages, to name a few. In a nutshell, there remain huge chunks of information that these spiders cannot wrap their software-y arms around (despite having eight of them), and thus the term Invisible Web or Cloaked Web lingers.
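The crawling behaviour described above can be sketched in a few lines of code. The snippet below is a toy illustration, not a real search engine: it walks a tiny, made-up in-memory "web" (the URLs are hypothetical), follows hyperlinks breadth-first, and skips indexing any page whose `robots` meta-tag says `noindex`. Notice that the orphan page, which nothing links to, is never even visited, which is exactly why unlinked pages end up in the invisible web.

```python
import re
from collections import deque

# A toy "web": each URL maps to its HTML source. All URLs here are
# hypothetical, for illustration only.
WEB = {
    "http://example.com/home":   '<a href="http://example.com/about">About</a>',
    "http://example.com/about":  '<meta name="robots" content="noindex">Private bio',
    "http://example.com/orphan": "No page links here, so no spider ever arrives.",
}

def crawl(seed):
    """Mimic a search-engine spider: follow hyperlinks outward from the
    seed page, indexing each page unless its robots meta-tag forbids it."""
    index = []
    queue, seen = deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        html = WEB.get(url, "")
        # Respect the robots meta-tag: a 'noindex' directive keeps the
        # page out of the index even though the spider reached it.
        if "noindex" not in html:
            index.append(url)
        # Jump from one hyperlink to the next, as real crawlers do.
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

print(crawl("http://example.com/home"))
# The orphan page never appears: with no inlinks, it stays invisible.
```

Running it, only the home page makes the index: the about page is reached but excluded by its meta-tag, and the orphan page is never reached at all.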
The Invisible Web is not only gigantic in size but also surpasses the surface web in the quality of its content. According to expert claims, it contains 550 billion individual documents which cater to informational, research and market demands. It has more focused, higher-quality content than the surface web. Moreover, the information revealed in the Invisible Web is free from commercial motives. Non-profit organisations and research entities, which do not enjoy the same levels of advertisement as commercial ventures, are often sidelined in traditional search. Most of the original and authoritative information remains cloaked in the form of the Deep Web. The Invisible Web is an untapped gold mine of information. The methods to unveil the hidden unfold in the very next lines.
Is it really possible to delve deep into this ocean? Can we expand our horizons beyond the traditional search engines? Can we 'SEE' what Google cannot?
Yes, we can. There are ways of peeping into the darkest corners of the web. The best way for competitive, professional data extraction from the web is to make liberal use of directories and databases. A directory contains a large collection of links that enables browsing by choice of subject area; its data is manually evaluated and annotated in a systematic manner, which ensures quality over quantity. Databases should be used in a complementary manner alongside search engines. Also, when you are looking for dynamically changing information, it is good to use search engines that can extract data from the invisible web. There are search engines which can trawl the spaces of the deep web and fish out valuable data.