The Permission-Based Indexing Solution
Submission, Website Ownership Verification, and Crawl

The three steps outlined below define a new methodology for search engine inclusion, hereafter called permission-based indexing.

It should be noted that many of the items listed within these steps assume that the webmaster will provide an email address, just as webmasters must do to access the existing Google Sitemaps service. An email address would not be required to have a site indexed; it would only be required for webmasters who wish to take advantage of the various features, current and potential, that permission-based indexing makes available.

Submission

The submission step itself would change slightly. A webmaster who wishes to have a site included in the Google index would be able to submit a site to the engine via a standard form.

However, the webmaster would be able to submit only the home page of the site; once ownership has been verified, the rest of the site would be crawled as it normally would be.
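
As a minimal sketch of the narrower submission step, the check below accepts only a root URL; the helper name and the exact validation rules are illustrative assumptions, not part of the proposal itself.

    from urllib.parse import urlparse

    def is_home_page(url: str) -> bool:
        """Hypothetical submission-form check: accept only a site's home page."""
        parsed = urlparse(url)
        return (
            parsed.scheme in ("http", "https")
            and parsed.path in ("", "/")
            and not parsed.query
            and not parsed.fragment
        )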

Site Ownership Verification

A Sitemap is not required in order to view Google statistics and information about a website from within the Google Sitemaps website; however, ownership verification is required. The screen capture linked below illustrates how a webmaster can select one of two ways to verify that he/she is at least partly responsible for the content of a website.

The webmaster can either upload a special HTML file with a Google-specified random name or add a special META tag with a Google-specified random value to the home page. The randomness allows Google to quickly determine whether the webmaster does indeed have some access to the site in question, and therefore the ability to indicate which pages of the website should and should not be indexed.
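
A minimal sketch of how the engine's side of this check might work is shown below. The file-name pattern and the META tag name "verify-token" are illustrative assumptions; they are not Google's actual verification format.

    import urllib.request
    from html.parser import HTMLParser

    class MetaTagFinder(HTMLParser):
        """Collects the content of any <meta name="verify-token" ...> tag."""
        def __init__(self):
            super().__init__()
            self.values = []

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                if attrs.get("name") == "verify-token":
                    self.values.append(attrs.get("content", ""))

    def is_verified(site_root: str, token: str) -> bool:
        """Return True if either verification method succeeds for this site."""
        # Method 1: a file with the random name exists at the site root.
        try:
            with urllib.request.urlopen(f"{site_root}/{token}.html", timeout=10) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        # Method 2: the home page carries a META tag with the random value.
        try:
            with urllib.request.urlopen(site_root, timeout=10) as resp:
                parser = MetaTagFinder()
                parser.feed(resp.read().decode("utf-8", errors="replace"))
                return token in parser.values
        except OSError:
            return False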

View the screen capture. This screen capture is 937 x 678 pixels, so it is best viewed at 1024 x 768 resolution.

In the second screen capture linked below, I have elected to upload an HTML file (since the searchenginefriendlylayouts.com website layout is contained within its own file, a META tag did not suit my particular purposes). Google has created a random filename, which I have highlighted in yellow in the screen capture, and I have created that file and uploaded it to my server.

This file can contain any information but it must be an HTML file. In addition, the webmaster must indicate by checking both boxes, as I have done, that the file has been created and uploaded. These boxes are both unchecked by default.

View the screen capture. This screen capture is 937 x 678 pixels, so it is best viewed at 1024 x 768 resolution.

Once verification is complete, the webmaster can submit a Sitemap, view statistics about the queries used to find and visit the site, and receive diagnostic reports about the indexing of the site and its related Sitemap(s).

The verification technology outlined above, minus the Sitemap itself, could be used by Google as a means of obtaining permission from webmasters to crawl, index, or deindex websites. Webmasters would be required to verify whether they want their sites included in the index, and could indicate that they want their sites crawled without being required to submit a Sitemap.
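
The sketch below illustrates the kind of permission record this step implies: verification establishes ownership, and the webmaster's stated preference then gates crawling. The field and class names are assumptions introduced for illustration only.

    from dataclasses import dataclass
    from enum import Enum

    class Preference(Enum):
        INCLUDE = "include"      # crawl and index the site
        EXCLUDE = "exclude"      # deindex and do not recrawl

    @dataclass
    class PermissionRecord:
        site_root: str
        verified: bool
        preference: Preference

    def may_crawl(record: PermissionRecord) -> bool:
        # Under permission-based indexing, only verified sites whose owners
        # have opted in are eligible for crawling; no Sitemap is required.
        return record.verified and record.preference is Preference.INCLUDE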

Crawling

Crawling of internal hyperlinks would take place as per the present standard until all appropriate pages that comprise a site have been crawled and indexed. Only pages reached via a direct or indirect hyperlink would be crawled and/or indexed; pages that are "orphaned" and have no inbound links pointing to them would not be indexed at all.
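
A minimal sketch of this crawl is shown below, assuming a hypothetical helper fetch_internal_links(url) that returns the same-site hyperlinks found on a page. Because the crawl starts from the verified home page and only follows links, orphaned pages (those with no inbound link) are never reached and therefore never indexed.

    from collections import deque

    def crawl_site(home_page: str, fetch_internal_links) -> set[str]:
        """Breadth-first crawl of internal links, starting at the home page."""
        indexed: set[str] = set()
        queue = deque([home_page])
        while queue:
            url = queue.popleft()
            if url in indexed:
                continue
            indexed.add(url)                      # page is reachable: index it
            for link in fetch_internal_links(url):
                if link not in indexed:
                    queue.append(link)
        return indexed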

Should a webmaster choose to do so, the Robots Exclusion Protocol would allow him/her to prevent certain portions of the site from being indexed, regardless of internal hyperlink structure. Assigning a rel="nofollow" attribute to each appropriate hyperlink would help ensure that the Googlebot does not follow any links that are not intended to be crawled.
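
A sketch of how these two opt-out mechanisms might be honoured during the crawl follows, using the standard library's robotparser for robots.txt and skipping anchors flagged rel="nofollow". The anchor-extraction step and the user-agent string are assumptions; this is not Googlebot's actual implementation.

    import urllib.robotparser
    from urllib.parse import urljoin

    def allowed_links(page_url: str, anchors, robots_url: str, user_agent: str = "Googlebot"):
        """anchors: iterable of (href, rel) pairs already extracted from the page."""
        rp = urllib.robotparser.RobotFileParser(robots_url)
        rp.read()
        for href, rel in anchors:
            target = urljoin(page_url, href)
            if rel and "nofollow" in rel.split():
                continue                  # webmaster asked that this link not be followed
            if not rp.can_fetch(user_agent, target):
                continue                  # excluded by the Robots Exclusion Protocol
            yield target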

Crawling of external hyperlinks would depend on the destination of each link. External hyperlinks pointing to pages within the natural structure of websites whose owners have verified ownership and allowed indexing would be crawled and reindexed accordingly. External links for which no verification exists, or for which a deindexing request is on file, would not be crawled or indexed.
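
The check below sketches this external-link rule. The opted_in_sites lookup is a hypothetical set of site roots whose owners have verified ownership and opted in; sites with no verification, or with a deindexing request, would simply be absent from it.

    from urllib.parse import urlparse

    def may_follow_external(link: str, opted_in_sites: set[str]) -> bool:
        """Follow an outbound link only if the destination site has opted in."""
        parsed = urlparse(link)
        site_root = f"{parsed.scheme}://{parsed.netloc}"
        # Absent from the opted-in set means no verification (or a deindexing
        # request) is on file, so the link is neither crawled nor indexed.
        return site_root in opted_in_sites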

Reindexing and Orphaned Hyperlinks

Since websites do change over time, webmasters may elect to remove and/or relocate individual web pages within a site, or may choose to remove any hyperlinks to said pages. This creates a scenario where web pages end up orphaned, as they have no inbound links pointing to them.

When this scenario occurs, it is proposed that these pages be moved out of the main index and into a special Orphaned index for a period of 30-60 days. Webmasters would be able to use the site:(domain) command to determine whether there are any orphaned pages, and reassign inbound links to them as necessary. If hyperlinks are not restored to the orphaned pages within the 30-60 day period, the pages would be removed from the index entirely upon the next crawl.
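
A sketch of how the engine might maintain such an Orphaned index is shown below, assuming a 45-day retention window (any value in the suggested 30-60 day range would do). The function and parameter names are illustrative.

    from datetime import datetime, timedelta

    ORPHAN_WINDOW = timedelta(days=45)   # illustrative; proposal suggests 30-60 days

    def update_orphans(orphaned: dict, reachable: set, newly_orphaned: set, now: datetime) -> list:
        """orphaned maps url -> datetime first seen orphaned. Returns URLs dropped."""
        removed = []
        for url in newly_orphaned:
            orphaned.setdefault(url, now)          # start the retention clock
        for url in list(orphaned):
            if url in reachable:
                del orphaned[url]                  # links restored: back to the main index
            elif now - orphaned[url] > ORPHAN_WINDOW:
                del orphaned[url]                  # window expired: drop on this crawl
                removed.append(url)
        return removed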

Subdomains and Subdirectories

Since many sites are contained within subdomains and subdirectories, any such site that receives no hyperlink from its parent domain would be considered a separate site and, for the purposes of submission, would be treated independently of the parent domain. This would ensure that sites hosted on free servers such as Geocities and Angelfire can be indexed as well.
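
The small check below sketches that boundary rule. The parent_links argument is a hypothetical set of URLs the parent domain links to, gathered during its own crawl; it is not something the proposal specifies.

    def is_independent_site(candidate_root: str, parent_links: set) -> bool:
        """Treat a subdomain or subdirectory as a separate site if the parent
        domain never links into it (candidate_root is illustrative, e.g. a
        Geocities member directory)."""
        return not any(link.startswith(candidate_root) for link in parent_links)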

Supplemental Index

The supplemental index would only be used in cases where the crawling of an individual page could not be completed due to a technical error. Supplemental results would indicate a problem of some sort to the webmaster, such as an excessively long URL or some other technical difficulty. Pages that are part of the ordinary site structure and that can be crawled without issue would be crawled and fully indexed as per the accepted standard.
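
A sketch of this narrower role for the supplemental index follows. The 2,000-character URL limit is an assumed threshold for "excessively long URL", not a documented figure, and the classification interface is illustrative.

    MAX_URL_LENGTH = 2000   # assumed threshold for an "excessively long URL"

    def classify_page(url: str, fetch_error=None):
        """Return ("main", None) or ("supplemental", reason); fetch_error is an
        error message from the crawl attempt, or None if the fetch succeeded."""
        if len(url) > MAX_URL_LENGTH:
            return "supplemental", "excessively long URL"
        if fetch_error is not None:
            return "supplemental", f"technical difficulty: {fetch_error}"
        return "main", None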

Inbound/Outbound Links and the Ranking Algorithm

Since inbound and outbound links to sites that wish to remain in the index would still be traversable, they could also remain in use as a factor in ranking sites when presenting results. The difference would pertain to links pointing at sites that are not included in the index: the ranking algorithm would no longer take those links into account.
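
The sketch below shows one way the link graph could be filtered before any ranking computation runs: links whose destination site is not in the index simply drop out. The site_of helper and the opted_in_sites lookup are assumptions carried over from the earlier sketches.

    from urllib.parse import urlparse

    def site_of(url: str) -> str:
        parsed = urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}"

    def rankable_links(links, opted_in_sites: set):
        """links: iterable of (source_url, target_url) pairs from the crawl."""
        for source, target in links:
            if site_of(target) in opted_in_sites:
                yield source, target   # still counts toward ranking
            # links to non-indexed sites are ignored by the ranking algorithm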

Grace Period

Since the idea of permission-based indexing and/or deindexing represents a significant change in the way search engines gather and index content, a grace period would be necessary to minimize the reduction in size of a search engine database and corresponding reduction in results. During this grace period, the existing index of crawled websites would remain intact. However, webmasters of new websites would be required to submit their sites using the new permission-based indexing methodology. In addition, webmasters of existing websites would also be required to resubmit their sites using the permission-based indexing method in order to keep their sites in the index.

It is suggested that this grace period be no less than 90 days and no greater than 180 days, in order to allow webmasters sufficient time to adapt to the new method.

Furthermore, it is suggested that, if permission-based indexing is to be enabled, it be enabled as soon as possible. The larger a search engine's index grows, the more difficult it will be to allow webmasters sufficient time to adapt to the permission-based indexing methodology.

06/12/2006