Permission-Based Indexing

Questions and Clarifications

A number of questions and comments have been raised about the concept of permission-based indexing. This page has been created to provide further insight into the concept, clarify how it works, and explain how it can help.

Many of the more intelligent suggestions and comments came from a discussion on the ihelpyou forums, a very good resource for anyone who wants to build websites that are both user-friendly and search-engine-friendly.

The other major source of comments, surprisingly, was offline: various discussions I have had with end users outside of any forum.

With that in mind, I will endeavour to address both the online and offline comments in this section. Since I received a number of different comments, I will focus on the questions and comments that were put to me more than once.

Questions/Comments

On June 15, 2006, Paul asked:

Wouldn’t it be easier for search engines to be spammed with this method, since it only requires submission?

Since only the home page can be submitted, no. It would actually be more difficult to spam an engine, since techniques such as doorway pages would no longer be effective. Every page of a website that is considered for indexing would have to be linked, either directly or indirectly, from the home page.

Permission-based indexing would also eliminate the scenario in which a webmaster sets up one domain and uses it to promote links to individual scraper pages on a second domain, pages that are not linked from the second domain's home page.

There would still be ways to spam the system, even with permission-based indexing in place. However, they would be much more difficult to use, since each offending page would require the aforementioned link from the home page.

Furthermore, if multiple search engines implemented permission-based indexing, another degree of difficulty would be created: pages that do not merit inclusion would have to become part of the site itself in some form, and would then have to pass through each engine's spam-detection systems.

In other words, permission-based indexing reduces search engine spam by relying only on those links that come from the home page of a site. It’s still possible to submit spam pages, but it becomes much more difficult.
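
Purely as an illustration of the crawling rule described above, here is a minimal sketch in Python. It is my own assumption of how such a crawl might work, not any engine's actual implementation: the submitted home page is the only accepted starting point, and only pages reachable from it on the same host are ever considered for indexing.

from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_from_home_page(home_url, max_pages=500):
    """Breadth-first crawl that never leaves the submitted site."""
    host = urlparse(home_url).netloc
    queue = [home_url]
    indexed = set()
    while queue and len(indexed) < max_pages:
        url = queue.pop(0)
        if url in indexed:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable pages are simply skipped
        indexed.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Only follow links that stay on the submitted host.
            if urlparse(absolute).netloc == host and absolute not in indexed:
                queue.append(absolute)
    return indexed

Calling crawl_from_home_page("http://www.example.com/") would return the set of pages an engine operating under this model would consider; a doorway page that is not linked, directly or indirectly, from the home page never enters that set, no matter how it is submitted.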

On June 15, 2006, Quadrille asked:

To my mind, most of what you seem to be proposing is already ‘on the table’ – site map verification is a major step to a ‘contract’ between Google and webmasters; I (and others) have long suggested this is a step on a continuum towards a formal ‘opt in’ system, where listings will depend 100% on adhering to guidelines.

I do not share your problems with Sitemaps – There’s no problem with webmasters submitting any pages they like – for now; as Google develops the technology to recognise scraped pages more reliably (for example) so they can tighten the rules. Webmasters can withdraw those pages, or be suspended.

Both of the points you have mentioned are fair and valid, and they merit further clarification.

Point 1: While many of these steps are ‘on the table’ with Google, they do not apply to the other two major engines (MSN and Yahoo!), nor to any of the second-tier/smaller engines, such as Gigablast. While the article focuses primarily on Google, the issues discussed and the solution provided are just as relevant to any other engine.

As time goes on, the number of pages that do not merit inclusion in a search engine's index, for reasons both ethical and unethical, will continue to rise, as will the number of different ways in which those pages are presented. The same issues that Google faces now will eventually confront other engines as they grow in size.

This goes back to the third purpose of the article: to provide other engines with the exact same solution. While the article focuses primarily on Google technologies, the idea of permission-based indexing could be applied to any other engine as well.

Point 2: I agree completely that, as webmasters use Sitemaps to submit pages that do not merit inclusion, Google's technology for recognizing these types of pages will improve, and webmasters will need to take other measures.

The main issue with Sitemaps from Google's standpoint is that they allow webmasters to submit large numbers of pages that do not fall within the framework of their primary site, since the Sitemap itself is a file that requires no inbound hyperlink. These pages could contain a variety of manipulation attempts (such as doorway and scraper pages), coded in a variety of ways, allowing a search engine optimization specialist with the wrong intentions to experiment and see which pages managed to get indexed. Detecting and removing the resulting attempts would take up time and resources that could be better spent indexing sites that deserve inclusion.
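
To make the contrast concrete, here is a short sketch, assuming a sitemap file in the standard sitemaps.org <urlset>/<url>/<loc> format; the namespace constant and function name are illustrative only. It flags URLs that are listed in the sitemap but not reachable from the home page, which are exactly the submissions that permission-based indexing would never consider in the first place.

import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org sitemap files (illustrative assumption).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def orphan_sitemap_urls(sitemap_xml, reachable_urls):
    """Return sitemap URLs that are not linked, directly or indirectly, from the home page."""
    root = ET.fromstring(sitemap_xml)
    listed = {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}
    return listed - set(reachable_urls)

Here, reachable_urls could be the set produced by a crawl like the one sketched earlier; anything left over is a page that exists in the engine's view only because the external file mentions it.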

The other major issue with Sitemaps is that the XML files are a Google-specific technology, whereas permission-based indexing could be applied universally. While some form of site verification would still be necessary, which does mean extra files, there would be no need for an XML sitemap in order to crawl and index a site that is built well enough, and is spider-friendly enough, for its pages to be reached by following links from the home page.

To summarize: permission-based indexing is a more universal solution, requires fewer resources, and relies on the design of the site itself rather than on pages submitted from an external file. The only external files necessary are site verification files, and an HTML meta tag could be used instead.
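
Since verification comes up here, a rough sketch of what such a meta-tag check might look like follows. The tag name "permission-based-indexing" and the token scheme are purely hypothetical assumptions for illustration; they are not an existing standard or any engine's actual verification mechanism.

from html.parser import HTMLParser


class MetaTokenFinder(HTMLParser):
    """Finds a <meta> tag whose name matches the (hypothetical) verification tag."""
    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if attr.get("name") == self.tag_name:
                self.token = attr.get("content")


def site_is_verified(home_page_html, expected_token):
    """True if the home page carries the token issued to the site owner."""
    finder = MetaTokenFinder("permission-based-indexing")  # hypothetical tag name
    finder.feed(home_page_html)
    return finder.token == expected_token

The engine would issue a token to the site owner, the owner would place it in the home page's head section (or in a verification file), and the engine would check for it before honouring the submission.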

Thank you for taking the time to share two intelligent points and to provide some insight of your own. It is much appreciated!

Source: http://www.ihelpyou.com/forums/showthread.php?s=&threadid=22426

On June 15, 2006, Dave Hawley asked:

Adam, it seems the main assumption (not only by you) is that Google is making a conscious choice NOT to index some sites/pages. I don’t believe that to be true, as it flies in the face of their mission as stated on their site.

I’m thinking (assuming) that they simply cannot (at this point in time) index ALL pages out there. The other big 2 don’t seem to be able to either. Perhaps Google, due to their ability to find and reach so many pages, had to come up with criteria to decide which ones to index.

I do tend to agree that there is simply too much content for Google, or any other engine, to index every page of every site on the Internet. That is a very good and valid point.

However, the basis for the assumption (if that is the interpretation) is a statement made by Google engineer Matt Cutts on his blog regarding the reinclusion of a health site. The issue Cutts specifically cited was that only 6 links pointed to the entire domain.

While I do believe that a webmaster who has developed a site of any value should have no trouble asking for and acquiring inbound links, the reality is that a large percentage of webmasters do not know that this is expected of them. They expect, and justifiably so, to be able to submit their sites to a search engine and have them indexed without any additional effort. For example, a mom-and-pop store owner with a 5-page site may simply want to submit it to Google and be found under the company's own unique name.

If the implication of what Matt discussed is true, and I tend to believe it is, then that mom-and-pop store owner would find Google's submission form, submit the site, and presumably never be indexed until sufficient inbound links had been acquired. Many webmasters would find themselves in this particular situation.

There are also situations, created by crawling, in which a site that may not wish to be included ends up being indexed anyway (see Inbound Link Issues for more information and scenarios).

Google’s choice to index/deindex sites doesn’t contradict their mission statement, either. To organize the world’s data does not necessarily mean that said data has to be presented fully in an easy-to-use manner. This would be the optimal scenario, but it’s not realistic.

The only requirement for Google to fulfill their mission statement is to crawl the pages they are made aware of and file them in at least one of their databases. They do not have to show every page to the public if they choose not to.

Selective display of data is not necessarily a negative thing, either. There are a number of scenarios where information stored by a search engine in one or more of their databases would not need to be displayed (again, I refer to the inbound link issues above).

This leads me back to one of the major reasons behind permission-based indexing: it gives the webmasters who own the sites direct control over which content should and should not be indexed. It avoids accidental indexing caused by organic hyperlinks from well-meaning webmasters of other sites, and it avoids the non-indexing of sites operated by webmasters who, for various reasons, do not wish to acquire backlinks.
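
As a final illustration of that control, and assuming the engine keeps a simple registry of home pages that owners have explicitly submitted (the registry structure and function name below are my own), the opt-in check could be as small as this:

from urllib.parse import urlparse

# Hypothetical registry of sites whose owners have submitted their home pages.
submitted_home_pages = {
    "www.example.com": "http://www.example.com/",
}


def should_consider(discovered_url):
    """A discovered link only matters if the target host has opted in."""
    return urlparse(discovered_url).netloc in submitted_home_pages

A link found on another site that points at a host that has not opted in is simply ignored, so accidental indexing cannot happen; a site with no inbound links at all is still crawled the moment its owner submits the home page.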

Source: http://www.ihelpyou.com/forums/showthread.php?s=&threadid=22426&perpage=20&pagenumber=2

Created: 06/14/2006
Last Modified: 06/14/2006