Duplicate content and SEO
Duplicate content is a real issue for search engines. Google announced in July 2008 that it knew of 1,000 billion URLs; can you imagine what that number has become in 2020?
Google does not index every URL it discovers, simply because a lot of content is of no interest (empty pages) or duplicated.
When you work with such a large volume of data and have to deliver relevant results to the whole world in less than half a second, you understand that hunting down waste is a necessity.
How do the search engines deal with duplicate content?
Duplicate content is a waste of time, resources and relevance, and therefore a waste of money for search engines. Moreover, the Internet keeps growing, and ever faster. In order to survive and not be overwhelmed, search engines have to make choices and leave behind the content they consider duplicated.
They can deal with duplicate content in several ways; it can:
- Be de-indexed
- Be crawled less often
- Be removed from the rankings
The way they deal with it depends on the search engine and other factors, too.
Given the consequences, you would expect search engines to make sure they have identified the original copy of the content before penalizing you, wouldn't you? In reality this is not quite the case, since their algorithms, to this day, seem unable to handle the problem reliably. Here are some criteria they take into account (or should take into account):
- Similarity of the content with another URL,
- Popularity of the page,
- Authority of the website,
- Presence of a link pointing to the source,
- Date of publication,
- Date of first indexing.
General penalty for duplicate content
Beyond the penalties that a URL may suffer, if a website is assigned a high rate of duplicate content, the whole domain can be penalized.
The different types of duplicate content
Two cases of duplicate content can be distinguished:
- Self-generated duplicate content: when a website duplicates its own pages on its own domain.
- External duplicate content: when your content appears on another website.
How to avoid duplicate content on your website
Before going to war against duplicate content published by unscrupulous webmasters, make sure that your own website does not serve identical content under different URLs. Here are the most common cases:
- Content accessible both with and without www
- Different internal links pointing to the same content
- Pages with very thin content: for example, only one line of original text
- Inbound links carrying tracking parameters, for example
- Session identifiers added to URLs when bots crawl the site
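Several of the cases above (www vs. non-www, tracking parameters, session IDs) are really the same document reachable under many URL spellings. As a minimal sketch, assuming a hypothetical site and an illustrative, non-exhaustive list of tracking parameters, here is how such variants can be folded onto a single form:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that do not change the content (illustrative list,
# not exhaustive): tracking tags and session identifiers.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}

def normalize_url(url: str) -> str:
    """Map the many URL variants of one document to a single form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    # Fold the www/non-www variants onto one host (here: without www).
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Drop tracking and session parameters, keep the rest in a stable order.
    kept = sorted((k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS)
    return urlunsplit((scheme, netloc, path.rstrip("/") or "/", urlencode(kept), ""))

print(normalize_url("https://www.example.com/page/?utm_source=news&id=3"))
# -> https://example.com/page?id=3
```

A crawler-side normalization like this is only a complement: the real fix is to make the site itself emit a single URL per document, as described below.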
The golden rule is the following: a document must be available under one and only one URL. However, this is not always possible. In that case, you need to set up mechanisms so that robots index only one URL.
Here are several solutions:
- Use the robots.txt file
- Implement a robots noindex meta tag
- Deploy a 301 redirect
- Remove URLs through Google Search Console
- Add a rel="canonical" link tag
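As a sketch only (the domain and paths are hypothetical, and the redirect rules assume an Apache server with mod_rewrite), here is roughly what these mechanisms look like:

```
# robots.txt — keep bots out of a duplicate URL pattern
User-agent: *
Disallow: /print/

<!-- noindex meta tag, placed in the <head> of the duplicate page -->
<meta name="robots" content="noindex">

# 301 redirect from the non-www host to the www host (.htaccess)
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

<!-- canonical link tag, pointing robots to the preferred URL -->
<link rel="canonical" href="https://www.example.com/article">
```

Note that these tools are not equivalent: robots.txt prevents crawling but not indexing, noindex and the canonical tag act on indexing, and only the 301 redirect also transfers visitors to the chosen URL.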
Hunt for duplicate or stolen content
The use of your content on other websites can negatively impact your visibility in search engines. Sometimes the webmasters acting this way are honest and do not imagine the problems this can create. For others, stealing content is a real business. Nowadays, the word 'aggregator' can be used to hide such misdeeds...
The number of AdSense inserts is often a way to tell an honest reuser from an outright thief.
For some people, the mere existence of an RSS feed amounts to an authorization to steal content.
Google is an excellent tool for detecting plagiarism or any other unauthorized use of your content. Using quotation marks, search for a sentence taken from the heart of your article and check the results.
The website copyscape.com is also an excellent way to check the originality of a text.
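The quoted-sentence check above is easy to script. As a small sketch (the sentence is just an example), this builds the exact-phrase search URL by wrapping the sentence in quotation marks and URL-encoding it:

```python
from urllib.parse import quote_plus

def exact_phrase_search_url(sentence: str) -> str:
    """Build a Google search URL for an exact-phrase (quoted) query."""
    return "https://www.google.com/search?q=" + quote_plus(f'"{sentence}"')

print(exact_phrase_search_url("a sentence taken from the heart of your article"))
# -> https://www.google.com/search?q=%22a+sentence+taken+from+the+heart+of+your+article%22
```

Running such a query for a few sentences per article, at regular intervals, is a cheap way to spot copies early.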
What Google says about duplicate content
First of all, you should know that Google does not really speak of a penalty for duplicate content but rather of 'filters'. I personally have to admit that I don't see much difference in the end... Google also says it has efficient algorithms capable of identifying the original content, especially if the copy contains a link to the source. Their index proves that all of this is still far from perfect.
Moreover, the notion of the supplemental index has disappeared from the results pages.
Google also recommends not worrying too much about duplicate content. It's up to you...