Working correctly with duplicate pages. How to get rid of duplicate pages. Removing comment anchors #comment

The owner may not even suspect that some pages on his site have copies - and most often this is the case. The pages open, their content is fine, but if you pay attention to the URLs, you will notice that different addresses serve the same content. What does this mean? For live users, absolutely nothing, since they are interested in the information on the pages, but soulless search engines perceive this phenomenon quite differently: for them, these are completely different pages with the same content.

Are duplicate pages harmful?

So, while an ordinary user may not even notice the presence of duplicates on your site, search engines will detect them immediately. What reaction should you expect from them? Since search robots essentially see the copies as different pages, the content on them ceases to be unique, and this already has a negative impact on rankings.

The presence of duplicates also dilutes the link juice that the optimizer tried to concentrate on the landing page. Because of duplicates, that weight may end up on a completely different page than the one it was intended for. In other words, the effect of internal linking and external links can be greatly reduced.

In the vast majority of cases, the CMS is to blame for the appearance of duplicates: because of incorrect settings and a lack of proper attention from the optimizer, exact copies are generated. Many CMSs have this problem, for example Joomla. It is difficult to find a universal recipe for solving it, but you can try one of the plugins designed for deleting copies.

The appearance of unclear (fuzzy) duplicates, in which the content is not completely identical, is usually the webmaster's fault. Such pages are often found on online store sites, where product card pages differ only in a few sentences of description, while all the rest of the content, consisting of site-wide blocks and other elements, is the same.

Many experts argue that a small number of duplicates will not harm a site, but if they account for more than 40-50% of its pages, the resource may face serious difficulties during promotion. In any case, even if there are not many copies, it is worth dealing with them - that way you are guaranteed to avoid problems with duplicates.

Finding Copy Pages

There are several ways to find duplicate pages, but first you should check several search engines and see how they see your site - you just need to compare the number of pages in the index of each. This is quite simple to do without any additional tools: in Yandex or Google, enter site:yoursite.ru in the search bar and look at the number of results.

If, after such a simple check, the numbers differ greatly - by 10-20 times - then this, with some degree of probability, may indicate the presence of duplicates in one of the indexes. Copy pages may not be to blame for this difference, but it nevertheless calls for a further, more thorough search. If the site is small, you can manually count the number of real pages and then compare it with the figures from the search engines.

You can also spot duplicates by looking at the URLs in the search results. If the site uses human-readable (SEF) URLs, then pages with addresses containing incomprehensible characters, like "index.php?s=0f6b2903d", will immediately stand out from the general list.

Another way to detect duplicates using search engines is to search by text fragments. The procedure for such a check is simple: enter a fragment of 10-15 words from each page into the search bar and analyze the result. If there are two or more pages in the search results, copies exist; if there is only one result, then the page has no duplicates and there is nothing to worry about.

Clearly, if the site consists of a large number of pages, such a check can turn into an impossible task for the optimizer. To minimize the time spent, you can use special programs. One such tool, probably familiar to experienced professionals, is Xenu's Link Sleuth.

To check a site, open a new project by selecting "Check URL" from the "File" menu, enter the address and click "OK". After this, the program will begin processing all of the site's URLs. When the check is complete, export the received data to any convenient editor and start looking for duplicates.

In addition to the above methods, the Yandex.Webmaster and Google Webmaster Tools panels have tools for checking page indexing that can be used to search for duplicates.

Methods for solving the problem

After all duplicates have been found, they will need to be eliminated. This can also be done in several ways, but each specific case requires its own method, and it is possible that you will have to use all of them.

Copy pages can be deleted manually, but this method is most likely only suitable for those duplicates that were created manually due to the negligence of the webmaster.

A 301 redirect is great for merging copy pages whose URLs differ only in the presence or absence of www.

The canonical tag can be used to solve the problem of unclear copies - for example, for product categories in an online store that have duplicates differing only in how the products are sorted. Canonical is also suitable for printable versions of pages and other similar cases. It is applied quite simply: the rel="canonical" attribute is specified on all copies (but not on the main, most relevant page), with the href containing the address of that main page. The code should look something like this: <link rel="canonical" href="http://yoursite.ru/stranica-kopiya"/>, and it must be placed within the head tag.
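
For illustration, here is a minimal sketch of the head section of a duplicate page; the URLs are placeholders, not addresses from a real site:

<head>
<title>Product catalog - sorted by price</title>
<!-- this copy points to the main (canonical) version of the category -->
<link rel="canonical" href="http://yoursite.ru/catalog/"/>
</head>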

Setting up the robots.txt file can help in the fight against duplicates. The Disallow directive will block access to duplicates for search robots. You can read more about the syntax of this file in issue No. 64 of our newsletter.
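
As a rough example, directives of this kind keep robots away from typical duplicate URLs (the paths below are illustrative and depend on your CMS):

User-agent: *
Disallow: /*?        # block URLs containing query parameters
Disallow: /print/    # block printable versions of pages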

Conclusions

While users perceive duplicates as one page with different addresses, for spiders these are different pages with duplicate content. Copy pages are one of the most common pitfalls that trip up beginners. Their presence in large numbers on a promoted site is unacceptable, as they create serious obstacles to reaching the top of the search results.

Hi all! In the last article we touched on an important topic - searching for duplicate website pages. As the comments and several letters I received have shown, this topic is relevant. Duplicate content on our blogs, technical flaws in the CMS and defects in various templates do not give our resources complete freedom in the search engines. Therefore, we have to fight them seriously. In this article we will learn how to remove duplicate pages from any website; the examples in this guide will show how to get rid of them in a simple way. All that is required of us is to apply the knowledge gained and monitor the subsequent changes in the search engine indexes.

My story of fighting duplicates

Before we look at ways to eliminate duplicates, I will tell you my story of dealing with duplicates.

Two years ago (May 25, 2012) I received a training blog as part of an SEO specialist course. It was given to me so that I could practice the knowledge acquired during my studies. As a result, in two months of practice I managed to produce a couple of pages, a dozen posts, a bunch of tags and a whole carload of duplicates. Over the next six months, when the training blog became my personal website, more duplicates were added to this collection in the Google index. This happened because of replytocom, as the number of comments kept growing. In the Yandex database, however, the number of indexed pages grew gradually.

At the beginning of 2013, I noticed a distinct drop in my blog's positions in Google. Then I started wondering why this was happening. In the end, I got to the point where I discovered a large number of duplicates in this search engine. Of course, I began to look for ways to eliminate them. But my search for information led nowhere - I did not find any sensible guides on removing duplicate pages on the Internet. I did, however, come across a note on one blog about how you can remove duplicates from the index using the robots.txt file.

First of all, I wrote a bunch of prohibiting directives for Yandex and Google to forbid the crawling of certain duplicate pages. Then, in the middle of summer 2013, I used a method for removing duplicates from the Google index (you will learn about it in this article). By that time, this search engine's index had accumulated more than 6,000 duplicates! And that is with only five pages and a little over 120 posts on my blog...

After I implemented my method of removing duplicates, their number began to decrease rapidly. Earlier this year, I used another option for removing duplicates to speed up the process (you will also learn about it). And now the number of my blog's pages in the Google index is approaching the ideal - today there are about 600 pages in the database. This is 10 times fewer than before!

How to remove duplicate pages - basic methods

There are several different ways to deal with duplicates. Some options allow you to prevent the appearance of new duplicates, while others help get rid of old ones. Of course, the best option is manual removal, but to implement it you need a good understanding of your website's CMS and of how search engine algorithms work. The other methods are also good and do not require specialized knowledge. We'll talk about them now.

301 redirect

This method is considered the most effective, but also the most demanding in terms of programming knowledge. The point is that the necessary rules are written in the .htaccess file (located in the root directory of the site), and if they are written with an error, you may not only fail to remove the duplicates but also take the entire site off the Internet altogether.

How does a 301 redirect solve the problem of duplicates? It is based on the idea of redirecting search robots from one page (the duplicate) to another (the original). That is, the robot arrives at a duplicate of some page and, thanks to the redirect, ends up on the original document we need. It then begins to study that document, leaving the duplicate outside its field of vision.

Over time, once all the variants of this redirect have been registered, the identical pages are glued together and the duplicates eventually drop out of the index. So this option does an excellent job of cleaning up previously indexed duplicate pages. If you decide to use this method, be sure to study the syntax for creating redirects before adding rules to the .htaccess file. For example, I recommend the guide on 301 redirects by Sasha Alaev.
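
A minimal sketch of such rules in the .htaccess file (Apache with mod_rewrite; the domain and paths are placeholders rather than rules from a real site):

RewriteEngine On
# example 1: send one specific duplicate URL to the original document
RewriteRule ^stranica-kopiya/?$ http://yoursite.ru/stranica-original/ [R=301,L]
# example 2: strip the ?replytocom parameter and redirect to the clean address
RewriteCond %{QUERY_STRING} ^replytocom= [NC]
RewriteRule (.*) /$1? [R=301,L]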

Creating a canonical page

This method is used to indicate to the search engine which document, out of the entire set of its duplicates, should be in the main index. That page is considered the original and participates in the search results.

To set this up, you need to add a piece of code containing the URL of the original document to all duplicate pages:
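
A minimal example of such code, where the address is simply a placeholder for the URL of your original document:

<link rel="canonical" href="http://yoursite.ru/original-post/"/>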

Of course, it's cumbersome to write all this manually, so there are various plugins for the job. For example, on my blog, which runs on the WordPress engine, I added this code using the "All in One SEO Pack" plugin. This is done very simply - just check the appropriate box in the plugin settings:

Unfortunately, the canonical page option does not remove duplicate pages that are already indexed; it only prevents new ones from appearing. To get rid of already indexed duplicates, you can use the following method.

Disallow directive in robots.txt

The robots.txt file is a set of instructions for search engines that tells them how to index our site. Without this file, a search robot can reach almost all the documents on our resource. But we do not need to give the search spider that much freedom - we do not want to see every page in the index. This is especially true for duplicates that appear because of flaws in the site template or our own mistakes.

That is why this file was created, in which various directives for prohibiting and allowing indexing by search engines are written. You can prevent the crawling of duplicate pages using the Disallow directive:
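
For instance, rules of roughly this kind (the paths are illustrative; the real ones depend on which duplicates your CMS produces):

User-agent: *
Disallow: /search/    # site search results
Disallow: /tag/       # tag archives that duplicate post announcements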

When creating a directive, you also need to word the prohibition correctly. If you make a mistake in a rule, the result may be the blocking of completely different pages: we can close access to pages we need and let the duplicates slip through. Still, mistakes here are not as dangerous as mistakes in the redirect rules in .htaccess.

A ban on indexing with Disallow applies to all robots, but not every search engine actually removes the prohibited pages from its index because of it. Yandex, for example, eventually drops duplicate pages blocked in robots.txt.

Google, however, will not clear its index of the junk the webmaster has pointed out. Moreover, the Disallow directive does not guarantee the block: if there are external links to pages prohibited in the file, they will eventually appear in the Google database.

Getting rid of duplicates indexed in Yandex and Google

So, now that we have sorted out the various methods, it is time for a step-by-step plan for removing duplicates in Yandex and Google. Before cleaning up, you need to find all the duplicate pages - I wrote about this in a previous article. You need to see clearly which elements of the page addresses give the duplicates away. For example, if these are pages with tree comments or pagination, we note the words "replytocom" and "page" in their addresses:

Note that in the case of replytocom you can match not this word but simply the question mark, since it is always present in the address of a tree-comment page. But then you need to remember that the URLs of original, normal pages must not contain the "?" character, otherwise those pages will also be blocked.
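
For example, blocking rules for these two kinds of duplicates could look roughly like this (the patterns are illustrative and should be adapted to your own URLs):

User-agent: *
Disallow: /*?replytocom=    # tree-comment duplicates
Disallow: */page/           # pagination duplicates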

Cleaning Yandex

To remove duplicates in Yandex, we create rules for blocking duplicates using the Disallow directive. To do this, we perform the following actions:

  1. Open the special tool "Robots.txt analysis" in Yandex Webmaster.
  2. We are adding new rules for blocking duplicate pages to the directives field.
  3. In the “URL list” field we enter examples of duplicate addresses for the new directives.
  4. Click the “Check” button and analyze the results.

If we did everything correctly, the tool will show that the new rules block these pages. In the special field "URL check results" we should see a red message about the ban:

After checking, we must add the newly created directives for duplicates to the real robots.txt file and re-upload it to our site's directory. Then we just need to wait until Yandex automatically sweeps our duplicates out of its index.

Cleaning Google

Things are not so simple with Google. Prohibiting directives in robots.txt do not remove duplicates from this search engine's index, so we will have to do everything ourselves. Fortunately, there is the excellent Google Webmaster service for this. Specifically, we are interested in its "URL Parameters" tool.

With this tool, Google lets the site owner tell the search engine how it should process certain parameters in the URL. We are interested in the ability to show Google the address parameters whose pages are duplicates - the very ones we want to remove from the index. Here is what we need to do (as an example, let's add a parameter to remove replytocom duplicates):

  1. Open the "URL Parameters" tool in the Google service from the "Crawling" menu section.
  2. Click the “Add parameter” button, fill out the form and save the new parameter:

As a result, we get a rule written for Google to revise its index for the presence of duplicate pages. We then specify similar parameters for the other duplicates we want to get rid of. For example, this is what part of my list of rules for Google looks like, telling it to adjust its index:

This concludes our work on cleaning Google, and my post has come to an end. I hope this article will bring you practical benefit and allow you to get rid of duplicate pages of your resources.

Sincerely, Your Maxim Dovzhenko

P.S. Friends, if you need to make a video on this topic, write to me in the comments to this article.

Duplicate pages on websites and blogs: where they come from and what problems they can create.
This is exactly what we will talk about in this post. We will try to understand this phenomenon and find ways to minimize the potential troubles that duplicate pages can bring us.

So let's continue.

What are duplicate pages?

Duplicate pages on any web resource mean that the same information is accessible at different addresses. Such pages are also called internal duplicates of the site.

If the texts on the pages are completely identical, such duplicates are called complete or exact. If the match is only partial, the duplicates are called incomplete or fuzzy.

Incomplete duplicates are category pages, product list pages and similar pages containing announcements of the site's materials.

Complete duplicate pages are printable versions, versions of pages with different extensions, archive pages, site search pages, pages with comments, and so on.

Sources of duplicate pages.

At the moment, most page duplicates are generated by modern CMSs - content management systems, also called website engines.

This applies to WordPress, Joomla, DLE and other popular CMSs. This phenomenon seriously burdens website optimizers and webmasters and causes them extra trouble.

In online stores, duplicates may appear when products are displayed sorted by various attributes (manufacturer, purpose, date of manufacture, price, and so on).

We must also remember the notorious WWW prefix and decide whether to use it in the domain name when creating, developing and promoting the site.

As you can see, the sources of duplicates can vary; I have listed only the main ones, but they are all well known to specialists.

The negative impact of duplicate pages.

Despite the fact that many people do not pay much attention to the appearance of duplicates, this phenomenon can create serious problems with website promotion.

A search engine may treat duplicates as spam and, as a result, seriously lower the positions of both these pages and the site as a whole.

When promoting a site with links, the following situation may arise: at some point the search engine decides that the most relevant page is a duplicate rather than the one you are promoting with links, and all your efforts and expenses will be in vain.

That said, some people try to use duplicates to pass weight to the pages that need it - the main page, for example, or any other.

Methods for dealing with duplicate pages

How can you avoid duplicates, or how can you eliminate the negative effects when they do appear?
And is it worth fighting this at all, or should you leave everything to the mercy of the search engines - let them figure it out themselves, since they are so smart.

Using robots.txt

Robots.txt is a file located in the root directory of our site that contains directives for search robots.

In these directives, we specify which pages on our site to index and which not. We can also specify the name of the main domain of the site and the file containing the site map.

The Disallow directive is used to prevent page indexing. This is what webmasters use to block duplicate pages from being indexed - and not only duplicates, but any other information not directly related to the content of the pages. For example:

Disallow: /search/   # block the site search pages
Disallow: /*?        # block pages containing the question mark "?"
Disallow: /20*       # block archive pages

Using the .htaccess file

The .htaccess file (a file with no name, only an extension) is also located in the root directory of the site. To combat duplicates, 301 redirects are configured in this file.
This method works well for preserving site performance when changing the site's CMS or its structure: the result is correct redirection without loss of link mass, with the weight of the page at the old address transferred to the page at the new address.
301 redirects are also used when determining the main domain of a site - with WWW or without WWW.
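
A minimal sketch of such a redirect (Apache with mod_rewrite; yoursite.ru is a placeholder domain):

RewriteEngine On
# 301-redirect the www version of the domain to the version without www
RewriteCond %{HTTP_HOST} ^www\.yoursite\.ru$ [NC]
RewriteRule ^(.*)$ http://yoursite.ru/$1 [R=301,L]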

Using the rel="canonical" attribute

Using this attribute, the webmaster points the search engine to the original source, that is, the page that should be indexed and take part in the search engines' ranking. Such a page is usually called canonical. The entry in the HTML code will look like this:
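
Roughly as follows, placed in the head section of each duplicate page (the address is a placeholder for the canonical page's URL):

<link rel="canonical" href="http://site.ru/canonical-page/"/>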

When using the WordPress CMS, this can be done in the settings of a useful plugin such as All in One SEO Pack.

Additional anti-duplicate measures for CMS WordPress

Having applied all of the above methods of dealing with duplicate pages on my blog, I still had the feeling that I had not done everything possible. So, after digging around on the Internet and consulting with professionals, I decided to do something more. I will describe it now.

I decided to eliminate the duplicates that are created on the blog when anchors are used - I talked about them in the article "HTML Anchors". On blogs running the WordPress CMS, anchors are formed when the "#more" tag is applied and when comments are used. The expediency of using them is quite debatable, but they clearly produce duplicates.
Now, here is how I fixed this problem.

Let's tackle the #more tag first.

I found the file where it is generated - or rather, I was told which one.
It is ../wp-includes/post-template.php
Then I found a program fragment:

$output .= apply_filters( 'the_content_more_link', ' <a href="' . get_permalink() . "#more-{$post->ID}\" class=\"more-link\">$more_link_text</a>", $more_link_text );

The following fragment (marked in red in the original post) was removed from it:

#more-{$post->ID}

And I ended up with a line like this.

$output .= apply_filters( 'the_content_more_link', ' <a href="' . get_permalink() . "\" class=\"more-link\">$more_link_text</a>", $more_link_text );

Removing comment anchors #comment

Now let's move on to the comments. This part I figured out myself.
The file this time was ../wp-includes/comment-template.php
I found the required piece of program code:

return apply_filters( 'get_comment_link', $link . '#comment-' . $comment->comment_ID, $comment, $args );
}

Similarly, the fragment shown below (marked in red in the original post) was removed - very neatly and carefully, down to the last dot.

. '#comment-' . $comment->comment_ID

In the end we get the following line of program code.

return apply_filters( 'get_comment_link', $link, $comment, $args );
}

Naturally, I did all this only after copying the files in question to my computer, so that in case of failure I could easily restore everything to its state before the changes.

As a result of these changes, when I click on the text "Read the rest of the entry...", I get a page with a canonical address, without the "#more-..." tail added to it. Likewise, when I click on a comment, I get a normal canonical address without the "#comment-..." suffix.

Thus, the number of duplicate pages on the site has decreased somewhat. But I cannot say right now what else our WordPress will generate. We will keep monitoring the problem.

And in conclusion, I bring to your attention a very good and educational video on this topic. I highly recommend watching it.

Health and success to all. Until next time.

Useful Materials:

Duplicate pages are one of the many reasons for lower positions in search results and even for falling under a filter. To prevent this, you need to keep them out of the search engine index.

You can detect duplicates on a site and get rid of them in different ways, but the seriousness of the problem is that duplicates are not always useless pages - they simply should not be in the index.

We will now solve this problem, but first we will find out what duplicates are and how they arise.

What are duplicate pages

Duplicate pages are copies of the content of the canonical (main) page, available at a different URL. It is important to note that they can be either complete or partial.

Full duplication is an exact copy with its own address; the difference may show up as a trailing slash, the www prefix, or parameter substitutions such as index.php?, page=1, page/1, etc.

Partial duplication means incomplete copying of content and is related to the site's structure: catalog announcements of articles, archives, sidebar content, pagination pages and other site-wide elements of the resource that also appear on the canonical page get indexed. This is inherent in most CMSs and online stores, where a catalog is an integral part of the structure.

We have already talked about the consequences of duplicates: link mass is spread between the duplicates, pages are substituted in the index, content loses its uniqueness, and so on.

How to find duplicate pages on a website

To find duplicates, you can use the following methods:

  • The Google search bar. Using the construction site:myblog.ru, where myblog.ru is your URL, you get the pages from the main index. To see the duplicates, go to the last page of the search results and click on the line "show hidden results";
  • "Advanced Search" command in Yandex. By indicating the address of your site in a special window and entering in quotation marks one of the sentences of the indexed article being checked, we should get only one result. If there are more of them, these are duplicates;
  • the webmaster panels of the search engines;
  • manually, by adding a slash, www, html, asp, php, or upper- and lower-case letters to the address bar. In all cases, a redirect should occur to the page with the main address;
  • special programs and services: Xenu, MegaIndex, etc.

Removing duplicate pages

There are also several ways to eliminate duplicates. Each of them has its own impact and consequences, so there is no point in declaring any one the most effective. Remember that physically deleting an indexed duplicate is not a solution: search engines will still remember it. Therefore, the best method of dealing with duplicates is to prevent their appearance through correct site settings.

Here are some of the ways to eliminate duplicates:

  • Setting up robots.txt. This will allow you to block certain pages from indexing. But while Yandex robots respect this file, Google picks up even pages closed by it, without really taking its recommendations into account. In addition, it is very difficult to remove already indexed duplicates using robots.txt;
  • A 301 redirect. It helps to merge duplicates with the canonical page. The method works, but it is not always appropriate: it cannot be used when duplicates should remain as independent pages that simply should not be indexed;
  • Returning a 404 error for indexed duplicates. The method is very good at removing them, but it will take some time before the effect appears.

When you cannot glue anything together or delete anything, but you do not want to lose page weight or be penalized by search engines, you can use the rel="canonical" attribute.

The rel="canonical" attribute in the fight against duplicates

Let me start with an example. An online store has two pages with product cards of identical content, but on one the products are arranged in alphabetical order and on the other by price. Both are needed, and a redirect is not acceptable. At the same time, for search engines this is a clear duplicate.

In this case, it is rational to use the link tag with the rel="canonical" attribute: it points to the canonical page, which gets indexed, while the non-primary page remains available to users.

This is done as follows: in the head block of the code of the duplicate pages, a link of the form <link rel="canonical" href="http://site.ru/osnovnaya-stranitsa"/> is specified, where http://site.ru/osnovnaya-stranitsa is the address of the canonical page.

With this approach, the user can freely visit any page of the site, but the robot, having read the rel="canonical" attribute in the code, will go off to index only the page whose address is specified in the link.

This attribute can also be useful for pages with pagination. In this case, a "Show all" page (a sort of single long listing) is created and treated as canonical, and the pagination pages send the robot to it via rel="canonical".
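
As an illustration, the head of a pagination page might then contain something like this (the URLs are placeholders for a hypothetical catalog):

<!-- on http://site.ru/catalog/page/2/ -->
<link rel="canonical" href="http://site.ru/catalog/view-all/"/>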

Thus, the choice of a method for combating page duplication depends on how the duplicates arise and on whether they need to remain on the site.

Duplicates of site pages, their impact on search engine optimization. Manual and automated methods for detecting and eliminating duplicate pages.

The influence of duplicates on website promotion

The presence of duplicates negatively affects a site's ranking. As stated above, search engines see the original page and its duplicate as two separate pages, so content duplicated on another page ceases to be unique. In addition, the link weight of the duplicated page is lost, since a link may pass weight not to the target page but to its duplicate. This applies to both internal linking and external links.

According to some webmasters, a small number of duplicate pages in general will not cause serious harm to the site, but if their number is close to 40-50% of the total site volume, serious difficulties in promotion are inevitable.

Reasons for duplicates

Most often, duplicates appear as a result of incorrect settings of individual CMSs. The engine's internal scripts begin to work incorrectly and generate copies of site pages.

The phenomenon of fuzzy duplicates is also known - pages whose content is only partially identical. Such duplicates arise, most often, through the fault of the webmaster himself. This phenomenon is typical for online stores, where product card pages are built according to the same template, and ultimately differ from each other by only a few lines of text.

Methods for finding duplicate pages

There are several ways to detect duplicate pages. You can turn to search engines: to do this in Google or Yandex, enter a command like “site:sitename.ru” into the search bar, where sitename.ru is the domain of your site. The search engine will return all indexed pages of the site, and your task will be to detect duplicates.

There is another equally simple method: searching by text fragments. To search in this way, enter a small fragment of text from your website, 10-15 words, into the search bar. If the results for the searched text contain two or more pages of your site, it will not be difficult to spot the duplicates.

However, these methods are suitable for sites consisting of a small number of pages. If the site has several hundred or even thousands of pages, then manually searching for duplicates and optimizing the site as a whole becomes an impossible task. There are special programs for such purposes; for example, one of the most common is Xenu's Link Sleuth.

In addition, there are special tools for checking the indexing status in the Google Webmaster Tools and Yandex.Webmaster panels. They can also be used to detect duplicates.

Methods for eliminating duplicate pages

Eliminating unnecessary pages can also be done in several ways. Each specific case has its own method, but most often, when optimizing a website, they are used in combination:

  • removing duplicates manually – suitable if all unnecessary ones were also detected manually;
  • merging pages using a 301 redirect – suitable if duplicates differ only in the absence and presence of “www” in the URL;
  • using the "canonical" tag - suitable in the case of unclear duplicates (for example, the situation mentioned above with product cards in an online store); it is implemented by adding a tag of the form <link rel="canonical" href="http://sitename.ru/stranica-kopiya"/> inside the head block of the duplicate pages;
  • correct configuration of the robots.txt file - using the “Disallow” directive, you can prohibit duplicate pages from being indexed by search engines.

Conclusion

The appearance of duplicate pages can become a serious obstacle to optimizing a site and bringing it to the top positions, so this problem must be addressed at the earliest stage of its occurrence.