Duplicate content is two or more pages containing the same or very similar text. Duplicate content splits link authority and thus diminishes a page’s ability to rank in organic search results.
Say a website has two identical pages, each with 10 external, inbound links. That site could have harnessed the strength of 20 links to boost the ranking of a single page. Instead, the site has two pages with 10 links each, and neither ranks as highly as the single page would.
Duplicate content also wastes crawl budget and bloats search engines’ indexes.
Ecommerce sites create duplicate content. It’s a byproduct of platform settings and technology decisions. What follows are two good ways to remove duplicate content from search-engine indexes — and eight to avoid.
Remove Indexed Duplicate Content
To correct indexed, duplicate content, (i) consolidate link authority into a single page and (ii) prompt the search engines to remove the duplicate page from their index. There are two good ways to do this.
- 301 redirects are the best option. 301 redirects consolidate link authority, prompt de-indexation, and also redirect the user to the new page. Google has stated that it assigns 100 percent of the link authority to the new page with a 301 redirect. Bing and other search engines are more tight-lipped. Regardless, use 301 redirects only when the page has been permanently removed.
- Canonical tags. “Canonical” is a fancy word for something that is recognized as the one truth. In search engine optimization, canonical tags identify which page should be indexed and assigned link authority. The tags are suggestions to search engines — not commands like 301 redirects. Search engines typically respect canonical tags for truly duplicate content.
Canonical tags are the next best option when (i) 301 redirects are impractical or (ii) the duplicate page needs to remain accessible. For example, if you have two product grid pages, one sorted high-to-low and the other low-to-high, you wouldn’t want to redirect one to the other.
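To make the two options concrete, here is a minimal Python sketch, not a production implementation. The redirect map, the paths, and the function names are hypothetical. The canonical helper also assumes that the sort order lives in the query string, which may not be true on your platform.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

# Hypothetical map of permanently removed duplicate paths to the page that replaces each.
PERMANENT_REDIRECTS = {
    "/widgets-old/": "/widgets/",
}


def respond(path):
    """Return (status, headers): a 301 for a retired duplicate, otherwise 200."""
    target = PERMANENT_REDIRECTS.get(path)
    if target is not None:
        return 301, {"Location": target}
    return 200, {}


def canonical_tag(url):
    """Build a canonical link element pointing at the URL without its query string.

    Assumption: sorted variants differ from the main page only by query string.
    """
    parts = urlsplit(url)
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return '<link rel="canonical" href="%s">' % escape(clean, quote=True)
```

In this sketch the retired page answers with a 301 and a Location header, while both sorted grid pages stay reachable and carry the same canonical tag in their head.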
8 Methods to Avoid
Some options that remove — or claim to remove — duplicate content from search indexes are not advisable, in my experience.
- 302 redirects signal a temporary move rather than a permanent one. Google has said for years that 302 redirects pass 100 percent of the link authority. However, 302s do not prompt de-indexation. Since they take the same amount of effort to implement as 301s, 302 redirects should only be used when the redirect is truly temporary and will someday be removed.
- Meta refreshes are visible to shoppers as a brief blip or multisecond page load on their screen before the browser loads a new page. They are a poor choice due to the obnoxious user experience and the rendering time Google needs to process them as redirects.
- 404 errors indicate that the requested file isn’t on the server. While they prompt search engines to remove a page from their indexes, any link authority associated with the page you remove will evaporate. Try to 301 redirect a deleted page when you can.
- Soft 404 errors occur when the server 302 redirects a bad URL to what looks like an error page, which then returns a 200 OK server header response. For example, say www.example.com/page/ has been removed and should return a 404 error. Instead, it 302 redirects to a page that looks like an error page (such as www.example.com/error-page/), but returns a 200 OK response.
The 302 response inadvertently tells search engines that www.example.com/page/ is gone but might be coming back, so the page should remain indexed. Moreover, the 200 response tells search engines that www.example.com/error-page/ is a valid page for indexing. Soft 404s thus bloat the index even further by leaving two bad URLs indexed instead of one.
- Search engine tools. Google and Bing provide tools to remove a URL. However, since both require that the submitted URL returns a valid 404 error, the tools are a backup step after removing the page from your server.
- The meta robots noindex tag sits in the head of the HTML file. The noindex directive tells bots not to index the page. When applied after a page has been indexed, it may eventually result in de-indexation, but that could take months. Unfortunately, link authority dies with the engines’ ability to index the page. And since search engines must continue to crawl a page to verify that the noindex directive is still in place, this option doesn’t reduce the number of dead-weight pages the engines have to crawl. (Note, incidentally, that the nofollow directive of the meta robots tag has no impact on that page’s indexation.)
- Robots.txt disallow does not prompt de-indexation. Pages that are disallowed after they have been indexed are no longer crawled by search engine bots, but they may or may not remain indexed. It’s unlikely that these pages will show up in search results unless searched for by URL, however, because the search engines will no longer crawl them.
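The soft 404 pattern from the list above can be spotted in a log of redirect hops. The Python sketch below is illustrative only. It assumes you already have each hop as a (URL, status code) pair and that you know the URLs of your own error pages. The function name and data shape are hypothetical.

```python
def is_soft_404(hops, error_page_urls):
    """Return True when a 302 chain ends on a known error page that answers 200.

    hops: list of (url, status_code) pairs in request order, final response last.
    error_page_urls: set of URLs known to display an error message (an assumption;
    search engines use their own heuristics to decide what looks like an error page).
    """
    if len(hops) < 2:
        return False  # no redirect took place
    final_url, final_status = hops[-1]
    redirected_temporarily = any(status == 302 for _, status in hops[:-1])
    return (
        redirected_temporarily
        and final_status == 200
        and final_url in error_page_urls
    )
```

A removed URL that answers with a plain 404 is a single hop, so the sketch does not flag it. Only the 302-then-200 combination described above is flagged.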
While they’re not ideal for removing indexed content, meta robots noindex and robots.txt disallow should both prevent new duplicate content from being indexed. Their application, however, requires that duplicate content be identified before the launch of a new site, and they are not 100-percent effective.
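One way to check, before launch, that known duplicate pages actually carry these directives is a small audit script. The sketch below uses only Python’s standard library. The helper names are hypothetical, and it takes the HTML and robots.txt text as strings that you would fetch yourself.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noindex(html):
    """Return True if the page's meta robots tag includes noindex."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


def is_disallowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text blocks the user agent from the URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch(user_agent, url)
```

Note that the sketch only reports whether a directive is present. It cannot tell you whether a search engine has honored it.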
Your Best Bet
If you need a sure method of de-indexation, a 301 redirect or 404 error is your best bet because the server no longer loads the content that had been found on that page. If you need to de-index the page and harness the link authority, use a 301 redirect.