How to Exclude WordPress Content From Google Search

Sometimes you need to exclude certain content on your website from being indexed in Google Search. Before the emergence of search engines like Google, "indexing" was a word most commonly associated with books.

Fast forward to 1995, during the internet boom, and we had the Yahoo search engine; come 1997, Google Search changed how we search for and access information on the internet.

What is Google Indexing?

There are various search engines, each with its own indexing format, but the most popular are Google, Bing and, for privacy-minded individuals, DuckDuckGo.

Google indexing refers to the process of adding new web pages, including digital content such as videos, documents, and images, and storing them in Google's database. In simple words, for your site's content to appear in Google search results, it first needs to be stored in the Google index.

Google indexes all these digital pages and content using spiders, crawlers, or bots that rapidly crawl websites across the internet. These crawlers follow the website owner's instructions on what to crawl and what to ignore.

How to Exclude WordPress Content From Google Search

  1. Using Robots.txt:- This is a file located in the root directory of your site that gives Bing, Google, Yahoo, and other search engine bots instructions on what to crawl and what not to. While robots.txt is generally used to control crawling traffic and web crawlers, it can also be used to keep images out of Google search results.

A normal robots.txt file would look like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

 

The standard robots.txt file begins with a User-agent line; the asterisk is a wildcard instructing every bot that visits the website to follow the rules that come after it.
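You can also address a specific crawler by name ahead of the catch-all group; a minimal sketch (the /private/ path is only an illustration):

User-agent: Googlebot
Disallow: /private/   # Applies only to Google's main crawler.

User-agent: *
Disallow: /wp-admin/  # Every other compliant bot follows this group.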

 

Keep bots away from specific files with robots.txt:-

Robots.txt can be used to stop search engines from crawling digital files such as PDFs, JPEGs, or MP4s. To block crawling of PDF and JPEG files, add the following code to the robots.txt file.

PDF Files:

User-agent: *
Disallow: /pdfs/   # Block the /pdfs/ directory.
Disallow: /*.pdf$  # Block PDF files from all bots.

 

Images:-

User-agent: Googlebot-Image
Disallow: /images/cats.jpg   # Block cats.jpg for Google specifically.
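To keep every image on the site out of Google Images rather than a single file, Google documents that the same user agent can simply be disallowed from the entire site:

User-agent: Googlebot-Image
Disallow: /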

Please remember that robots.txt is not an appropriate way to block confidential or sensitive files and content, owing to the following limitations:

  • Robots.txt only instructs well-behaved crawlers; other bots and non-compliant search engines simply ignore its instructions.
  • Robots.txt is publicly accessible, so anyone can read the instructions listed there and access those files and content directly.
  • Robots.txt doesn't stop the server from sending pages and files to unauthorized users upon request.
  • Search engines can still find and index the pages and content you block if they are linked from other sources and websites.
  2. Using noindex Meta Tag for Pages:- Using the "noindex" meta tag is the proper and effective way to block search engine indexing. Unlike robots.txt, the noindex meta tag is placed in the head section of a webpage.

<html>
<head>
<title>….</title>
<meta name="robots" content="noindex">
</head>

Having this code in the head section will prevent the page from showing up in search engine results.

  3. Using X-Robots-Tag HTTP header for other files:-

The X-Robots-Tag gives you the flexibility to block search engine spiders from indexing your content and files. Unlike the noindex meta tag, which works only in HTML pages, it is sent as part of the HTTP header response for any URL, so it also covers files such as PDFs, videos, and images.
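For example, a server response that tells crawlers not to index a PDF or follow its links might look like this (a sketch showing only the relevant headers):

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, nofollow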

  4. Using .htaccess Rules for Apache servers:- You can also add the X-Robots-Tag HTTP header to the .htaccess file to block crawlers from indexing the web pages of a site hosted on an Apache server. Unlike meta tags, .htaccess rules can be applied to the whole site or to a particular folder, and their support for regular expressions offers the flexibility to target multiple files at once.
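For instance, with Apache's mod_headers module enabled, a rule like the following attaches the header to every PDF on the site (a minimal sketch; adjust the file pattern to suit your content):

<FilesMatch "\.pdf$">
# Tell crawlers not to index matching files or follow their links.
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>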

To block Googlebot, Bingbot, and Baiduspider from crawling a site, use the following rules:

RewriteEngine On
# Match any of the bot names in the User-Agent header, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|Baiduspider) [NC]
# Answer with 403 Forbidden and stop processing further rules.
RewriteRule .* - [R=403,L]

  5. Using page authentication with username & password:-

The methods above will prevent your confidential data and documents from displaying in search engine results. However, any user who has the link can still reach your data and files directly. For security purposes, it's recommended to set up proper authentication with a username and password, as well as role permissions.

To do this in WordPress, simply set the visibility of a post or page to password protected, then choose the password required to view the content of that page. This is easy to do on a per-page/post basis.
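For content that lives outside WordPress posts and pages (for example, a folder of uploaded documents), HTTP Basic Authentication via .htaccess achieves a similar effect. A minimal sketch, assuming you have already created a password file with the htpasswd utility at a path of your choosing:

AuthType Basic
AuthName "Restricted Area"
# Absolute path to the password file created with htpasswd.
AuthUserFile /full/path/to/.htpasswd
Require valid-user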