Part Three: Blocking Crawler Spam On Google Analytics

To view Part One: Ghost and Crawler Spam click here.

To view Part Two: Setting Up Views and Blocking Ghost Spam click here.

Welcome back to Part Three in our series on Google Analytics. In the first week, we started off our series by familiarizing ourselves with the differences between Ghost and Crawler Spam and what not to do when trying to deal with them. If you missed the first week you can find it here: Part One. In Part Two we showed you how to set up views and remove Ghost Spam from your Analytics Reporting. If you missed the second week you can find it here: Part Two.

This week we will discuss how to block Crawlers from your website. Not all Crawlers are bad, and many serve important functions on the web. For example, search engines such as Google use Crawlers to scan webpages for search terms. There are, however, malicious Crawlers that scour your website for information such as email addresses to send spam to, or for weaknesses in your code that can be exploited. These are the Crawlers we will be blocking.

In order to solve the problem of Crawlers, it is important to remember that they work very differently from Ghost Spam. As we discussed in Part One, the fundamental difference between Ghost and Crawler Spam is that Ghost Spam does not access your website, whereas Crawler Spam does. Because of this, we cannot tell Google Analytics to ignore Crawlers in your reporting as we did for Ghost Spam. Instead, we will block their access to the site altogether.

We will do this using the .htaccess file.

 

Blocking Crawlers with the .htaccess file

 

As we discussed in week one, the .htaccess file serves many important functions on your website. The most important of these is controlling who is allowed to access your web server. In this file we will make changes to block malicious Crawlers from accessing your site. This process is a little more complicated than our Ghost Spam filtering, and it must be done very carefully, as errors made while working on the .htaccess file can greatly affect your site. These changes may require additional work in the future, as spammers sometimes change their Crawlers in order to make it past blocks. Blocking from the .htaccess file will still help, though, as it makes it harder for spammers to access your site.

There are two different ways to do this:

  • Block Crawlers based on IP address
  • Block Crawlers based on Name, or “User-Agent Tag”

These two methods use different information about the Crawler bot that you can find from your website’s records, Google Analytics, and online lists of known bots such as this one: Crawler List.

To find the information from your internal records, you will need to download your log files using FTP, or File Transfer Protocol. FTP is a standard network protocol used to transfer computer files between a client and a server on a computer network. The location of these log files varies from site to site, so if you have a hard time finding them, your hosting company should be able to tell you where they are stored. Once you have access to your logs, you will need to open the file in a text editor to read the information and identify the Crawlers on your site. In addition to checking the list of known Crawler bots, you can identify malicious bots by how excessively they access your site. (Think several times a day, or in more extreme cases, several times an hour.) Once you have identified the bots, record their IP address and Name.
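To give you an idea of what to look for, a single request in a typical Apache access log (the common “combined” format) looks something like the line below. The IP address is the first field and the User-Agent string is the last quoted field. The address and bot name here are made-up placeholders rather than real spammers:

203.0.113.45 - - [12/Mar/2018:06:25:17 -0500] "GET /contact HTTP/1.1" 200 5120 "-" "ExampleSpamBot/2.1"

If an entry like this shows up dozens of times a day, it is a good candidate for your block list. Keep in mind that your host’s log format may differ slightly.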

You will then need to download your .htaccess file using FTP as well. As soon as you download the file, make a copy of it and leave that copy unedited. This is extremely important, as errors made in editing the .htaccess file’s code could block all access to your site. You will only make changes to one of the files so that a clean back-up is available if something goes wrong.

Now that you have your .htaccess file and the information about the Crawlers plaguing your site, it’s time to block them based on either their IP address or Name. We will discuss how to do both.

 

Blocking Crawler Spam Based on IP Address

 

Your website’s logs will record the IP address used by everyone who accesses your site, including Crawlers. You can block specific IPs by adding the following code to your .htaccess file:

Order Deny,Allow

Deny from XXX.XX.X.X

In this example, XXX.XX.X.X is used as a placeholder for the IP address you want to block. In order to block your intended Crawler, you will need to replace the Xs with its IP address. Order Deny,Allow simply means that if the web server receives a request to access the site that matches the Deny Rule, it will not be allowed access. If it doesn’t match the Deny Rule, it will be allowed onto the site.

The second line is the Deny Rule. In this example, the Deny Rule is telling the server to deny any requests from the IP address: XXX.XX.X.X. Anything trying to access the site from this IP address will receive an error message telling them they are forbidden from accessing the site.
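For instance, if your logs showed a bot repeatedly hitting your site from the (made-up) address 203.0.113.45, the finished rule would look like this:

Order Deny,Allow

Deny from 203.0.113.45

Apache also accepts a partial IP address here, so Deny from 203.0.113 would block every address beginning with those numbers. Use partial addresses with care, as they can also shut out legitimate visitors who happen to share that range.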

You can add more IPs by inserting additional Deny Rules following the first, with the same formatting:

Order Deny,Allow

Deny from XXX.XX.X.X

Deny from YYY.YY.Y.Y

Deny from ZZZ.ZZ.Z.Z
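One caveat: the Order and Deny directives shown here come from Apache 2.2. Many servers running Apache 2.4 still accept them through a compatibility module, but if yours does not, the equivalent rule in the newer syntax looks roughly like this (again with made-up addresses):

<RequireAll>

Require all granted

Require not ip 203.0.113.45

Require not ip 198.51.100.7

</RequireAll>

If you are not sure which version your server runs, your hosting company can tell you.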

That is all that you need to do to block Crawler bots from a specific IP address. As more Crawlers are discovered and changes are made to existing ones, you will need to update this list from time to time. Continue reading to find out how to block Crawlers by Name.

 

Blocking Crawler Spam Based on Name

 

If you already know the Name of the bot you want to block, you can use this information to deny it access to your site. This can be done by adding a segment of code to your .htaccess file that blocks based on the requester’s Name, referred to as its “User-Agent Tag”. This can be more straightforward than blocking by IP address and makes it easier to check your work, but it may also require additional maintenance in the future.

In this example we will block some search engine bots. You will most likely never want to do this as it can make your site harder to discover, but we will use it as an example as you should recognize some of the bots’ names. To block based on name enter this segment of code into your .htaccess file:

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} Googlebot [OR]

RewriteCond %{HTTP_USER_AGENT} AdsBot-Google [OR]

RewriteCond %{HTTP_USER_AGENT} msnbot [OR]

RewriteCond %{HTTP_USER_AGENT} AltaVista [OR]

RewriteCond %{HTTP_USER_AGENT} Slurp

RewriteRule . - [F,L]

What this code does is take a list of conditions (the RewriteCond lines) and apply a rule that denies matching requests access, much in the same way as the IP-address changes we discussed. Every condition except the last one ends with the [OR] flag. The F at the end of the RewriteRule stands for Forbidden and tells the server to deny access to any requester meeting those conditions. The L means it is the last rule in the set, and the RewriteRule line should always be inserted following the final condition as shown above.

You can add to this block of code as more Crawlers are discovered, making it easy to deny them access.
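For example, the snippet below adds two more conditions before the final rule. The bot names are hypothetical placeholders rather than real spammers, and the NC flag simply makes the match case-insensitive so different capitalizations are still caught:

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} BadBotExample [NC,OR]

RewriteCond %{HTTP_USER_AGENT} EmailHarvesterExample [NC]

RewriteRule . - [F,L]

Remember that only the last condition drops the [OR] flag, and the RewriteRule line always comes after it.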

 

Finishing Up

 

Once you’ve made the changes and blocked the Crawler bots you want to, you can save the .htaccess file and upload it to your server, overwriting the original one. Check to make sure that your website is accessible and has no errors, as errors would be an indicator of something wrong in the coding of the .htaccess file. Depending on which method you chose, you should now be blocking Crawlers based on either their IP address or Name. Any additional Crawler bots can be blocked in the same way.

In Closing

This concludes Part Three of our Google Analytics series. After making these changes, you will have rid your site of Ghost Spam and, at least for the time being, Crawler Spam as well. To keep Crawler bots off your site, you will need to be diligent in checking your reports for suspicious activity.

Invalid traffic means inaccurate reports, and inaccurate reports can cost you money. With both of these spam types blocked from your site, you will be one step closer to having accurate reports and a better understanding of how well your site is running.

We hope that this post has helped you gain a better understanding of how spam operates and can affect your website. Next week we will discuss more steps that can be taken to get the most out of your Google Analytics Reporting. In the meantime, if you have any questions about Ghost and Crawler Spam or any of Edifice Automotive’s services, fill out the contact form below and we will be happy to assist you.

 

Resources

 

Words to Know

  • Ghost Spam – a kind of spamming that involves making repeated website requests using a fake referrer URL pointing to the site the spammer wishes to advertise
  • Crawler Spam – a type of spam generated by internet bots that browse websites and log information
  • Hostname – where a visitor arrives at your website; it should be the same as your domain name
  • Source – how a visitor gets to your website, made up of three different types:
    • Referrals – a link from another website not including search engines
    • Organic – a link from an unpaid search engine listing, such as a Google search
    • Direct – a visit straight to your website by typing in your URL
  • .htaccess file – a directory-level configuration file supported by several web servers, used for configuration of site-access issues, such as URL redirection, URL shortening, Access-security control (for different webpages and files), and more.
  • Robot Exclusion Standard – also known as robots.txt or the Robot Exclusion Protocol, used by websites to tell web crawlers and other web robots what parts of the website not to process or scan
  • Referral Exclusion List – a list used to prevent duplicate referrals from third party services

Links

Spammers, Crawlers and Bot Lists:

http://www.user-agents.org/

http://www.robotstxt.org/db.html

http://www.botsvsbrowsers.com/

Google Analytics Resources:

https://support.google.com/analytics/?hl=en#topic=3544906

Special thanks to Carlos Escalara for his information on Ghost and Crawler Spam.

https://www.ohow.co/what-is-referrer-spam-how-stop-it-guide/#gs.zFg7TxY

 
