Part Two: Setting Up Views & Blocking Ghost Spam From Google Analytics

To view Part One: Ghost and Crawler Spam click here.

To view Part Three: Blocking Crawler Spam click here.

Welcome back to Edifice Automotive’s series about removing Ghost and Crawler Spam from your Google Analytics. Last week we started off our series with an explanation of the different types of spam and how they work. If you missed that blog post you can find it here.

This week we will show you how to deal with Ghost Spam and remove it from your Google Analytics reporting. Before we do that, however, we need to talk about setting up views.

Setting Up Views



Views are the level in a Google Analytics account where you can access reports and analytics tools. Views are important for organizing your Google Analytics because data cannot be altered once it has been processed, and it cannot be recovered if it was not recorded in the first place. In other words, if you lose part of your data due to a wrongly configured filter or setting, it will be gone forever. For this reason we use different Views to test out settings and apply different filters while still receiving all of the website’s data. Google Analytics automatically creates a view for each of the websites you are monitoring with the name “All Website Data.” To protect your data and make sure that you do not lose anything, you will not make any changes to this view other than verifying that the time and date are correct. Instead we will show you how to create three new views with the names “Main,” “Test,” and “Unfiltered Data.”


To make things faster we will simply copy the original view, “All Website Data.” 

  1. Go to the Admin tab and select View Settings
  2. Make sure that all the information is correct, such as your time and date settings. (This will save you from having to make the same changes in any subsequent copies of this view.)
  3. Once you have verified that everything is in order go to Copy View.

4. Change the name of the new view to Main

5. Click Copy View to confirm the copy.

6. Repeat these steps for the Test and Unfiltered views.

From now on, all changes made to your Google Analytics account will be done in these views. It is best to use the Test view when setting up new filters or changing settings that could cause you to lose data. Once you have confirmed that a filter does what you intended in the Test view, you can then apply it to the Main view. As its name suggests, the Unfiltered view should not have any filters applied to it, so that you can always see the raw data your site has collected.

Now that we have set up views for testing, it’s time to make a filter to take care of Ghost Spam.

Blocking Ghost Spam

 

Last week we discussed how Ghost Spam shows up in your Google Analytics without ever visiting your site: it sends fake hits directly to the Google Analytics servers through the Measurement Protocol. Since the bot has never visited your site, the Source and Hostname, along with all of the other data that Google Analytics usually collects, are fake or simply not set.
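To make this concrete, here is a small Python sketch of how the Hostname ends up in a hit. It assumes the Universal Analytics Measurement Protocol, where the hostname travels in the dh parameter of the hit payload. The payloads, the tracking ID UA-XXXXX-Y and the helper name reported_hostname are made up for illustration, and nothing is sent anywhere. The point is only that the Hostname is whatever the sender chooses to put in the payload, or nothing at all.

```python
from urllib.parse import parse_qs


def reported_hostname(payload):
    """Return the hostname a hit claims, or "(not set)" when it claims none."""
    fields = parse_qs(payload)
    # "dh" is the document hostname parameter; a ghost hit can fake it or omit it.
    return fields.get("dh", ["(not set)"])[0]


# Hypothetical payloads: one real pageview and two ghost hits.
real_hit = "v=1&tid=UA-XXXXX-Y&cid=555&t=pageview&dh=www.yoururl.com&dp=%2Fhome"
ghost_fake = "v=1&tid=UA-XXXXX-Y&cid=777&t=pageview&dh=google.com&dr=http%3A%2F%2Fspam.example"
ghost_blank = "v=1&tid=UA-XXXXX-Y&cid=888&t=pageview&dr=http%3A%2F%2Fspam.example"

for hit in (real_hit, ghost_fake, ghost_blank):
    print(reported_hostname(hit))
```

Only the first hit carries a Hostname that belongs to your site, which is exactly what the filter in this post relies on.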

Therefore, by only allowing visits that have a valid Hostname, you will block all Ghost Spam. To do this you will need to find all of your Hostnames. It is imperative that you select all valid Hostnames for your site so you do not exclude any valid traffic.

How to find your Hostnames

To get a list of your Hostnames:

  1. Go to the Reporting section of Google Analytics and select a wide timeframe on the calendar, preferably one year. (If your website is less than a year old, just select the entirety of its life.)
  2. In the sidebar select Audience.
  3. Expand Technology and select Network.
  4. At the top of the report make sure to select Hostname. (Service Provider is selected by default.)

From here you will be brought to a table of Hostnames. From this list you will select all of the valid Hostnames that should be included in Google Analytics’ Reporting. At the very least you should see one valid Hostname: your website’s URL or Main Domain. Other valid Hostnames may include any services you use on your website, such as external payment and translation services. Make a list of all your valid Hostnames. You can even go further and take this opportunity to leave out Hostnames that are not spam but are still irrelevant traffic, like DEV or test environments.

An INVALID Hostname is essentially any other Hostname that you don’t recognize. You may see:

  • Hostnames with the URLs of the spammer website.
  • Known pages like google.com or amazon.com  (spammers use these names to mislead people)
  • (not set): the most common Hostname for spam. This appears when the spammer doesn’t even bother to set a fake Hostname.

5. Once you gather all your valid Hostnames, you should create a Regular Expression, or REGEX, that includes these Hostnames. Used in an Include filter, the REGEX tells Google Analytics to ignore any visits without one of the specified Hostnames.

Here are some important formatting rules for writing a Regular Expression:

  • To separate each Hostname, use a bar | (this works as OR).
  • The period is a special character in REGEX (it matches any character), so add a backslash \ before it. Escaping the hyphen as well, as in the example below, is harmless.
  • Try to find a good way to match all the Hostnames. For example, if you want to match “yoururl.com,” “es.yoururl.com,” and “www.yoururl.com,” you don’t need to add all of them to the expression. Entering “yoururl” will be enough to match all of them as long as it is not a common name.
  • Don’t leave spaces.
  • The REGEX has a limit of 255 characters. If your expression exceeds this limit, try to optimize it to keep everything under one expression, as you can only have 1 active Hostname filter.
  • Don’t add a bar | at the beginning or the end of the expression. An empty alternative matches everything, so the filter would let all traffic through.
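If you keep your valid Hostnames in a list, the formatting rules above can be sketched as a small Python helper. This is only an illustration: the function name build_hostname_pattern and the sample Hostnames are made up, and Python’s re.escape is used as a stand-in for escaping by hand.

```python
import re

MAX_PATTERN_LENGTH = 255  # the filter pattern limit mentioned above


def build_hostname_pattern(hostnames):
    """Join hostnames into one expression, escaping special characters."""
    # Drop blanks and duplicates while keeping the original order. A blank
    # entry would create a stray bar, which matches everything.
    cleaned = []
    for name in hostnames:
        name = name.strip().lower()
        if name and name not in cleaned:
            cleaned.append(name)
    if not cleaned:
        raise ValueError("no hostnames given")

    # re.escape adds the backslash before periods and hyphens.
    pattern = "|".join(re.escape(name) for name in cleaned)
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError(
            f"pattern is {len(pattern)} characters; the limit is {MAX_PATTERN_LENGTH}"
        )
    return pattern


pattern = build_hostname_pattern(["yoururl.com", "cdn-service.com", " ", "yoururl.com"])
print(pattern)
```

The helper raises an error instead of silently trimming a long expression, because dropping a Hostname would mean losing valid traffic.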

Your REGEX should look something like this:
yoururl\.com|translatingservice\.com|webcacheservice\.com|videoservice\.com|206\.190\.45\.150|shoppingcart\.com|cdn\-service\.com

It is crucial that you add all the relevant Hostnames. Otherwise, you run the risk of losing valid data.
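Before pasting the expression into Google Analytics, you can try it against a few Hostnames on your own machine. The Python sketch below uses the example expression from this post. It assumes the filter matches the pattern anywhere in the Hostname, which is what the “yoururl” tip above implies. The sample Hostnames and the helper name is_valid_hostname are made up.

```python
import re

# The example expression from this post (all Hostnames are placeholders).
PATTERN = re.compile(
    r"yoururl\.com|translatingservice\.com|webcacheservice\.com|videoservice\.com"
    r"|206\.190\.45\.150|shoppingcart\.com|cdn\-service\.com"
)


def is_valid_hostname(hostname):
    """True when the expression matches anywhere in the hostname."""
    return PATTERN.search(hostname) is not None


samples = ["www.yoururl.com", "es.yoururl.com", "(not set)", "google.com", "spammer-site.example"]
for hostname in samples:
    print(hostname, "->", "keep" if is_valid_hostname(hostname) else "drop")

# A trailing bar adds an empty alternative, which matches every hostname.
broken = re.compile(r"yoururl\.com|")
print(broken.search("(not set)") is not None)
```

The last line shows why a stray bar is dangerous: the broken expression also accepts “(not set)”, so the filter would no longer block anything.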

Once you make sure the expression is correct, it’s time to create the filter.

6. Go to the Admin tab and select the Test view.

7. Select Filters in the last column, “View.”

8. Select + New Filter

9. Select Create New Filter and enter Include Valid Hostnames as the name.

10. In Filter Type select Custom.

11. Make sure you choose INCLUDE (you may need to scroll down a little) and select Hostname from the dropdown menu.

12. Finally, paste the Hostname REGEX that you built previously in the Filter Pattern Box.

13. Select Verify this filter. It will display a table with sample data from before and after applying the filter. (You should only see invalid Hostnames in the left column.)

14. After making sure that no valid traffic is excluded, you can save the filter.
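If you exported your Hostname report, you could run a similar before-and-after check yourself. The Python sketch below imitates what the Include filter does to a set of sessions. The Hostnames, session counts and the shortened expression are invented for illustration and are not real data.

```python
import re

# A shortened, hypothetical expression with two valid Hostnames.
pattern = re.compile(r"yoururl\.com|shoppingcart\.com")

# Made-up session counts per Hostname, as they might appear in the report.
sessions_by_hostname = {
    "www.yoururl.com": 1200,
    "checkout.shoppingcart.com": 85,
    "(not set)": 310,
    "google.com": 42,
    "free-traffic.example": 17,
}

# An Include filter keeps only the rows whose Hostname matches the expression.
kept = {h: n for h, n in sessions_by_hostname.items() if pattern.search(h)}
excluded = {h: n for h, n in sessions_by_hostname.items() if not pattern.search(h)}

print("Kept:", kept, "total", sum(kept.values()))
print("Excluded:", excluded, "total", sum(excluded.values()))
```

If a Hostname you recognize shows up in the excluded group, add it to your expression before saving the filter.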

And that’s it! You have successfully eliminated Ghost Spam from your Google Analytics Reporting. While this solution does not require much maintenance, it is important to remember that every time you add your tracking ID to a new service that you want to track, you need to add that service’s Hostname to your REGEX. You are now one step closer to unlocking Google Analytics’ true potential and having the most accurate information about your website.

 

In Closing

We hope that this post has given you a better understanding of how to eliminate Ghost Spam once and for all. Next week we will discuss how to block Crawlers from your Google Analytics. In the meantime, if you have any questions about Ghost and Crawler Spam or any of Edifice Automotive’s services, fill out the contact form below and we will be happy to assist you.

Resources

 

Words to Know

  • Ghost Spam – a kind of spam that sends fake hits straight to Google Analytics without ever visiting your website, often using a fake referrer URL that points to the site the spammer wishes to advertise
  • Crawler Spam – a type of spam generated by internet bots that browse websites and log information
  • Hostname – the domain on which a visitor arrives at your website; it should be the same as your domain name
  • Source – how a visitor gets to your website, made up of three different types:
    • Referrals – a link from another website not including search engines
    • Organic – a link from an unpaid search engine listing, such as a Google search
    • Direct – a visit straight to your website by typing in your URL
  • .htaccess file – a directory-level configuration file supported by several web servers, used for configuration of site-access issues, such as URL redirection, URL shortening, Access-security control (for different webpages and files), and more.
  • Robots Exclusion Standard – also known as robots.txt or the Robots Exclusion Protocol, used by websites to tell web crawlers and other web robots what parts of the website not to process or scan
  • Referral Exclusion List – a list used to prevent duplicate referrals from third party services

Links

Spammers, Crawlers and Bot Lists:

http://www.user-agents.org/

http://www.robotstxt.org/db.html

http://www.botsvsbrowsers.com/

Google Analytics Resources:

https://support.google.com/analytics/?hl=en#topic=3544906

Special thanks to Carlos Escalara for his information on Ghost and Crawler Spam.

https://www.ohow.co/what-is-referrer-spam-how-stop-it-guide/#gs.zFg7TxY

 
