Data Centers Host Hidden Vermin.
As an experiment, I set up a website that would never get any real human traffic beyond the stray person following some link on Twitter. When I say the site has no human traffic, we’re talking ghost town: more than a couple of visitors a day would be amazing. That was the intent, though, because it made it easy to tell automated traffic from human traffic when roughly 99.999% of visitors were automated.
Blacklists are ineffective.
Many people still use simple lists of known user agents to block bad traffic coming to their servers, but when that bad traffic doesn’t want to be caught by those lists, it simply uses the same user agent as a browser. In fact, more bad traffic identifies itself as a browser than ever before, because the operators want access to your site without being easily stopped.
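To see just how trivial spoofing is, here is a minimal sketch in Python of how a scraper claims to be Chrome. The URL and function name are illustrative, not from the original article:

```python
import urllib.request

# Any scraper can defeat a user-agent blacklist with one header:
# it simply claims to be a mainstream browser.
FAKE_BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

def build_spoofed_request(url: str) -> urllib.request.Request:
    """Build a request that is indistinguishable, by user agent
    alone, from a real Chrome browser."""
    return urllib.request.Request(
        url, headers={"User-Agent": FAKE_BROWSER_UA}
    )

req = build_spoofed_request("http://example.com/")
# urllib stores header keys capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

On the server side there is nothing in this request’s user agent to blacklist without also blocking every real Chrome user.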
The Smoking Gun. Seeing is Believing.
The report below shows the activity coming from 126 data centers, with fake browser user agents highlighted in a different color for each browser. The data was filtered to show each unique occurrence of a user agent coming from a host. There may have been many repeat visits, many IPs used, and so on, but that’s too much data to include here; the purpose is simply to show what kinds of user agents are coming from data centers where there are no humans, only servers.
One caveat: bots that are allowed access to the site, such as googlebot, bingbot, etc., won’t show up on this report, because we were only looking for things hiding in the woodwork that didn’t want to be identified. If you do see something identifying itself as googlebot or another well-known crawler, it was either a fake or the real thing crawling through a proxy, which, if allowed, could result in proxy hijacking of a site in some search engines. Don’t trust just anything claiming to be googlebot: scrapers impersonate it to bypass blacklists and some bot blockers, because googlebot and the other major crawlers are typically allowed access, which makes their user agents a vulnerability in many simplistic bot blockers.
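Such claims can actually be checked: Google documents that a real Googlebot IP has a PTR record ending in googlebot.com or google.com, and that the name resolves back to the same IP. A minimal sketch of that reverse-plus-forward DNS check in Python (the function name is my own):

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Verify a claimed Googlebot IP via reverse DNS plus
    forward confirmation, as Google recommends."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup (PTR)
    except OSError:
        return False  # no PTR record: not a real Googlebot
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # PTR points outside Google's domains
    try:
        # Forward-confirm: the hostname must map back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

Anything that sends a googlebot user agent but fails this check is a scraper hiding behind Google’s name and can be blocked safely.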
User Agent Report by Host
Click to view User Agent by Host in a new tab.
Tip of the iceberg.
There was a lot of additional activity on the site that isn’t in the report, including other hosts that aren’t currently on my block list because their IP ranges are mixed use, such as DSL services, and can’t easily be separated from the hosting IPs. Automated activity was also tracked from IPs assigned to residential use on various ISPs; that traffic isn’t included either. This is just one specific example of the bad activity going on at hosting centers, which now accounts for about half of the bad action: many scrapers have high-speed cable and no longer need a server to do their dirty deeds, which makes them even harder to block.
What are all these bots doing?
Here’s a brief list of what some of these browser user agents are actually doing on your site.
- Data mining
- Brand monitoring
- Copyright monitoring
- Reputation monitoring
- SEO Tools
- Screen Shots / Page Preview
- Web proxies
- Domain intelligence
- And more, all parasitic
Blacklists don’t work; they’re a waste of time.
What we can take away from this is that blacklists are a big waste of time: they only stop traffic that makes itself easy to identify, and most bad traffic has no intention of doing that; it wants what it wants without anyone interfering. Worse, blacklists only stop user agents that are already known, and some scrapers send random user agents, so a blacklist will never stop them.
Blocking data centers is the only way.
They’re profiting off your hard work for free, and often intercepting your profits too when they outrank your site for your own content, so ignoring the situation isn’t an option.
Prohibiting all port 80 activity coming from those data centers is the only effective way to stop all of the unwanted traffic. Remember, we’re only talking about port 80 blocking here, so you won’t interfere with email or anything else as long as you block only port 80. If there’s something you do want to allow, such as a new search engine or some other site, exceptions can be made by whitelisting specific IPs and user agents ahead of the data center block list, letting everything operate without any problems.
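The whitelist-before-blocklist ordering above can be sketched as generated iptables rules. This is an illustrative sketch only; the IPs and ranges are documentation placeholders, not real data center addresses:

```python
# Generate port-80-only iptables rules: ACCEPT rules for
# whitelisted IPs come first, then DROP rules for each
# data-center range. Matching only --dport 80 leaves email
# and other services untouched.

WHITELIST_IPS = ["203.0.113.10"]          # e.g. a crawler you allow
DATACENTER_RANGES = ["198.51.100.0/24"]   # a blocked hosting range

def port80_rules(whitelist, blocked_ranges):
    """Return iptables commands with whitelist exceptions
    evaluated ahead of the data-center block list."""
    rules = [f"iptables -A INPUT -p tcp --dport 80 -s {ip} -j ACCEPT"
             for ip in whitelist]
    rules += [f"iptables -A INPUT -p tcp --dport 80 -s {cidr} -j DROP"
              for cidr in blocked_ranges]
    return rules

for rule in port80_rules(WHITELIST_IPS, DATACENTER_RANGES):
    print(rule)
```

Because iptables evaluates rules in order, the ACCEPT exceptions fire before the broad DROP ever matches, which is exactly the “whitelist ahead of the block list” behavior described above. (User-agent exceptions would need an application-level filter; iptables only sees IPs and ports.)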
Hope this explains the situation and shows that activity from data centers is a more substantial problem than most are willing to admit. The vermin are out there like cockroaches on the internet, feeding on your sites daily, and they’re pretty much unstoppable unless you start blocking data centers.