Detecting bots: Lessons learned from watching logs
By Jimmy Ruska, 20+ years of experience in tech
Bots are often ignored and left to third-party services like GA, Datadog, or Segment to deal with. The truth is, these services don't fully filter out bad traffic, and they themselves may be blocked by Firefox's strict tracking protection and other privacy-sensitive browsers. In the coming age of first-party data and limited third-party tracking, bot detection is an important skill for every data-driven organization that needs clean data.
It's critical to understand your traffic in order to properly assess A/B testing, conversion ratios, product-market fit, and other key metrics. Imagine changing your conversion funnel incorrectly after an A/B test, just because several large players trying to build the next generation of language models are scraping the whole internet and cloaking themselves to look like real traffic.
Understanding your traffic
You might log into a server and check the access logs. For example, with nginx, you might stream updates to the access log:
tail -f access.log
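Streaming is useful for a quick look, but summarizing the log is usually more revealing. A minimal sketch, assuming nginx's default "combined" log format (client IP is the first field, user agent is the last quoted field):

# Top 20 user agents by request count
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -20

# Top 20 client IPs by request count
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20

Even this quick summary usually makes the bot traffic obvious before any deeper analysis.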
Non-JavaScript Rendering Bots
The first thing you might notice is a flurry of traffic from unsophisticated bots: Googlebot, Bing, Yandex, Ahrefs, Semrush, and so on. There are also plenty of IPs just scanning for exploits, for example vulnerable WordPress plugins, exposed .env files or AWS credentials, or a backup.tar.gz you might have left in the root directory.
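The exploit scanners are easy to spot by grepping the request path for common probe targets. A rough sketch, again assuming the combined log format (the request path is the seventh whitespace-separated field); the path patterns are just illustrative examples:

# Count exploit-probe requests, grouped by client IP
awk '$7 ~ /(\.env|wp-login\.php|xmlrpc\.php|\.git\/|backup\.tar\.gz)/ {print $1}' access.log \
  | sort | uniq -c | sort -rn | head -20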
There's little to worry about here: most of these bots do not trigger tracking pixels or JavaScript-based analytics, so they won't get erroneously logged as real traffic.
Many other bots are run for brand-reputation monitoring, SEO comparison, and even data scraping to train language models. These bots are harmful and monetize your content for someone else's business: SEO services sell your data to competitor sites so they can outrank you, and businesses steal your tutorials so their language models can answer questions in your domain.
Typically you should ban the user agents of all bots that offer no value, and any IPs that scan for exploits excessively. It's often not worth banning individual exploit scanners, though: they tend to sweep random cloud or hosting provider blocks, and you rarely see the same IP again.
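For the persistent scanners, one low-effort option is to generate an nginx deny list straight from the logs. A rough sketch, with a hypothetical threshold of 20 probes per IP and an output filename of your choosing; review the list before including it from your nginx http or server block:

# Build a deny list of IPs that probed for exploits more than 20 times
awk '$7 ~ /(\.env|wp-login\.php|xmlrpc\.php)/ {print $1}' access.log \
  | sort | uniq -c | awk '$1 > 20 {print "deny " $2 ";"}' > banned-ips.conf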
Detecting non-JavaScript rendering bots
- Often these bots have a telltale user agent: it's blank, contains “+http” (a link to the bot's info page), or mentions python, java, apache or github (a log-scanning sketch for these signals follows this list)
- They often fake the referrer. Sogou and Baidu bots, for example, send a referring search from their engine that has nothing to do with my site. They also include query string parameters in the referrer while claiming browser user agents that no longer pass them; for privacy reasons, modern browsers strip the query string of the referring page.
- They often lack common HTTP headers like Accept or Accept-Language
- They often don't store the cookies you send them
- This traffic is the easiest to ignore because it doesn't trigger tracking pixels
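Most of these signals can be checked straight from the access log. A minimal sketch that surfaces library-style user agents and blank referrers (the pattern list is illustrative, not exhaustive; Accept and Accept-Language aren't in the default combined format, so checking those would need a custom log_format that records them):

# Requests whose user agent looks like an HTTP library or declared bot
awk -F'"' 'tolower($6) ~ /\+http|python|java|apache|curl|wget|go-http|bot|spider/ {print $6}' access.log \
  | sort | uniq -c | sort -rn | head -30

# Requests with a blank referrer, grouped by client IP
awk -F'"' '$4 == "-" {split($1, a, " "); print a[1]}' access.log | sort | uniq -c | sort -rn | head -30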
JavaScript Rendering Bots
These are the bots you have to worry about, because they mess up your site statistics and conversion ratios. They usually run on Puppeteer, Selenium, or some other headless Chrome setup, while spoofing the user agent, plugins, and other fingerprint details to make the crawler look real. They often switch user agents at random.
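That random user-agent switching is itself a signal you can mine from the logs. A rough sketch that flags IPs using an unusually high number of distinct user agents (the threshold of 5 is an arbitrary example, and the field layout again assumes the combined log format):

# IPs that used more than 5 distinct user agents
awk -F'"' '{split($1, a, " "); print a[1] "\t" $6}' access.log | sort -u \
  | awk -F'\t' '{count[$1]++} END {for (ip in count) if (count[ip] > 5) print count[ip], ip}' \
  | sort -rn

Keep in mind that shared IPs (corporate NAT, mobile carriers) can legitimately show many user agents, so treat this as a signal rather than proof.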
Detecting sophisticated JS rendering bots
- Modern browsers don't send HEAD requests; these bots often do
- They claim to be Googlebot, but come from IP addresses outside Google's ranges, and the user agent often doesn't exactly match what the latest Googlebot reports (see the verification sketch after this list)
- Often these bots set the referrer to the very page being requested. This leads to nonsense in the logs like “the user went to the tracking pixel image by clicking a link on the tracking pixel”
- These bots will read your domain's SSL certificate, find all the subdomains it lists, and crawl them. Adding a honeypot subdomain to your certificate can help with detection
- These bots access robots.txt, ads.txt, and humans.txt, and probe for sitemap.xml or read its location from robots.txt
- They follow style="display:none" links
- They have typos in the user agent like “Mozlila”, the Russian exploit bot that floods requests
- They visit multiple domains from the same IP or /24 subnet in a short time period, or hit multiple low-volume pages without any referrer
- Their IP ranges belong to major cloud providers or hosting services. Many providers publish full lists of their IPs you can match against: AWS, GCP, Azure, Alibaba, OVH, Hetzner
- If they get no response, they try changing user agents, or send multiple simultaneous requests for the same page with different user agents
- They make requests to the server's IP address without specifying a Host header
- They make requests to subdomains that may not exist, like test and qa, to find subdomains that aren't behind Cloudflare's scraping protection. They also probe /m/ or /mobile/ to find mobile versions of the site
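Some of these checks can be scripted directly against the logs. Google, for example, documents that genuine Googlebot IPs reverse-resolve to googlebot.com or google.com hostnames; a rough sketch of that check (assuming the combined log format and that the host DNS utility is installed):

# Flag IPs claiming to be Googlebot whose reverse DNS isn't Google's
awk -F'"' 'tolower($6) ~ /googlebot/ {split($1, a, " "); print a[1]}' access.log | sort -u \
  | while read -r ip; do
      host "$ip" | grep -qE '(googlebot|google)\.com\.?$' || echo "fake googlebot: $ip"
    done

For a stricter check, Google's guidance is to also do a forward lookup of the returned hostname and confirm it resolves back to the same IP.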
Summary
Any large site with significant traffic should consider building a windowing data pipeline to filter out bot traffic, particularly sites with heavy A/B testing or that often measure rare events on anonymous users. Protect those p-values!
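Even without a full streaming pipeline, a rough offline version of the idea is to bucket requests per IP into fixed time windows and flag IPs that blow past a threshold. A minimal sketch over the access log (the one-minute window and the 120-request threshold are arbitrary examples, and a real pipeline would combine this with the signals above rather than rate alone):

# Requests per IP per one-minute window; print windows with more than 120 requests
awk '{
  split($4, t, ":")                     # $4 looks like [10/Oct/2025:13:55:36
  key = $1 " " t[1] ":" t[2] ":" t[3]   # client IP + day + hour + minute
  count[key]++
}
END {
  for (k in count) if (count[k] > 120) print count[k], k
}' access.log | sort -rn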