Detecting bots: Lessons learned from watching logs

By Jimmy Ruska, 20+ years of experience in tech

Bot traffic is often ignored and left to third-party services like GA, Datadog, or Segment to deal with. The truth is, these services don't fully filter out the bad traffic, and they themselves may be blocked by Firefox's strict tracking protection and other privacy-sensitive browsers. In the coming age of first-party data and limited third-party tracking, bot detection is an important skill for every data-driven organization that needs clean data.

It's critical to understand your traffic in order to properly assess A/B tests, conversion ratios, product-market fit, and other key metrics. Imagine changing your conversion funnel for the worse after an A/B test, just because several large players are scraping the entire internet to build the next generation of language models, and cloaking their crawlers to look like real traffic.

Understanding your traffic

You might log into a server and check the access logs. For example, with nginx, you can stream new entries in the access log as they arrive:

tail -f access.log
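
If you want to do more than eyeball the stream, a few lines of Python can turn raw lines into fields you can aggregate. Here's a rough sketch assuming nginx's default "combined" log format; the regex is illustrative and should be adjusted to whatever your log_format actually emits.

import re
from collections import Counter

# Rough pattern for nginx's default "combined" log format (an assumption --
# adjust it to whatever your log_format directive actually produces).
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse(line):
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

# Example: tally the most common user agents in the log.
agents = Counter()
with open("access.log") as f:
    for line in f:
        row = parse(line)
        if row:
            agents[row["user_agent"]] += 1

for agent, hits in agents.most_common(20):
    print(f"{hits:>8}  {agent}")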

Non-JavaScript Rendering Bots

The first thing you might notice is a flurry of traffic from unsophisticated bots: Googlebot, Bing, Yandex, Ahrefs, Semrush, and so on. There are also plenty of IPs just scanning for exploits, for example vulnerable WordPress plugins, exposed .env files or AWS credentials, or a backup.tar.gz you might have left in the web root.

No worries here: most of the bots you see at this level do not trigger tracking pixels or JavaScript-based analytics, so they won't get erroneously logged as real traffic.

Many of these bots exist for brand reputation monitoring, SEO comparison, and even data scraping to train language models. These bots are harmful and try to build a business on top of your content: SEO services selling your information to competitors so they can outrank you, or companies lifting your tutorials so their language models can answer questions in your domain.

Typically you should ban the user agents of bots that offer you no value, along with any IPs that scan for exploits excessively. It's often not worth banning individual exploit scanners, though; they tend to sweep random cloud or hosting provider IP blocks, and you rarely see the same address again.

Detecting non-JavaScript rendering bots
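
Most of these bots announce themselves in the user agent, and the legitimate crawlers (Googlebot, Bingbot) can be verified with a reverse-then-forward DNS lookup on the requesting IP, a check Google itself documents. Here's a minimal sketch of both ideas; the user agent list and hostname suffixes are examples, not an exhaustive set.

import re
import socket

# A few user-agent substrings that identify common crawlers.
# Illustrative, not exhaustive -- keep your own list current.
BOT_UA_RE = re.compile(
    r"googlebot|bingbot|yandex|baiduspider|ahrefsbot|semrushbot|"
    r"mj12bot|dotbot|petalbot|bytespider|gptbot|ccbot",
    re.IGNORECASE,
)

def looks_like_known_bot(user_agent):
    return bool(BOT_UA_RE.search(user_agent or ""))

def verify_googlebot(ip):
    # Reverse DNS the IP, check the hostname suffix, then confirm the
    # hostname resolves back to the same IP. Google documents this
    # reverse/forward check for Googlebot; Bingbot works the same way
    # with search.msn.com hostnames.
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# A request whose user agent claims to be Googlebot but fails the DNS
# check is almost certainly a scraper borrowing the name.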


JavaScript Rendering Bots

These are the ones you have to worry about skewing your site statistics and conversion ratios. They are usually built on Puppeteer, Selenium, or some other headless Chrome setup, with the user agent, plugins, and other fingerprint details modified to make the crawler look real. They often rotate user agents at random.

Detecting sophisticated JS rendering bots
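
There's no single reliable signal here, but behavioral heuristics applied per session go a long way: headless crawlers tend to fetch HTML without static assets, page through the site faster than a human can read, and never produce interaction events. The sketch below scores sessions on those tells; the Session fields and thresholds are made-up placeholders to illustrate the idea, not a production detector.

from dataclasses import dataclass, field

@dataclass
class Session:
    # Hypothetical per-session counters, however you choose to sessionize.
    page_views: int = 0
    static_asset_requests: int = 0   # images, CSS, fonts fetched
    interaction_events: int = 0      # clicks, scrolls, key presses reported by analytics
    total_seconds_on_site: float = 0.0
    user_agents_seen: set = field(default_factory=set)

def bot_suspicion_score(s):
    # Count how many headless-browser tells a session exhibits.
    # Thresholds are placeholders; tune them against traffic you trust.
    score = 0
    if s.page_views >= 3 and s.static_asset_requests == 0:
        score += 1   # renders HTML but never pulls images or CSS
    if s.page_views >= 5 and s.interaction_events == 0:
        score += 1   # no clicks or scrolls across many pages
    if s.page_views > 0 and s.total_seconds_on_site / s.page_views < 2:
        score += 1   # turns pages faster than a human can read
    if len(s.user_agents_seen) > 1:
        score += 1   # rotated user agents within one session
    return score

# Treat a score of 2+ as "exclude from experiment metrics" rather than
# "block" -- false positives are cheap to tolerate in analytics.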

Summary

Any large site with significant traffic should consider building a windowed data pipeline to filter out bot traffic, particularly sites that run heavy A/B testing or that measure rare events on anonymous users. Protect those p-values!
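
As a closing sketch of what that windowing might look like: bucket requests per client into fixed time windows, flag any client whose request rate in a window is implausible for a human, and drop those sessions before computing metrics. The window length and threshold below are placeholder values; in production this logic would live in your stream processor of choice.

from collections import defaultdict

WINDOW_SECONDS = 60        # fixed window length (placeholder value)
MAX_HITS_PER_WINDOW = 90   # sustained ~1.5 requests/second (placeholder value)

def flag_heavy_hitters(events):
    # events: iterable of (unix_timestamp, client_id) pairs.
    # Returns the client_ids that exceeded the per-window threshold.
    counts = defaultdict(int)   # (client_id, window_index) -> hits
    flagged = set()
    for ts, client in events:
        window = int(ts) // WINDOW_SECONDS
        counts[(client, window)] += 1
        if counts[(client, window)] > MAX_HITS_PER_WINDOW:
            flagged.add(client)
    return flagged

# Usage, assuming rows parsed from the access log with a timestamp field
# (hypothetical names): exclude flagged clients before computing metrics.
# bots = flag_heavy_hitters((row["ts"], row["ip"]) for row in parsed_rows)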