Privacy-first web analytics

What are web analytics? Many website owners want to know which pages on their website are popular and which are not. That way they can improve their site, make it easier for visitors to find the things the owner wants them to find, and generally know if anyone is visiting.

Back in the early days of the web you'd find out your visitor numbers by looking at your server web logs or even via web counters. But many website owners either didn't have access to their own web logs or considered them difficult to use. Then, in 2005, Google Analytics arrived.

Google Analytics (others are available, but GA is used by so many websites) is free and easy to use as long as you insert some Google code into your webpages. It produces nice graphs and lets you look at things in more detail, e.g. it can tell you the geographic location of your visitors more reliably than raw web logs do. But Google Analytics comes with trade-offs. The main one is that, in order to provide the service, Google gets to know about all your site visitors. And because Google Analytics is used on so many sites, Google can track those visitors across a lot of the web. That isn't great for the privacy of your website visitors, and opting out isn't the easiest thing for people to do. If you use it, you really should provide a straightforward opt-out mechanism, both to comply with GDPR and just to do the right thing by your visitors.

Privacy-first server logs

At ODI Leeds we try to do a good job on privacy. We do our best not to use cookies. We use our own web logs for counting visitor numbers on our website. We even take extra steps to anonymise what our web logs contain. Most web logs will store the IP address and what is called the User Agent string (e.g. that you use Edge/Chrome/Firefox, the version number, your operating system etc) provided by your web browser. Both can be personally identifying. But we also don't want to count every page you visit on our website as a separate "visitor" - we'd get a very wrong impression of how many people visited. So we have a compromise.

Each day we go through our web logs and replace every IP address with a visitor number for that day. Say I was the fifth visitor that day: we'd replace every instance of my IP address with the number "5". We then throw away all the original IP addresses in that day's log, so we have no way of going back to your IP address. We also chop down the User Agent string to make it much less unique. We don't need to know that you are using "Chrome 87.0.4280.88 (Official Build) (64-bit)" when "Chrome 87" is good enough for us to know how we may need to adjust our site, technically.
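To make the daily pass more concrete, here is a minimal sketch in Python. It is an illustration, not our actual script. It assumes log lines in the widely used "combined" format (IP address first, User Agent as the last quoted field), and the names anonymise_lines and shorten_user_agent, the regular expression and the short list of browsers are all made up for this example.

```python
import re

# Assumption: "combined" log format, IP first and User Agent as the final quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) (?P<middle>.*) "(?P<ua>[^"]*)"$')

# Checked in order: newer Edge also advertises "Chrome", so it has to come first.
# Deliberately short list for illustration; anything else becomes "Other".
BROWSER_TOKENS = (("Edg", "Edge"), ("Firefox", "Firefox"), ("Chrome", "Chrome"))


def shorten_user_agent(ua):
    """Reduce a full User Agent string to a browser name and major version."""
    for token, name in BROWSER_TOKENS:
        match = re.search(re.escape(token) + r"/(\d+)", ua)
        if match:
            return f"{name} {match.group(1)}"
    return "Other"


def anonymise_lines(lines):
    """Swap each IP for a per-day visitor number and shorten the User Agent."""
    visitor_numbers = {}  # IP -> number; only ever held in memory for this run
    output = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if not match:
            # Drop lines that do not fit the assumed format rather than keep a raw IP.
            continue
        ip = match.group("ip")
        # First sighting of an IP gets the next number; later sightings reuse it.
        number = visitor_numbers.setdefault(ip, len(visitor_numbers) + 1)
        middle = match.group("middle")
        browser = shorten_user_agent(match.group("ua"))
        output.append(f'{number} {middle} "{browser}"')
    return output
```

Running this once per day's log and keeping only the output means the mapping from IP address to visitor number disappears when the script exits, which is what stops anyone going back from "5" to an address.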

These techniques mean we can extract "visits" and page "views" by time and date, which is the vast majority of what we want to know, whilst making sure we try to know as little about our site visitors as possible. This means we follow the intent of GDPR: we are doing our best to limit what we collect to what is strictly necessary, and where we do get identifying information we make it anonymous pretty promptly.
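As a rough illustration of how visits and views can still be counted once the IP addresses are gone, here is a small Python sketch. It assumes each anonymised line starts with the day's visitor number in place of the IP address and otherwise keeps the "combined" log layout. The summarise function and its regular expression are hypothetical, and for simplicity it treats every GET request as a page view, which a real report would want to refine (for example by ignoring images and stylesheets).

```python
import re
from collections import Counter

# Assumption: visitor number, two ignored fields, [day:hour:...], then the request.
ENTRY = re.compile(
    r'^(?P<visitor>\d+) \S+ \S+ '
    r'\[(?P<day>[^:]+):(?P<hour>\d{2})[^\]]*\] '
    r'"GET (?P<path>\S+)'
)


def summarise(lines):
    """Count distinct daily visitors and page views by hour and by page."""
    views_by_hour = Counter()
    views_by_page = Counter()
    visitors = set()  # (day, visitor number) pairs; numbers restart each day
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue  # ignore anything that is not a GET in the assumed format
        views_by_hour[(match["day"], match["hour"])] += 1
        views_by_page[match["path"]] += 1
        visitors.add((match["day"], match["visitor"]))
    return {
        "visits": len(visitors),
        "views_by_hour": views_by_hour,
        "views_by_page": views_by_page,
    }
```

Because visitor numbers are only meaningful within a single day, the sketch counts a visit as a distinct (day, number) pair and makes no attempt to link visitors across days.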

Alternatives to Google Analytics

But if you don't have access to your server logs, there are alternatives. After we mentioned this topic on Twitter, several people sent suggestions of companies that already provide privacy-first analytics as a service. Some of them have free, self-hosted options, for when you want full control over where the data goes, or they have paid monthly cloud-based subscriptions. As several of them explicitly mention, their business model is selling software to you, not selling your visitors' data (or insight from it) to others. We haven't used any of these ourselves so can't vouch for them. But they show that privacy-first alternatives exist.

There are plenty more options out there too so don't see this as a definitive or "best" list. The more companies providing thoughtful privacy-first alternatives the better.

We are experimenting with improving our own privacy-preserving logging. Expect a follow-up post soon.