
Improving privacy in our web analytics

We wrote recently about our aim to improve our website analytics whilst maximising privacy. Our current approach, using web access logs (with additional anonymisation), largely suits our needs but also has limitations. It can only track page loads; sometimes it would be useful to know about other events, such as the start of a video playback. It also can't be used on websites hosted on GitHub Pages, which we use for quite a lot of our project-based work.

Until now we've had to rely on either not knowing if anyone visits our GitHub-hosted projects or including a resource from our main server (which then gets logged using our existing approach, above). It can be very tempting to add Google Analytics to these, and we have done so in one instance at the request of the organisation we created a tool for. But we wondered if we could bring a privacy-first approach to these sites too. So we started making our own.

Our requirements were:

  1. privacy protecting - all identifying information, such as IP addresses and User Agent strings, should be anonymised as soon as possible;
  2. no cookies - to avoid the need for cookie-policy notices and opt-out forms, we'd simply not have cookies at all;
  3. low weight - any code should be as unobtrusive as possible and generate minimal CO2 emissions.

So we wrote a tiny amount of JavaScript (1.5kB without any minification) to send log actions to a small bit of server-based code. If people don't have JavaScript enabled they don't get logged - not the end of the world, and probably what they would want.
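To give a flavour of what that looks like, here is a minimal sketch of the client-side part (this is illustrative, not our actual script; the endpoint URL and field names are placeholders):

```javascript
// Minimal sketch of a client-side logger (illustrative only; the real
// script, endpoint URL, and field names will differ).
(function () {
	// Nothing runs without JavaScript, so visitors with it disabled are never logged.
	function logEvent(action) {
		var payload = {
			action: action,                    // e.g. "pageview" or a video play start
			page: location.pathname,           // which page the event happened on
			referrer: document.referrer || "", // where the visit came from
			w: window.innerWidth,              // browser window size, not a fingerprint
			h: window.innerHeight
		};
		var body = JSON.stringify(payload);
		// sendBeacon survives page unloads; fall back to fetch if it isn't available.
		if (navigator.sendBeacon) {
			navigator.sendBeacon("https://example.org/log", body);
		} else {
			fetch("https://example.org/log", { method: "POST", body: body, keepalive: true });
		}
	}
	// Log the page load, and expose the function for other events (e.g. media plays).
	logEvent("pageview");
	window.logEvent = logEvent;
})();
```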

The code reports back to an endpoint on our main server, which stores the data in a log on the web server for later analysis. This improves upon previous website access log approaches in the following important ways:

  1. It enables us to collect richer information about devices, whilst discarding anything that could be used to fingerprint the end user (such as browser user agent strings).
  2. We can instrument more actions on the site, such as menu clicks or media play starts.
  3. We have complete control over where the logging data is stored, and can manage data reporting, archiving, and deletion. We know, and can confidently tell anyone using our site, that their web browsing activity will not leave our site!
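The endpoint itself needs to do very little: accept the event, strip out anything identifying, and append a trimmed-down record to a log file. This Node.js sketch is a stand-in for our actual server code, written under the assumption of a JSON payload like the one above; note that it deliberately never touches the IP address or user agent:

```javascript
// Illustrative Node.js logging endpoint (not our production code): append each
// event to a local log file, keeping none of the identifying request headers.
const http = require("http");
const fs = require("fs");

http.createServer((req, res) => {
	if (req.method === "POST" && req.url === "/log") {
		let body = "";
		req.on("data", chunk => { body += chunk; });
		req.on("end", () => {
			try {
				const event = JSON.parse(body);
				// Store only what we need: no IP address, no user agent string.
				const record = {
					time: new Date().toISOString(),
					action: event.action,
					page: event.page,
					referrer: event.referrer,
					w: event.w,
					h: event.h
				};
				fs.appendFile("analytics.log", JSON.stringify(record) + "\n", () => {});
				res.writeHead(204);
			} catch (e) {
				res.writeHead(400);
			}
			res.end();
		});
	} else {
		res.writeHead(404);
		res.end();
	}
}).listen(8080);
```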

We have deliberately limited the information that we capture about each interaction as, most of the time, all we're really interested in is the number of page visits (or video starts). We track the referrer, as it tells us whether the page visit originated from someone clicking around the ODI Leeds site or from an external referrer.

We decided against capturing the IP address of the browser, partly because it is potentially identifying, but more because of the issue of NAT gateways: everyone behind a corporate firewall can appear to share the same address, so it can't reliably distinguish between users anyway. We do capture the browser window size, as this provides useful information about the ways in which the site is being viewed. This, more so than user agent strings, has immediate benefits in letting us know where we need to adjust the layout of pages to fit different screens.
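At analysis time the raw window widths can then be grouped into a handful of coarse buckets, like the narrow/medium/wide split shown below. The thresholds here are assumptions for the sake of illustration, not the values we actually use:

```javascript
// Illustrative only: group logged window widths (in pixels) into coarse buckets.
// The threshold values are assumptions, not the ones we actually use.
function widthBucket(w) {
	if (w < 600) return "narrow";
	if (w < 1000) return "medium";
	return "wide";
}
```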

A simple graph showing views and "plays" from 1 January 2021 to 19 January 2021. Credit: ODI Leeds Analytics.

Screen sizes of visits to the ODI Leeds website: narrow screens account for 21% of views, medium-width screens 20%, and wider screens 60%. Credit: ODI Leeds Analytics.

By default we limit what we know. There could be times or situations where we have a genuine need to identify unique visits or even unique visitors - these two being progressively more privacy-encroaching. Doing so would compromise the anonymity of our website visitors, and we've taken to referring to this sliding scale as the creepiness/functionality spectrum: as one increases, so does the other.

If, for example, we wanted to be able to determine the types of visits that people make to our site, we might want to associate the events that come from the same browser session. This might be achieved with the browser's Session Storage, which is tied to a browser window and is cleared when the window closes. A timeout could also be applied so that a new tracking identity is created if the visit is paused for a period of time. This would compromise privacy a little, but would enable richer analysis: profiling the journeys people take through the site. That could be invaluable in redesigning the user experience.
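As a hypothetical sketch of how that might work (we don't currently do this), a short-lived identifier could be kept in sessionStorage and reset after a period of inactivity. The key names and the 30-minute timeout are arbitrary choices:

```javascript
// Hypothetical sketch (we don't currently do this): a short-lived session ID
// kept in sessionStorage and reset after 30 minutes of inactivity.
function getSessionId() {
	var TIMEOUT = 30 * 60 * 1000; // 30 minutes, an arbitrary choice
	var now = Date.now();
	var last = parseInt(sessionStorage.getItem("lastSeen") || "0", 10);
	var id = sessionStorage.getItem("sessionId");
	// Start a new session if there isn't one, or the previous one has gone stale.
	if (!id || now - last > TIMEOUT) {
		id = Math.random().toString(36).slice(2);
		sessionStorage.setItem("sessionId", id);
	}
	sessionStorage.setItem("lastSeen", String(now));
	return id;
}
```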

Further along the creepiness/functionality spectrum lies the idea of linking visits using a more persistent tracking identifier over a longer period of time to build a profile of a user as they visit and revisit the site.

While these thought experiments - layering on increasing amounts of user tracking - could allow deeper insight into the behaviours of our users, they feel too intrusive for our needs. They remain options if ever needed. But knowing page popularity, which sites are sending people our way, and screen sizes (so we can tailor page designs better) covers the basics we need. Why collect more?

Alternatives

The aim of this post isn't to insist that you code up your own analytics; it is to describe - in the open - what we have done. As we mentioned in the previous post, there are alternatives. Some have free, self-hosted options - for when you want full control over where the data goes - and others offer paid, cloud-based monthly subscriptions. We haven't used any of these ourselves so can't vouch for them, but they show that privacy-first self-hosted and paid alternatives exist.