Data anonymization: a crucial issue

According to Wikipedia, Data Anonymization is « the process of either encrypting or removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous ». 

At ContentSquare, we did not consider encryption, since nobody should be able to access personal data in any way, even in encrypted form. We therefore simply remove anything that could make users identifiable.

By Adrien Mogenet, Data Engineering Manager at ContentSquare

Am I anonymous?

Making Internet users anonymous is definitely not an easy task, and we’re working hard on it. To determine whether or not we’re collecting personal data, we often refer to the definition given by the CNIL (National Commission on Informatics and Liberty, an independent French administrative regulatory body that ensures data privacy laws are applied to collected data). The CNIL’s definition can be summarized simply: data must be considered personal if it offers any way of linking a piece of information to a real-life user, no matter how complex it might be to establish that link. Email addresses, user logins, and full postal addresses obviously come to mind first, but if you dig deeper you discover that personal data is sometimes broader than that.


In fact, links between data and users can be transitive and harder to imagine. For instance, if you store both a birthdate and a place of birth, the pair is considered personal data, since together they can indirectly identify someone. Keep in mind, though, that dealing with personal data is not prohibited! Many organizations genuinely need it for their business: ticket reservations, administrative bodies, etc. But in such scenarios, you must declare everything and use the data in a controlled fashion. In short, you should never handle personal data if you don’t need to, and fortunately we don’t need to link user experience information to real-life persons at ContentSquare!
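To make the indirect-identification point concrete, here is a minimal Python sketch. The records are invented purely for illustration, and the helper names (`group_sizes`, `unique_combinations`) are hypothetical. It counts how many records share each combination of fields: a combination that appears only once points to a single person, even though neither field does so on its own in this toy data set.

```python
from collections import Counter

# Invented records for illustration only; none of them contains a name or an email.
records = [
    {"birthdate": "1980-03-12", "birthplace": "Lyon"},
    {"birthdate": "1980-03-12", "birthplace": "Paris"},
    {"birthdate": "1975-11-02", "birthplace": "Lyon"},
    {"birthdate": "1975-11-02", "birthplace": "Paris"},
    {"birthdate": "1975-11-02", "birthplace": "Paris"},
]


def group_sizes(rows, fields):
    """Count how many rows share each combination of values for `fields`."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def unique_combinations(rows, fields):
    """Return the combinations of `fields` that match exactly one row."""
    return [combo for combo, n in group_sizes(rows, fields).items() if n == 1]


# Each field alone always matches at least two records in this toy data set...
alone_birthdate = unique_combinations(records, ["birthdate"])
alone_birthplace = unique_combinations(records, ["birthplace"])
# ...but the pair of fields singles out several records.
combined = unique_combinations(records, ["birthdate", "birthplace"])

print(alone_birthdate, alone_birthplace, combined)
```

With real populations the effect depends on how many people share each value, but the principle is the same: fields have to be assessed together, not one by one.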

The art of collecting data

For our activities, our JavaScript tracker collects and sends only technical data: mouse clicks, moves, scrolls, time spent hovering over a button or hesitating within a menu, etc. Nothing personal in this. However, there are two notable kinds of data we need to anonymize. The first is common to every SaaS vendor: IP addresses. Even if you probably can’t tell whether 78.124.54.19 belongs to Donald Knuth or Luke Skywalker, anyone with access to ISP databases, or anyone who files a proper request for investigation, can easily determine who held this IP address at any given time. That’s why IP addresses must be processed as personal data.

The second is specific to our user-experience-related activities: characters typed in web forms. For any textbox element in a webpage, we’d like to collect a lot of relevant UX data: typing speed; is <TAB> preferred over clicks to switch between elements?; has <backspace> been used several times? For obvious reasons, typed input can be a source of personal information (first and last names, postal address, email address, phone number, etc.). So far we don’t collect this data, and only consider the time spent on each element.
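As an illustration of what "only the time spent on each element" can look like, here is a minimal Python sketch. It is not our tracker (which is written in JavaScript); the `FieldEvent` shape and the `time_spent_per_element` function are hypothetical. The point is that focus and blur timestamps are enough to compute durations, so no typed character ever needs to be recorded.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEvent:
    """Hypothetical event: which element, focus or blur, and when (in ms)."""
    element_id: str
    kind: str  # "focus" or "blur"
    timestamp_ms: int


def time_spent_per_element(events):
    """Sum focus-to-blur durations per element; typed content is never involved."""
    focused_at = {}
    totals = defaultdict(int)
    for event in sorted(events, key=lambda e: e.timestamp_ms):
        if event.kind == "focus":
            focused_at[event.element_id] = event.timestamp_ms
        elif event.kind == "blur" and event.element_id in focused_at:
            started = focused_at.pop(event.element_id)
            totals[event.element_id] += event.timestamp_ms - started
    return dict(totals)


events = [
    FieldEvent("email", "focus", 1_000),
    FieldEvent("email", "blur", 4_500),
    FieldEvent("phone", "focus", 4_600),
    FieldEvent("phone", "blur", 6_000),
    FieldEvent("email", "focus", 6_100),
    FieldEvent("email", "blur", 7_100),
]
durations = time_spent_per_element(events)
print(durations)
```

A blur without a matching focus is ignored in this sketch, which keeps the totals conservative when events are lost.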

Trust in a solid anonymization architecture

As you can imagine, our architecture is not really dedicated to anonymization, but anonymization is implemented at two different stages of it.

 

Fig. 1. Kamino and Mandalore are both developed in-house (Scala/Akka), with names inspired by the Star Wars universe.

 

The first stage is the JavaScript tracker itself. A good place to start implementing privacy concepts! We track only relevant DOM events. As previously mentioned, you’ll find only technical artifacts (cookies for user/session detection) and technical events (browsed URI, mouse clicks, and so on). Feel free to open the developer console in your favorite browser to check this! You’ll probably notice that the whole HTML content is sent as well, so be aware that ContentSquare subscribers are invited to exclude any potentially private areas from this particular tracking (e.g. logged-in user information in the top-right corner).

OK, so no private data is explicitly sent by the JavaScript tracker, but what about the IP address? IP addresses are inevitably used to establish connections between the user’s browser and the ContentSquare collecting service (« Kamino »), but we aim to remove this private information as soon as possible. As this collecting service is deployed behind a load balancer, the client IP address arrives in the X-Forwarded-For HTTP header. At this stage, Kamino does nothing with it: it is a very straightforward service that only collects HTTP requests and encapsulates them in an internal format, R1 (for « Raw 1 »), which is processed further down the pipeline.
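To give an idea of this collection step, here is a minimal Python sketch (Kamino itself is a Scala/Akka service, and the actual R1 format is not described in this article, so every field name below is an assumption). It reads the leftmost entry of the X-Forwarded-For header, which by convention is the originating client, and wraps the request untouched in an R1-like envelope.

```python
import json
import time


def client_ip_from_headers(headers):
    """Return the leftmost X-Forwarded-For entry, or None if the header is absent."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    forwarded = normalised.get("x-forwarded-for", "")
    first = forwarded.split(",")[0].strip()
    return first or None


def to_r1(headers, body, received_at=None):
    """Wrap a request in a hypothetical R1-like envelope without transforming it."""
    return {
        "received_at": time.time() if received_at is None else received_at,
        "ip": client_ip_from_headers(headers),  # still personal data at this stage
        "headers": dict(headers),
        "body": body,
    }


r1 = to_r1(
    {"X-Forwarded-For": "78.124.54.19, 10.0.0.7"},
    '{"event": "click"}',
    received_at=0,
)
print(json.dumps(r1))
```

Note that a client can set this header itself, so the leftmost entry is only trustworthy if the load balancer overwrites or sanitises it; the sketch assumes it does.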

For those who don’t already know it, RabbitMQ is a message broker, used here to transfer R1 messages from Kamino (the producer) to Mandalore (the consumer). Mandalore’s responsibility is to convert R1 messages to LR1 messages (« Legal Raw 1 »), another internal format in which we consider that all personal information has been removed, so that files can be safely stored and accessed by everyone in the R&D team. This is the perfect stage to remove the IP address! But before removing it, another operation is performed: geolocation. Using a local, regularly updated database (no call to a third-party service), we add the country, city, and not-too-accurate coordinates to the original R1 message. The “L” can then be added to state that we respect the legal framework, but the “R” remains, as it’s important for us to work with the rawest possible data at this step (the full architecture will be described in future articles).
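The R1 to LR1 conversion can be sketched as follows, in Python rather than the Scala/Akka used by Mandalore. Everything here is an assumption made for illustration: the message fields, the `GEO_DB` dictionary standing in for the local geolocation database (with a made-up entry), and the choice of rounding coordinates to one decimal to keep them "not too accurate". What matters is the order: geolocate first, then drop every trace of the IP address.

```python
import ipaddress

# Hypothetical stand-in for the local geolocation database (made-up entry).
GEO_DB = {
    ipaddress.ip_network("78.124.0.0/16"): {
        "country": "FR", "city": "Paris", "lat": 48.8566, "lon": 2.3522,
    },
}


def geolocate(ip):
    """Look the address up locally; return a coarse location, or None if unknown."""
    address = ipaddress.ip_address(ip)
    for network, place in GEO_DB.items():
        if address in network:
            return {
                "country": place["country"],
                "city": place["city"],
                # Rounding to one decimal is an assumption to keep coordinates coarse.
                "lat": round(place["lat"], 1),
                "lon": round(place["lon"], 1),
            }
    return None


def to_lr1(r1):
    """Build an LR1-like message: add geolocation first, then drop the IP address."""
    lr1 = {key: value for key, value in r1.items() if key not in ("ip", "headers")}
    lr1["geo"] = geolocate(r1["ip"]) if r1.get("ip") else None
    # Keep the other headers, but never the one carrying the client address.
    lr1["headers"] = {
        name: value
        for name, value in r1.get("headers", {}).items()
        if name.lower() != "x-forwarded-for"
    }
    return lr1


r1 = {
    "ip": "78.124.54.19",
    "headers": {"X-Forwarded-For": "78.124.54.19", "User-Agent": "demo"},
    "body": '{"event": "click"}',
}
lr1 = to_lr1(r1)
print(lr1)
```

The IP address has to be removed from the forwarded header as well as from any dedicated field, otherwise it would survive in the LR1 message.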

This logic has not been implemented in Kamino, to avoid putting too much complexity in a rather critical service. Kamino needs to be 99.999% available, efficiently support fluctuating load, and return a 200 HTTP response as fast as possible. Splitting roles and responsibilities between two distinct components gives us great flexibility, and there should be no issue adding new anonymization mechanisms to Mandalore if necessary.

Conclusion

After more than a year in production, this architecture and these format designs offer a very convenient way to implement anonymization at ContentSquare. As we’re collecting more and more data from different sources (smartphones offer new kinds of contextual information!), we’re constantly working on adding UX intelligence without compromising users’ privacy!

Links

Photo by Igor Ovsyannykov

 
