According to Wikipedia, data anonymization is « the process of either encrypting or removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous ». At ContentSquare, we did not consider encryption, since nobody should be able to access personal data in any way, even in encrypted form. So our only option is to remove anything that could make a user identifiable.
By Adrien Mogenet, Data Engineering Manager at ContentSquare
Am I anonymous?
Making Internet users anonymous is definitely not an easy task, and we are working hard on it. To determine whether we are collecting personal data or not, we often refer to the definition given by the CNIL (National Commission on Informatics and Liberty, an independent French administrative regulatory body that ensures data privacy laws are applied to collected data). The CNIL's definition can be summarized simply: data must be considered personal if it offers any way of linking a piece of information to a real-life user, no matter how complex it might be to establish that link. The obvious examples are email addresses, user logins, or complete postal addresses, but if you dig deeper you discover that privacy sometimes extends further than that.
In fact, links between data and users can be transitive and harder to spot. For instance, if you store both a date of birth and a place of birth, this is considered personal data, since the two together can indirectly identify someone. Keep in mind that dealing with personal data is not prohibited! Many organizations genuinely need it for their business: ticket reservations, administrative bodies, etc. But in such scenarios, you must declare everything and use the data in a controlled fashion. In short, you should never handle personal data if you don't need to, and fortunately we don't need to link user experience information with real-life persons at ContentSquare!
The art of collecting data
The second one is specific to our user-experience-related activities: characters typed in web forms. For any textbox element in a web page, we would like to collect a lot of relevant UX data: typing speed; is <TAB> preferred over clicks to switch between elements?; has <backspace> been used several times? For obvious reasons, typed characters can be a source of personal information (first and last names, postal address, email address, phone number, etc.). So far we do not collect this data and only consider the time spent on each element.
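To make the "time spent on each element" idea concrete, here is a minimal sketch of how form events could be reduced to per-element durations without ever reading what was typed. The event shape (`ts`, `element`, `type`, `key`) and the function name are assumptions made for illustration; the article does not describe the actual event format or where this aggregation runs.

```python
from collections import defaultdict


def time_spent_per_element(events):
    """Sum focus-to-blur durations (in ms) for each form element.

    `events` is an iterable of dicts with hypothetical keys:
    "ts" (timestamp in ms), "element" (element identifier) and
    "type" ("focus", "blur", or anything else).
    Only these three keys are read, so typed characters never
    reach the output even if the input events carry them.
    """
    totals = defaultdict(int)
    focused_at = {}
    for event in events:
        element = event["element"]
        if event["type"] == "focus":
            focused_at[element] = event["ts"]
        elif event["type"] == "blur" and element in focused_at:
            totals[element] += event["ts"] - focused_at.pop(element)
        # Every other event type (key presses, etc.) is ignored on purpose.
    return dict(totals)


events = [
    {"ts": 0, "element": "email", "type": "focus"},
    {"ts": 100, "element": "email", "type": "keydown", "key": "a"},
    {"ts": 1500, "element": "email", "type": "blur"},
    {"ts": 1600, "element": "name", "type": "focus"},
    {"ts": 2600, "element": "name", "type": "blur"},
    {"ts": 3000, "element": "email", "type": "focus"},
    {"ts": 3500, "element": "email", "type": "blur"},
]
durations = time_spent_per_element(events)
```

In this sketch the `key` field of the `keydown` event is dropped simply because nothing reads it, which is the safest way to guarantee that it cannot leak downstream.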
Trust in a solid anonymization architecture
As you can imagine, this is not really an architecture dedicated to anonymization, but we have implemented anonymization at two different stages.
For those who don't already know it, RabbitMQ is a message broker, used here to transfer R1 messages from Kamino (the producers) to Mandalore (the consumers). Mandalore's responsibility is to convert R1 messages to LR1 messages (« Legal Raw 1 »), another internal format in which we consider that all personal information has been removed, so the files can be safely stored and accessed by everyone in the R&D team. This is the perfect stage to remove the IP address! But before removing it, another operation is performed: geolocation. Using a local, regularly updated database (no call to a third-party database), we add the country, the city, and deliberately imprecise coordinates to the original R1 message. The "L" states that we comply with a legal framework, while the "R" remains because it is important for us to work with the rawest possible data at this step (the full architecture will be described in future articles).
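The R1 to LR1 conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the field names (`ip`, `country`, `city`, `lat`, `lon`), the `geolocate` callable, and the one-decimal rounding of coordinates are all hypothetical, since the article does not describe the real message schemas or the precision Mandalore keeps.

```python
def r1_to_lr1(r1, geolocate, decimals=1):
    """Return an LR1-like copy of an R1-like message.

    `geolocate` is any callable mapping an IP string to a dict with
    "country", "city", "lat" and "lon", or to None when the IP is
    unknown. It stands in for the local geolocation database.
    """
    lr1 = dict(r1)             # work on a copy, leave the R1 message untouched
    ip = lr1.pop("ip", None)   # the IP address never reaches the LR1 message
    if ip is not None:
        geo = geolocate(ip)    # geolocation happens before the IP is discarded
        if geo is not None:
            lr1["country"] = geo["country"]
            lr1["city"] = geo["city"]
            # Round coordinates so they stay deliberately imprecise.
            lr1["lat"] = round(geo["lat"], decimals)
            lr1["lon"] = round(geo["lon"], decimals)
    return lr1


# A dict lookup plays the role of the local database in this example.
fake_db = {
    "203.0.113.7": {"country": "FR", "city": "Paris", "lat": 48.8566, "lon": 2.3522},
}
r1 = {"ip": "203.0.113.7", "url": "/checkout", "ts": 1}
lr1 = r1_to_lr1(r1, fake_db.get)
```

The ordering is the point of the sketch: the IP address is used once for the lookup and is never copied into the output, so anything stored downstream only carries the coarse location.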
This logic has not been implemented in Kamino, to avoid putting too much complexity in a rather critical service. Kamino needs to be 99.999% available, efficiently support fluctuating load, and return a 200 HTTP response as fast as possible. Splitting roles and responsibilities between two distinct components allows great flexibility, and there should be no issue adding new anonymization mechanisms to Mandalore if necessary.
After more than a year in production, this architecture and these format designs offer a very convenient way to implement anonymization at ContentSquare. As we collect more and more data from different sources (smartphones offer new kinds of contextual information!), we are constantly working on adding UX intelligence without compromising users' privacy!