Data-preserving anonymization

Challenge

Our customer has a dataset with detailed information about the emissions generated by everyone in the organization. The dataset contains valuable personal data, such as age, gender, and team, which the customer needs to analyze in order to identify and communicate effective policies for reducing emissions.

However, the dataset also needs to be anonymized to protect the privacy of employees, in compliance with the strict EU General Data Protection Regulation (GDPR).

Anonymize data without sacrificing important detail

The challenge is therefore to design and implement a data anonymization method that preserves as much detail in the data as possible.

Approach

To achieve robust anonymization, we need a method that is established both in the scientific literature and in practice.

Scientific literature research

Based on the system requirements, the goal is to create an anonymized dataset that the literature classifies as "privacy in non-interactive databases" and as "non-sensitive data".

Non-interactive database refers to the publication of anonymized data as a single public dataset. (The opposite is the "interactive" setting, where a protected database contains the original data and anonymization happens "live" at the moment a user requests parts of the data.)

Non-sensitive means that the data does not fall under the "special categories of personal data" defined in the EU General Data Protection Regulation (GDPR). Those categories cover data whose processing could create significant risks to the fundamental rights and freedoms of the persons concerned, such as health-related data.

Mondrian: anonymization for non-sensitive data

We settled on k-anonymization as a suitable anonymization method and, after further research, decided to use an algorithm called "Basic Mondrian", a robust algorithm that minimizes data loss. It does so by allowing us to define incremental levels of anonymization for each data attribute separately, and by grouping data points into groups of at least "k" records that are anonymized together, so that no record can be distinguished from the other records in its group. (The groups, if drawn onto a canvas, look similar to the well-known paintings by Piet Mondrian, hence the name.)
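The grouping step can be illustrated with a minimal sketch. It is not the implementation described in this case study: it assumes purely numeric attributes, ignores missing values and per-attribute anonymization levels, and the names mondrian_partition and generalize are hypothetical. It only shows the core idea of splitting the data at the median of one attribute at a time, as long as both halves keep at least k records.

```python
from typing import List, Tuple

Record = Tuple[int, ...]  # one numeric value per attribute (assumption)


def mondrian_partition(records: List[Record], k: int) -> List[List[Record]]:
    """Recursively split records so that every group keeps at least k members."""
    if len(records) < 2 * k:
        return [records]  # too small to split into two valid groups

    def spread(dim: int) -> int:
        values = [r[dim] for r in records]
        return max(values) - min(values)

    # Try the attribute with the widest value range first.
    for dim in sorted(range(len(records[0])), key=spread, reverse=True):
        ordered = sorted(records, key=lambda r: r[dim])
        median = ordered[len(ordered) // 2][dim]
        left = [r for r in ordered if r[dim] < median]
        right = [r for r in ordered if r[dim] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian_partition(left, k) + mondrian_partition(right, k)

    return [records]  # no attribute allows a valid split


def generalize(group: List[Record]) -> List[Tuple[Tuple[int, int], ...]]:
    """Replace every value by the (min, max) range covering the whole group."""
    dims = range(len(group[0]))
    ranges = tuple(
        (min(r[d] for r in group), max(r[d] for r in group)) for d in dims
    )
    return [ranges for _ in group]
```

Calling mondrian_partition on a list of (age, team id) tuples with k=2, for example, returns groups of at least two records each; generalize then gives all records in a group the same value ranges, which makes them indistinguishable from one another on those attributes.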

Implementing the anonymization software

A search for existing Mondrian implementations yielded no practical results. The one open-source implementation we found had key weaknesses, such as not being able to handle missing data points. We therefore wrote our own implementation. In addition, we applied a few standard anonymization techniques, such as moving flight dates by a randomized number of days.
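As an illustration of the date-shifting technique, here is a minimal sketch. The function name shift_date, the symmetric range of plus or minus max_days, and the choice to pass missing dates through unchanged are all assumptions made for this example; the case study does not specify these details.

```python
import random
from datetime import date, timedelta
from typing import Optional


def shift_date(
    original: Optional[date], max_days: int, rng: random.Random
) -> Optional[date]:
    """Move a date by a random offset between -max_days and +max_days.

    Missing dates (None) are returned unchanged so that incomplete
    records do not break the pipeline (an assumption for this sketch).
    """
    if original is None:
        return None
    offset = rng.randint(-max_days, max_days)  # both bounds inclusive
    return original + timedelta(days=offset)


# Example: shift a hypothetical flight date by at most one week.
rng = random.Random()
shifted = shift_date(date(2023, 5, 10), max_days=7, rng=rng)
```

Passing the random generator in as a parameter keeps the function easy to test with a fixed seed, while production use would rely on an unseeded or cryptographically stronger source of randomness.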

Result

The result is an anonymization algorithm that can easily be added to existing data analysis software.

Dataset anonymization is fully automated
Every data attribute has an incremental anonymization approach
Relative priorities can be set to determine which data attributes should keep as much detail as possible
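The idea of incremental anonymization levels and relative priorities can be sketched as follows. Everything in this example is hypothetical: the attribute names, the level definitions (age bands, a "Department/Team" naming scheme), the priority weights, and the simple cost rule are assumptions chosen for illustration, not the configuration used in the project.

```python
from typing import Any, Callable, Dict, List, Optional


def age_band(width: int) -> Callable[[int], str]:
    """Return a function that maps an age to a band such as '35-39'."""
    def to_band(age: int) -> str:
        low = (age // width) * width
        return f"{low}-{low + width - 1}"
    return to_band


# Level 0 keeps the exact value; each further level removes more detail.
LEVELS: Dict[str, List[Callable[[Any], str]]] = {
    "age": [str, age_band(5), age_band(10), lambda _: "*"],
    "team": [str, lambda t: t.split("/")[0], lambda _: "*"],
}

# Higher priority means: keep detail for this attribute longer.
PRIORITY: Dict[str, float] = {"age": 2.0, "team": 1.0}


def generalize_record(record: Dict[str, Any], levels: Dict[str, int]) -> Dict[str, str]:
    """Apply the chosen anonymization level to every attribute."""
    return {name: LEVELS[name][levels[name]](value) for name, value in record.items()}


def next_attribute_to_generalize(levels: Dict[str, int]) -> Optional[str]:
    """Pick the attribute whose next level costs the least, weighted by priority."""
    candidates = [a for a in LEVELS if levels[a] < len(LEVELS[a]) - 1]
    if not candidates:
        return None  # every attribute is already fully suppressed
    return min(candidates, key=lambda a: (levels[a] + 1) * PRIORITY[a])
```

With these assumed settings, the lower-priority attribute "team" is generalized before "age", so age keeps its exact values for longer when the data has to be coarsened to reach the required group size.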

We have written a whitepaper that details the anonymization process, including the privacy attack scenarios it averts. The whitepaper helped obtain approval from the customer's Privacy Office.

We are planning to adapt the algorithm's codebase so that it can be published as open-source software. We expect that exposing its functionality to public review will add a further layer of security and trust.