GDPR pseudonymization techniques

This article builds on top of GDPR in Plain English.

Your personal data is any information that can be used to directly or indirectly identify you. For example: your name, home address, photo, email address, bank details, posts on social networking websites, medical information, or a computer or mobile IP address.

An IP address is considered personal data, just like an email address, phone number, or social security number.

Before we start, a quick disclaimer: I don’t represent my current or previous employers on my personal blog. The information provided here is based purely on my own research. I am lucky to work at an international European company that takes GDPR seriously, and I had the benefit of being able to ask a privacy lawyer questions, but this post doesn’t necessarily reflect my company’s policies, strategy or implementation of GDPR.

According to the GDPR, personal data should be pseudonymized in such a way that it can no longer be linked (or “attributed”) to a single data subject (user) without the use of additional data.

This additional data should be kept separate from the pseudonymized data and subject to technical and organisational measures to make it hard to link a piece of data to someone’s identity (non-attribution).

In the pseudonymization process, the full data is split into identifiable information and non-identifiable information. The identifiable information is subject to stricter access control and audits.
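To make the split concrete, here is a minimal sketch in Python. The field names, the `IDENTIFYING_FIELDS` set, and the `pseudonymize`/`reidentify` helpers are all my own assumptions for illustration, and the `identity_store` dictionary stands in for what would really be a separate system with its own access control.

```python
import secrets

# Which fields count as identifying is an assumption made for this example.
IDENTIFYING_FIELDS = {"name", "email", "ip_address"}


def pseudonymize(record, identity_store):
    """Split a record into an identifying part and a non-identifying part.

    The identifying part is stored in identity_store under a random pseudonym.
    The returned record keeps only the pseudonym and the remaining fields.
    """
    pseudonym = secrets.token_hex(8)
    identity_store[pseudonym] = {
        k: v for k, v in record.items() if k in IDENTIFYING_FIELDS
    }
    pseudonymized = {
        k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS
    }
    pseudonymized["pseudonym"] = pseudonym
    return pseudonymized


def reidentify(pseudonymized, identity_store):
    """Re-attach the identity. Only possible with access to identity_store."""
    identity = identity_store[pseudonymized["pseudonym"]]
    rest = {k: v for k, v in pseudonymized.items() if k != "pseudonym"}
    return {**identity, **rest}


identity_store = {}  # hypothetical stand-in for a separately secured store
record = {
    "name": "Alice Example",
    "email": "alice@example.com",
    "ip_address": "203.0.113.7",
    "country": "SE",
    "plan": "premium",
}
safe = pseudonymize(record, identity_store)
print(safe)  # contains country, plan and a random pseudonym, but no identity
```

Anyone holding only `safe` sees a pseudonym plus non-identifying fields. Linking it back to a person requires the additional data in `identity_store`.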

Anonymization vs pseudonymization

Pseudonymous data can still go through re-identification to link (attribute) it to an individual again.

Pseudonymized data can be attributed when the identity is added to the data

Anonymous data, by contrast, cannot be re-identified.

When we can no longer associate an identity with the data, it is anonymized

Pseudonymization techniques

There are many ways to pseudonymize data, and the right choice depends on the privacy impact assessment.

  • Scrambling techniques involve a mixing or obfuscation of letters. The process can sometimes be reversible. For example: “Ewerlöf” could become “Ölfeewr”
  • Encryption renders the original data unintelligible, and the process cannot be reversed without access to the correct decryption key. The GDPR requires the additional information (such as the decryption key) to be kept separately from the pseudonymized data.
  • A masking technique allows an important/unique part of the data to be hidden with random characters or other data. For example: “5500 0000 0000 0004” credit card number can be stored as “XXXX XXXX XXXX 0004”. The advantage of masking is that a record can still be recognized (here, by its last four digits) without exposing the full value.
  • Tokenization is a non-mathematical approach to protecting data at rest that replaces sensitive data with non-sensitive substitutes, referred to as tokens. The tokens have no extrinsic or exploitable meaning or value. Tokenization does not alter the type or length of data, which means it can be processed by legacy systems such as databases that may be sensitive to data length and type. That is achieved by keeping specific data fully or partially visible for processing and analytics while sensitive information is kept hidden.
  • Data blurring uses an approximation of data values to render their meaning obsolete and/or make it impossible to identify individuals. A good example is a typical blurred face in an image (though it’s not a very good idea).
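Here is a rough sketch of four of these techniques (scrambling, masking, tokenization and blurring) in Python, using only the standard library. All function and class names are hypothetical, and the in-memory `TokenVault` is a simplification: a real token vault would be a separate, access-controlled service.

```python
import random
import secrets
import string


def scramble(text, seed):
    """Scrambling: shuffle the letters. Reversible by anyone who knows the seed."""
    letters = list(text)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


def mask_card(number, visible=4):
    """Masking: hide every digit except the last `visible` ones."""
    total_digits = sum(ch.isdigit() for ch in number)
    seen = 0
    out = []
    for ch in number:
        if ch.isdigit():
            seen += 1
            out.append(ch if seen > total_digits - visible else "X")
        else:
            out.append(ch)  # keep spaces and separators as they are
    return "".join(out)


class TokenVault:
    """Tokenization: swap a value for a random token of the same shape."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        if not any(ch.isdigit() for ch in value):
            raise ValueError("this sketch only tokenizes values containing digits")
        if value in self._value_to_token:
            return self._value_to_token[value]
        while True:
            # Replace each digit with a random digit; length and type are kept.
            token = "".join(
                secrets.choice(string.digits) if ch.isdigit() else ch
                for ch in value
            )
            if token != value and token not in self._token_to_value:
                break
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        return self._token_to_value[token]


def blur_age(age, bucket=10):
    """Blurring: replace an exact value with a coarse range."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"


print(mask_card("5500 0000 0000 0004"))  # XXXX XXXX XXXX 0004
print(blur_age(34))                      # 30-39
```

The token has no mathematical relationship to the original value. The only way back is the lookup table inside the vault, which is why the vault itself must be protected like the "additional data" described above.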

Hashing vs Encryption

Hashing is a one-way transformation of data into an unreadable piece of data (the hash value). A super simple hashing algorithm, for example, computes the hash value by adding up the ASCII values of the characters in a string.

Given only a hash value, you cannot compute the original data from it
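The toy algorithm described above fits in a few lines of Python. I have added two helpers of my own for comparison: `sha256_hex`, which uses a real hash function from the standard library, and `keyed_hash_hex`, which mixes in a secret key via HMAC. The helper names and the keyed variant are my additions, not something the GDPR prescribes.

```python
import hashlib
import hmac


def toy_hash(text):
    """Toy hash: the sum of the character codes in the string."""
    return sum(ord(ch) for ch in text)


def sha256_hex(text):
    """A real cryptographic hash from the standard library."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def keyed_hash_hex(text, secret_key):
    """Keyed hash (HMAC-SHA256): the result depends on a secret key as well."""
    return hmac.new(secret_key, text.encode("utf-8"), hashlib.sha256).hexdigest()


print(toy_hash("abc"))  # 294 (97 + 98 + 99)
print(toy_hash("cba"))  # 294 as well: different input, same hash value
```

The toy hash shows the one-way idea (you cannot tell whether 294 came from "abc" or "cba"), but it collides far too easily to be useful. One caveat for real systems: values like email addresses or IP addresses come from a small, guessable set, so an attacker can hash candidates and compare. A keyed hash makes that harder as long as the key is stored separately from the data.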

As opposed to hashing, encryption is a two-way transformation: with a key, the data can be decrypted and transformed back to its original form. A super simple encryption algorithm, for example, adds a fixed number (the key) to the ASCII value of each character in a string.

Data is encrypted using a key

Encryption is a reversible action and its reverse is called decryption.

The key can be used to decrypt the data to its original form.
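Here is that toy encryption and its reverse as a small Python sketch. The function names are my own, and this shift cipher is for illustration only: it is trivially breakable and should never protect real data.

```python
def toy_encrypt(text, key):
    """Toy encryption: shift every character code up by `key`."""
    return "".join(chr(ord(ch) + key) for ch in text)


def toy_decrypt(ciphertext, key):
    """Toy decryption: shift every character code back down by `key`."""
    return "".join(chr(ord(ch) - key) for ch in ciphertext)


secret = toy_encrypt("abc", 3)
print(secret)                  # def
print(toy_decrypt(secret, 3))  # abc
print(toy_decrypt(secret, 1))  # wrong key, so not the original text
```

The same number is used to encrypt and to decrypt, so whoever holds the key can always recover the original. That is exactly why the key has to be kept away from the pseudonymized data.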

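If the same key is used to encrypt and decrypt the data, it is called “symmetric encryption”. It is not unsafe in itself, but everyone who needs to decrypt must hold the same secret key, so the key has to be shared and stored carefully. If a different key is used for decryption than for encryption (a public/private key pair), it is called “asymmetric encryption”.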

Event sourcing

In an Event Sourcing database, instead of CRUD operations you have CRAB (create, read, add, burn). Essentially you don’t delete or modify anything (similar to blockchain).

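GDPR compliance for event sourcing databases is straightforward using crypto-shredding (also called crypto-trashing): each data subject’s personal data is encrypted with its own key before it is stored, and the key is kept separately. When we throw away the key, we effectively make it impossible to go back to the original data. The data is then effectively anonymized and falls outside the GDPR.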

There is no feasible way to get back to the original data; it is now just random data.
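A minimal sketch of the idea in Python, under loud assumptions: `EventStore`, `forget` and `_keystream_xor` are hypothetical names, everything lives in memory, and the cipher is a toy built from SHA-256 so that the example runs with the standard library alone. A real system would use a vetted authenticated cipher from a maintained cryptography library and keep the keys in a separate, access-controlled store.

```python
import hashlib
import secrets


def _keystream_xor(key, nonce, data):
    """Toy stream cipher: XOR data with SHA-256(key + nonce + counter) blocks.

    Applying it twice with the same key and nonce returns the original data.
    Illustration only, not a substitute for a real cipher.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class EventStore:
    def __init__(self):
        self.events = []  # append-only log of (user_id, nonce, ciphertext)
        self.keys = {}    # user_id -> key; a separate key store in practice

    def append(self, user_id, payload):
        """Encrypt the payload with the user's own key and append it."""
        key = self.keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        ciphertext = _keystream_xor(key, nonce, payload.encode("utf-8"))
        self.events.append((user_id, nonce, ciphertext))

    def read(self, user_id):
        """Decrypt the user's events, or return nothing if the key is gone."""
        key = self.keys.get(user_id)
        if key is None:
            return []
        return [
            _keystream_xor(key, nonce, ciphertext).decode("utf-8")
            for uid, nonce, ciphertext in self.events
            if uid == user_id
        ]

    def forget(self, user_id):
        """Crypto-shredding: delete the key and leave the event log untouched."""
        self.keys.pop(user_id, None)


store = EventStore()
store.append("user-1", "email changed to alice@example.com")
store.append("user-2", "plan upgraded to premium")
print(store.read("user-1"))  # the decrypted event
store.forget("user-1")
print(store.read("user-1"))  # [] because the key no longer exists
print(len(store.events))     # 2: nothing was deleted from the log
```

The event log stays append-only, which is the whole point of event sourcing. The only thing that is ever deleted is the key.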

The main problem with implementing GDPR compliance for event sourcing arises when it is an afterthought: it can get really ugly if it is not foreseen from the start.

If you want to learn more about GDPR implications for Blockchain technology, I got you covered. Make sure to follow me if you want to stay up to date with the latest essays.

Written by

Knowledge Worker, MSc Systems Engineering, Tech Lead, Web Developer
