28-09-2016 By: Christoph Balduck

Is big data processing the silver bullet to become compliant with the GDPR?

Although the GDPR took several years to be finalized and approved, the new European data privacy and data protection regulation is now gaining traction and is appearing on the agenda of an increasing number of companies and organizations.

Most companies have therefore started their search for a data protection officer (DPO), and some have added a CISO (chief information security officer) role as well. The DPO is preferably a legal professional with IT and data/information management knowledge, or vice versa, and should report to a C-level executive, the board or an equivalent body.

The implications of the GDPR, however, are both wide and deep. They require DPOs to plan and coordinate not only the extensive legal and organizational efforts, but also the widespread architectural and technical requirements described in the 200+ page legal text of the GDPR.
Besides a number of challenging topics such as privacy by design and privacy by default (which are starting to become nightmares for large “spaghetti-integrated” organizations), there are requirements such as the right to be forgotten, the right to access and obtain your personal data and the right to have your personal data rectified.

This last set of requirements, along with the data portability requirement, requires organizations to get full control over all the personal data they hold (even if it is stored away in a paper document, a picture, a video or audio file, an Excel or Word file or an unused legacy solution). Determining the scope of the personal data an organization obtains and processes can therefore be a lengthy discovery exercise.
As personal data includes all data that can be used to identify a data subject, it often comes down to scanning hundreds or thousands of official solutions, and a multiple of that number when it comes to unofficial data artifacts.

Once discovered, that personal data must be of good quality, consistent, traceable and governed, and it must be linked to a specific data subject. Furthermore, the storage and processing of personal data must adhere to the principles of privacy by design and by default, and an organization must be able to provide any data subject who asks for it with all of his/her personal data within one month (extendable by two further months for complex or numerous requests).

As if this were not enough, organizations must also be able to provide the details of the consent or other legal grounds (such as legitimate interest, public interest, …) for processing personal data.
The requirements above are only part of the story. There is also the need to control the end-to-end personal data lifecycle, to extend the (project/change) methodology with DPIAs (data protection impact assessments), to provide extensive training and awareness, to align with processors and controllers, and to put extensive procedures and measures in place around personal data breaches.

Questions asked by Data privacy & protection, IT and data executives

We can either give up and move to an island where no personal data of EU citizens is stored or processed, or we can determine what we can reuse of our existing business and IT capabilities and look out for new accelerating technologies that help us become compliant. Complying with the requirements of the data subject rights (data access, rectification, deletion, portability, …) is one of the hardest challenges IT and data & information management teams will face in the coming months and years (the deadline is 25/05/2018).

Some of the main questions that will be asked by DPOs, CIOs and CDOs include:

  • How do we get to know the scope of our personal data without having to run through all solutions extensively?
  • How do we get insight into our personal data (discovery)?
  • How do we extract personal data in a consistent manner and how do we make sure we can provide the data subject with his/her high quality personal data?
  • How do we rectify or delete personal data without impacting business processes or causing technical disturbances?

In order to cope with the requirements above, a reference architecture was created, which will be explained in depth in the Adept Events workshops in 2016 and 2017 (“Datamanagement en de EU Privacy en Databescherming wetgeving”, i.e. data management and the EU privacy and data protection legislation). This reference architecture includes the typical data management components such as data quality, metadata management, business glossary etc. and positions Master Data Management (MDM) in such a way that the requirements can be met to a large extent.

Can Big data processing bring the solution?

MDM solutions have evolved extensively over time, but are starting to be challenged (in some cases) by big data processing solutions. Especially when it comes to the creation of “a single view”, these big data processing platforms/solutions allow an organization to connect a wide variety of data sources.
In most cases the “big” data is gathered into either a NoSQL database or a Hadoop store, and insight is provided by means of an appropriate data processing platform with specific tooling on top.

We can see vendors such as SAS, Cloudera and Hortonworks (Hadoop), but also MongoDB and others (NoSQL), expanding their offering and positioning themselves as a fast alternative for providing insight into the data of complex landscapes. The vendors and platforms described require additional technology and coding on top (Spark, R, Hive, Pig, …), but all in all we’ve come across cases where single views were created in a matter of months.

What if we could use this big data technology to dump all of the data from the various sources and solutions (in which we believe personal data might be stored) and quickly gather insights on the use and storage of personal data? Would we be able to answer all of the questions above?

• How do we get to know the scope of our personal data without having to run through all solutions extensively & how do we get insight into our personal data (discovery)?
Dumping all data into a big data solution will provide you with insights on the width and depth of personal data.
Con: Low data quality, a lack of information governance and/or an unmanaged, high semantic variety in the data will reduce the quality of your insights.
E.g.: you might end up with more data subjects than you actually hold personal information about, you might tag data as personal data while it is not (and vice versa), or you could link personal data to the wrong data subjects and risk a data breach when providing that data.
Prereq.: A business analyst will be required to interpret the outcomes of the insights and validate them. Data preparation is preferably performed on the data sources before ingesting them.
The higher your data & information management maturity, the more value big data processing can bring in becoming compliant with the GDPR.
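As an illustration of the discovery step and of the tagging risk described above, here is a minimal Python sketch that scans a dumped data set for fields that look like personal data. The two patterns, the `scan_fields` helper, the 50% threshold and the sample records are all assumptions made for this example; they do not come from any particular big data product, and real discovery would need far richer rules.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately naive patterns (assumption for illustration only).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\s\-/().]{8,}$"),
}

def scan_fields(records, threshold=0.5):
    """Tag a field when at least `threshold` of its non-empty values match a pattern."""
    hits = defaultdict(lambda: defaultdict(int))  # field -> label -> match count
    totals = defaultdict(int)                     # field -> non-empty value count
    for record in records:
        for field, value in record.items():
            text = "" if value is None else str(value).strip()
            if not text:
                continue
            totals[field] += 1
            for label, pattern in PATTERNS.items():
                if pattern.match(text):
                    hits[field][label] += 1
    findings = {}
    for field, total in totals.items():
        for label, count in hits[field].items():
            if count / total >= threshold:
                findings[field] = label
    return findings

# Made-up sample standing in for one dumped source.
crm_dump = [
    {"contact": "an.peeters@example.com", "tel": "+32 2 555 01 23", "order_ref": "2016-000123"},
    {"contact": "jan.janssens@example.com", "tel": "02/555 01 99", "order_ref": "2016-000124"},
    {"contact": "", "tel": "n/a", "order_ref": "2016-000125"},
]

findings = scan_fields(crm_dump)
print(findings)
```

With this sample, `order_ref` is tagged as a phone number even though it is an order reference. That is exactly the kind of false positive a business analyst has to catch before the results are used.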

• How do we extract personal data in a consistent manner and how do we make sure we can provide the data subject with his/her high quality personal data?
Unless your organization has a very high maturity in terms of data quality, data & information governance and data & information management as a whole, a big data solution will probably not answer this question sufficiently (this holds for the majority of organizations).
Con: We sometimes hear the statement “If we put in enough data, we don’t need to worry about the quality of the source data; the volume makes sure the quality of the data/data lake will eventually be good”. This is not (always) the case, and it is a risk you cannot afford to take in light of the GDPR or any other legal/compliance requirement.
Prereq.: High data & information management maturity.
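To make the quality concern concrete, the sketch below consolidates what several sources hold about one data subject and separates agreed values from conflicting ones. It assumes, optimistically, that every source already carries the same `subject_id` key, which is precisely the matching problem MDM addresses. The function name, field names and sample data are hypothetical.

```python
from collections import defaultdict

def build_subject_export(subject_id, sources):
    """Collect the values held per attribute for one data subject across sources.

    Returns (export, conflicts): attributes on which all sources agree,
    and attributes whose values differ between sources.
    """
    values = defaultdict(dict)  # attribute -> {source name: value}
    for source_name, records in sources.items():
        for record in records:
            if record.get("subject_id") != subject_id:
                continue
            for attribute, value in record.items():
                if attribute != "subject_id" and value not in (None, ""):
                    values[attribute][source_name] = value
    export, conflicts = {}, {}
    for attribute, by_source in values.items():
        distinct = set(by_source.values())
        if len(distinct) == 1:
            export[attribute] = distinct.pop()
        else:
            conflicts[attribute] = dict(by_source)
    return export, conflicts

# Made-up sample: two sources that disagree on the spelling of a city.
sources = {
    "crm": [{"subject_id": "S-1", "email": "an@example.com", "city": "Gent"}],
    "billing": [{"subject_id": "S-1", "email": "an@example.com", "city": "Ghent"}],
}

export, conflicts = build_subject_export("S-1", sources)
print(export)
print(conflicts)
```

Adding more sources does not resolve the `city` conflict; it only adds more candidate values. Someone (or some governed rule) still has to decide which value is correct before the data is handed to the data subject.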

• How do we rectify or delete personal data without impacting business processes or causing technical disturbances?
Big data processing can support this heavy process of rectification or deletion by monitoring/discovering the use of outdated or incorrect personal data and by providing proof to DPOs and auditors that personal data was rectified or deleted.
Con: Integration towards the data lake or big data store must be complete (all sources must be ingested). Otherwise, incorrect personal data might stay hidden in the landscape.
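The completeness caveat can be checked in a simple way, sketched below: compare the inventory of known sources with the sources that actually reached the data lake, and look for records that still reference a data subject after a deletion request. The inventory, the `source` and `subject_id` fields and both helper functions are assumptions made for this example.

```python
def ingestion_gaps(inventory, ingested):
    """Sources listed in the inventory that never reached the data lake."""
    return sorted(set(inventory) - set(ingested))

def residual_occurrences(subject_id, lake_records):
    """Names of ingested sources that still hold records for the subject."""
    return sorted({r["source"] for r in lake_records if r.get("subject_id") == subject_id})

# Made-up inventory and data lake content.
inventory = ["crm", "billing", "legacy_hr", "marketing_xls"]
lake_records = [
    {"source": "crm", "subject_id": "S-2", "email": "jan@example.com"},
    {"source": "billing", "subject_id": "S-1", "email": "an@example.com"},
    {"source": "legacy_hr", "subject_id": "S-3", "email": "els@example.com"},
]
ingested = {r["source"] for r in lake_records}

gaps = ingestion_gaps(inventory, ingested)
leftovers = residual_occurrences("S-1", lake_records)
print(gaps, leftovers)
```

Here subject `S-1` was deleted from `crm` but is still present in `billing`, and `marketing_xls` was never ingested at all, so nothing can be proven about that source either way.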

Big data solutions can help in the discovery and in the monitoring/controlling/auditing of GDPR compliance, but most organizations will still need to increase their data and information management capabilities fast in order to use these big data solutions as an accelerator and to become compliant overall.
Last but not least: remember that the data in “traditional” data lakes can also be categorized as personal data and must also comply with the regulation.

If you want to know more about the GDPR and becoming compliant by means of (re-)using data and information management capabilities, feel free to subscribe to the Adept Events website.

Christoph Balduck

Christoph Balduck has worked in IT since 2001 and in the field of information management since 2007. Christoph initially held a wide range of technical and functional roles in SAP, then focused on CRM applications, and from 2007 onwards concentrated on information and data management.

Christoph is a senior practitioner specialized in data privacy and data protection, Master Data Management, Information & Data Governance, Data Quality, Information Strategy and Information Architecture. Christoph is also certified as an EU Data Protection Officer. He currently works as head of Data Management at the Ageas Group, where his responsibilities include data privacy and data protection, as well as information strategy, data and information governance, master and reference data management, information architecture (as part of business and enterprise architecture), data quality and metadata management. Christoph is a member of DAMA Belux and of the General Council of the Data Quality Association.

