Tackle data privacy with an intelligent data catalog

Data breaches have far-reaching consequences. They carry a significant financial cost in lost business, fines, and remediation, averaging 3.92 million USD according to a study by the Ponemon Institute. Their impact on an organization's reputation can last for years. An organization's first step in protecting itself against breaches is identifying the personal data it needs to safeguard.

Personal data protection regulations require that entities which collect an individual’s data are able to identify it, protect it, and use it only for the purposes for which it was collected. Enterprises collect a tremendous amount of data from a variety of sources, and any of these data sources could potentially contain personal data. Data is often relocated for warehousing, reporting, analytics, storage, testing, and application use, so it could potentially be copied multiple times over, resulting in proliferation of personal data across the enterprise. Gartner predicts that by the end of 2020, the backup and archiving of personal data will represent the largest area of privacy risk for 70 percent of organizations, up from 10 percent in 2018.

To understand how much personal data the enterprise holds, it is important to examine its entire data landscape. Periodic re-evaluation is necessary to mitigate privacy risks. Enterprises need to protect personal data and make sure all regulatory requirements for its lifecycle and correct usage are met. To achieve this, every possible data store must be examined to determine whether it contains personal data. This operation needs to be done at scale to cover millions of data assets, and it must be repeatable with confidence.

The following is a three-step process to discover and protect your sensitive data:
1. Create a glossary: A glossary contains terms and their definitions to ensure there is clarity around what personal data is, what characteristics could make data personal, and how to identify it. A glossary should be a live, continuously updated document that keeps up with changes to existing regulations and with new regulations an enterprise must adhere to.
2. Identify patterns: Common patterns that represent potential personal data should be documented. These are then used to classify data and match it to a term from the glossary created in the first step.
3. Tag assets: Use the taxonomy and the common patterns to connect a term in the glossary with a physical asset. For privacy regulations, it is imperative that every data store is cataloged and tagged to denote if it contains personal data.
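
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the glossary entries, the regular expressions, the `Column` class, and the 0.8 match threshold are all hypothetical and are not taken from any particular product. Real classifiers are considerably more robust than these simple patterns.

```python
import re
from dataclasses import dataclass, field

# Step 1: a glossary maps a business term to its definition.
# These two entries are hypothetical examples.
GLOSSARY = {
    "Email Address": "An address that can be used to contact an individual by email.",
    "US Social Security Number": "A nine-digit identifier issued to individuals in the US.",
}

# Step 2: patterns that suggest a value may be an instance of a glossary term.
# The regexes are deliberately simple and will miss some real-world values.
PATTERNS = {
    "Email Address": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "US Social Security Number": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


@dataclass
class Column:
    """A physical asset (here, one column) with a sample of its values."""
    name: str
    sample_values: list
    tags: set = field(default_factory=set)


def tag_column(column, threshold=0.8):
    """Step 3: tag the column with each glossary term whose pattern matches
    at least `threshold` of the non-empty sample values."""
    values = [v for v in column.sample_values if v]
    if not values:
        return column
    for term, pattern in PATTERNS.items():
        matches = sum(1 for v in values if pattern.match(v))
        if matches / len(values) >= threshold:
            column.tags.add(term)
    return column


contact = tag_column(Column("contact", ["a@example.com", "b.c@example.org", ""]))
print(contact.name, sorted(contact.tags))
```

Matching on a fraction of sampled values, rather than requiring every value to match, is one way to tolerate dirty data; the right threshold depends on the data store and on how costly a missed tag is.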

However, the process of connecting a term to a physical asset is labor-intensive and time-consuming, and it needs to be repeated each time a new data store is added to the enterprise’s data landscape. When the taxonomy is updated in response to a regulation, the ability to perform updates quickly is key to enabling an enterprise to respond promptly to compliance requests.

IBM Watson Knowledge Catalog services on Cloud Pak for Data address this problem by using parallel processing to scan large numbers of assets, combining a rule-based and a cognitive approach to automate the task of connecting a term to a physical asset. Data stewards serve as subject matter experts and have the final say, and the corrections they provide are used to improve the reliability of the cognitive approach. An organization can use Watson Knowledge Catalog to scan large numbers of assets, catalog them, and make only the non-sensitive assets available to its data users. Thanks to its business-user-friendly data catalog and data shaping capabilities, it also streamlines the use of data by data scientists while helping ensure that no sensitive data is used. In a well-governed organization, the catalog plays a vital role in cataloging and governing models as well. Watson Knowledge Catalog supports the governance of AI models by governing the data used to create them.
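
As a rough illustration of the general idea of scanning in parallel and then letting a steward override the automated result, here is a minimal Python sketch. It does not use or represent the Watson Knowledge Catalog API: the function names, the single regex rule, and the asset names are all hypothetical, and the cognitive (machine learning) part is not modeled at all.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# A single hypothetical rule: text that looks like a US Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scan_asset(asset):
    """Return (asset name, suggested tag or None) for one asset."""
    name, sample_text = asset
    suggestion = "personal data" if SSN_PATTERN.search(sample_text) else None
    return name, suggestion


def scan_assets(assets, max_workers=4):
    """Scan many assets concurrently and collect the automated suggestions."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(scan_asset, assets))


def apply_steward_review(suggestions, overrides):
    """Steward decisions take precedence over automated suggestions."""
    return {name: overrides.get(name, tag) for name, tag in suggestions.items()}


assets = [
    ("customers.csv", "Jane, 123-45-6789"),
    ("products.csv", "widget, 9.99"),
    ("test_fixture.csv", "000-00-0000 placeholder"),
]
suggestions = scan_assets(assets)
# The steward knows the fixture holds placeholder values, not real personal data.
final_tags = apply_steward_review(suggestions, {"test_fixture.csv": None})
print(final_tags)
```

Keeping the steward overrides separate from the automated suggestions makes it possible to rescan assets later without losing human decisions.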
Learn more about IBM Watson Knowledge Catalog.

Sundari Vorunganti is Development Manager Cloud Pak for Data at IBM.