GDPR Data Discovery

In today’s data-driven world, organizations handle vast amounts of personal information. The EU’s General Data Protection Regulation (GDPR) requires businesses to take a proactive approach to data compliance. A key part of complying with GDPR is finding sensitive data in a company’s systems, a process known as data discovery. In this article, we explore the basics of GDPR data discovery, discuss the types of sensitive data specific to GDPR, and introduce open-source tools that can assist in this process.

What is GDPR Data Discovery?

GDPR data discovery is the process of identifying, classifying, and mapping personal data across an organization’s IT infrastructure. It involves locating sensitive information stored in databases, file systems, cloud storage, and other data repositories. Data discovery aims to establish where personal data is located and who can access it.

Effective data discovery is essential for GDPR compliance as it enables organizations to:

  • Identify and catalog personal data
  • Assess potential risks and vulnerabilities
  • Implement appropriate security measures
  • Respond to data subject access requests (DSARs)
  • Demonstrate compliance to regulatory authorities
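
As a rough illustration of how a catalog of personal data supports DSAR handling, the sketch below models a data inventory as a list of records and looks up which systems a request would need to cover. The `InventoryEntry` fields, the sample entries, and both function names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryEntry:
    system: str         # e.g. a database or file share name
    location: str       # table.column, path, or bucket prefix
    data_category: str  # e.g. "email", "health"
    owner: str          # team responsible for the data


# Hypothetical inventory entries, for illustration only.
INVENTORY = [
    InventoryEntry("crm_db", "customers.email", "email", "sales"),
    InventoryEntry("crm_db", "customers.phone", "phone", "sales"),
    InventoryEntry("hr_share", "/hr/medical_leave/", "health", "hr"),
    InventoryEntry("support_logs", "tickets.body", "email", "support"),
]


def locations_for(categories, inventory=INVENTORY):
    """Return the inventory entries holding any of the given data categories."""
    wanted = set(categories)
    return [entry for entry in inventory if entry.data_category in wanted]


def systems_to_search(categories, inventory=INVENTORY):
    """Return the sorted, de-duplicated systems a DSAR search must cover."""
    return sorted({entry.system for entry in locations_for(categories, inventory)})
```

With an inventory like this, answering "where might this person's email address be stored?" becomes a lookup rather than a fresh investigation for every request.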

Sensitive Data Specific to GDPR

GDPR defines personal data as any information relating to an identified or identifiable natural person. However, some categories of personal data are particularly sensitive and require additional protection. These special categories of sensitive data include:

  • Racial or ethnic origin
  • Political opinions
  • Religious or philosophical beliefs
  • Trade union membership
  • Genetic data
  • Biometric data (for uniquely identifying a person)
  • Health data
  • Data concerning a person’s sex life or sexual orientation

Organizations must take extra precautions when processing these types of sensitive data, such as obtaining explicit consent from individuals and implementing strict access controls.
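
One simple way to start flagging special-category data is to match field names against keywords for each category. The sketch below does this in Python. The keyword lists and the `special_categories` function are assumptions made for illustration, not an official GDPR vocabulary, and a name-based check will miss poorly named fields, so it should only be treated as a first pass.

```python
import re

# Hypothetical keyword lists, one per GDPR special category.
SPECIAL_CATEGORY_KEYWORDS = {
    "racial_or_ethnic_origin": ("ethnic", "ethnicity", "race"),
    "political_opinions": ("political",),
    "religious_or_philosophical_beliefs": ("religion", "religious", "faith"),
    "trade_union_membership": ("union",),
    "genetic_data": ("genetic", "dna", "genome"),
    "biometric_data": ("biometric", "fingerprint"),
    "health_data": ("health", "diagnosis", "medical"),
    "sex_life_or_sexual_orientation": ("sexual", "sexuality"),
}


def special_categories(field_name):
    """Return the special categories suggested by a field name.

    The name is split into lowercase tokens on non-alphanumeric characters,
    so "trace_id" does not match the keyword "race". camelCase names are
    not split and would need extra handling.
    """
    tokens = set(re.split(r"[^a-z0-9]+", field_name.lower()))
    return sorted(
        category
        for category, keywords in SPECIAL_CATEGORY_KEYWORDS.items()
        if tokens & set(keywords)
    )


print(special_categories("patient_diagnosis_code"))
print(special_categories("trace_id"))
```

Fields flagged this way are candidates for manual review and stricter access controls, not a final classification.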

Where to Find Sensitive Data

Sensitive data is often scattered across many systems within an organization, which makes it challenging to locate and manage. Common places where sensitive data may reside include:

  • Structured databases (e.g., MySQL, PostgreSQL)
  • Unstructured data sources (e.g., emails, documents)
  • Cloud storage platforms (e.g., AWS S3, Google Cloud Storage)
  • Backup files and archives
  • Application logs and audit trails

To effectively discover sensitive data, organizations need to perform a thorough inventory of their data assets and map out the flow of personal information across their systems.
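
For structured databases, an inventory can begin with the schema itself. The following sketch uses Python's built-in `sqlite3` module to list columns whose names hint at personal data. SQLite stands in here for a production database such as MySQL or PostgreSQL, and the hint list, function names, and demo tables are all assumptions for the example.

```python
import sqlite3

# Hypothetical name hints; a real inventory would use a reviewed list.
PERSONAL_DATA_HINTS = ("email", "phone", "name", "address", "birth", "ssn")


def find_candidate_columns(conn):
    """Return (table, column) pairs whose column names hint at personal data."""
    candidates = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Quote the table name so unusual identifiers do not break the PRAGMA.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info returns one row per column; index 1 is the name.
        for row in conn.execute(f"PRAGMA table_info({quoted})"):
            column = row[1]
            # Substring match: broad on purpose, so expect false positives
            # such as "filename" matching the hint "name".
            if any(hint in column.lower() for hint in PERSONAL_DATA_HINTS):
                candidates.append((table, column))
    return candidates


def demo_connection():
    """Build a small in-memory database to try the scan against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, full_name TEXT, email TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
    return conn


print(find_candidate_columns(demo_connection()))
```

Column names only show where personal data is likely to be. Sampling the actual values is still needed to confirm what each column holds.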

Open-Source Tools for GDPR Data Discovery

Several open-source tools can assist organizations in their GDPR data discovery efforts. These tools provide capabilities such as data classification, pattern matching, and metadata extraction. Some popular open-source tools for data discovery include:

  1. Apache Ranger: Apache Ranger is a framework for enabling, monitoring, and managing comprehensive data security across the Hadoop platform. It provides a centralized platform for defining and enforcing fine-grained access control policies.
  2. Elasticsearch: Elasticsearch is a distributed search and analytics engine used for log analysis, full-text search, and data discovery. Its query language allows organizations to search and analyze large volumes of data quickly.
  3. Talend Open Studio for Data Quality: Talend Open Studio for Data Quality is an open-source data profiling and cleansing tool (Talend Open Studio was retired on January 31, 2024). It provides features for data discovery, data matching, and data standardization, helping organizations ensure the quality and consistency of their data.

When using these tools, it’s important to configure them according to your organization’s specific needs and data landscape. For example, you may need to define custom patterns or regular expressions to identify sensitive data unique to your industry or create specific data quality rules to validate and standardize your data.
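
To make the idea of custom patterns concrete, here is a minimal Python sketch that scans free text for email addresses and payment card numbers. The regular expressions are deliberately simple assumptions rather than exhaustive definitions, and card candidates are filtered with the Luhn checksum to reduce false positives from order numbers and similar digit strings. The function names are illustrative and do not belong to any of the tools above.

```python
import re

# Simplified patterns for illustration; real deployments need tuning.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Double every second digit, counting from the rightmost digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def scan_text(text):
    """Return (kind, match) pairs for emails and Luhn-valid card numbers."""
    findings = [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    findings += [
        ("card_number", m.group())
        for m in CARD_RE.finditer(text)
        if luhn_valid(m.group())
    ]
    return findings


sample = "Contact jane.doe@example.com, card 4111 1111 1111 1111, order 1234567890123456."
print(scan_text(sample))
```

In the sample, the 16-digit order number matches the card pattern but fails the checksum, so only the email address and the test card number are reported.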

Example: Discovering Sensitive Data in a Hadoop Cluster

Let’s consider an example scenario where an organization wants to use Apache Ranger to discover and protect sensitive data stored in a Hadoop cluster. To begin, they would need to set up Apache Ranger and integrate it with their Hadoop environment.

Once Apache Ranger is installed and configured, the organization can define policies to classify and label sensitive data. For example, they can create a policy that labels the column containing credit card numbers as “PCI Sensitive.” Here’s a simplified example of such a policy definition:

{
  "policyName": "Credit Card Policy",
  "resources": {
    "database": {
      "values": ["finance"],
      "isExcludes": false,
      "isRecursive": false
    },
    "table": {
      "values": ["transactions"],
      "isExcludes": false,
      "isRecursive": false
    },
    "column": {
      "values": ["credit_card_number"],
      "isExcludes": false,
      "isRecursive": false
    }
  },
  "policyLabels": ["PCI Sensitive"],
  "description": "Policy to classify credit card numbers as sensitive"
}

In this policy, Apache Ranger is configured to tag the “credit_card_number” column in the “transactions” table of the “finance” database as “PCI Sensitive.” This classification helps identify sensitive data and enables the organization to apply appropriate access controls and security measures.

With the policy in place, Apache Ranger will continuously monitor access to the specified resources and enforce the defined policies. It can generate reports and audit trails, providing visibility into who is accessing sensitive data and helping demonstrate compliance with GDPR requirements.
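
The kind of visibility an audit trail provides can be sketched in a few lines of Python. The record format below is a hypothetical, simplified stand-in and not Apache Ranger's actual audit schema. The function counts how often each user was allowed or denied access to resources marked as sensitive.

```python
from collections import Counter

# Hypothetical, simplified audit records; real audit exports use their own schema.
AUDIT_RECORDS = [
    {"user": "alice", "resource": "finance/transactions/credit_card_number", "result": "ALLOWED"},
    {"user": "alice", "resource": "finance/transactions/credit_card_number", "result": "ALLOWED"},
    {"user": "bob", "resource": "finance/transactions/credit_card_number", "result": "DENIED"},
    {"user": "carol", "resource": "finance/transactions/amount", "result": "ALLOWED"},
]

# Resources classified as sensitive, matching the example policy's column.
SENSITIVE_RESOURCES = {"finance/transactions/credit_card_number"}


def sensitive_access_summary(records, sensitive=SENSITIVE_RESOURCES):
    """Count accesses to sensitive resources per (user, result) pair."""
    counts = Counter()
    for record in records:
        if record["resource"] in sensitive:
            counts[(record["user"], record["result"])] += 1
    return dict(counts)


for (user, result), count in sorted(sensitive_access_summary(AUDIT_RECORDS).items()):
    print(f"{user}: {result} x{count}")
```

A summary like this shows at a glance who is touching sensitive columns and highlights denied attempts that may deserve follow-up.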

Summary and Conclusion

GDPR data discovery is a critical process for organizations striving to achieve data compliance. By identifying and locating sensitive data within their systems, businesses can take the necessary steps to protect personal information and meet GDPR requirements.

We discussed the importance of data discovery, the types of sensitive data specific to GDPR, and where this data can typically be found. We also introduced three open-source tools that can help with data discovery: Apache Ranger, Elasticsearch, and Talend Open Studio for Data Quality.

Remember, data discovery is an ongoing process that requires regular reviews and updates as an organization’s data landscape evolves. By combining good data discovery practices with the right tools, organizations can strengthen their data governance, reduce risks, and build customer trust.

DataSunrise: User-Friendly and Flexible Tools for Data Discovery and Compliance

Open-source security tools may lack regular updates, comprehensive support, and extensive documentation compared to commercial solutions. They often require more technical expertise to configure and maintain effectively, which can be challenging for organizations with limited resources or technical skills.

DataSunrise offers a comprehensive suite of tools for database security, data discovery (including OCR), and compliance. With its user-friendly interface and flexible configuration options, DataSunrise empowers organizations to effectively discover, protect, and govern their sensitive data.

To find out how DataSunrise can help your organization comply with GDPR and improve data security, check out our online demo. Our experts will happily showcase the features of DataSunrise and demonstrate how it can be tailored to your specific needs.

